Bioinformatics for integrating multi-omic data for cancer research


Research background

Cancer is a complex disease that can arise from dysregulation through multiple mechanisms. No single level of genomic data therefore fully explains tumor behavior, because many variations occur within or between levels of the biological system, such as copy number variants, DNA methylation, alternative splicing, miRNA regulation, and post-translational modification.

Data at multiple molecular levels, generated from all 'omic' dimensions from genome to phenome, have recently become more available. The Cancer Genome Atlas (TCGA) is a collaborative initiative to improve understanding of cancer using existing large-scale whole-genome technologies. TCGA provides both opportunities and challenges for developing computational methods that study cancers using multiple types of biological data, which can reveal different aspects and levels of biological system function.

TCGA: Connecting multiple sources, experiments, and data types

Multi-omic data types from TCGA
figure from Nature 455:1061-1068 (2008)


Given multiple levels of data, information passed from one level to another may offer hints that uncover unknown biological knowledge. Integrating different levels of data can therefore aid knowledge extraction by drawing an integrative conclusion from many pieces of information collected from diverse types of genomic data. Future work is expected to focus more on how to use inter-relation information, that is, the relations between different levels: from the genome to the epigenome, transcriptome, and proteome, and further to the phenome.



  • Genomic data comparison: Which data are more informative?
    Various types of genomic data from cancer patients have recently become available thanks to collaborative initiatives for better understanding cancer. With genomic and clinical data now abundant, bioinformaticians often face the question of which data are more informative. For wet-lab analysts, the question concerns data generation, which demands considerable cost, time, and experienced facilities. For dry-lab analysts, it concerns selecting an appropriate data source for more accurate prediction while avoiding unnecessary waste of computational resources. To provide preliminary insight into this question, this study compares different types of genomic data using semi-supervised learning, a machine learning approach that exploits both labeled and unlabeled samples.

Graph-based Semi-Supervised Learning
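
As a rough illustration of graph-based semi-supervised learning, the sketch below scores every patient by solving (I + mu L) f = y, where L is the Laplacian of a patient similarity graph. This is one common formulation and is not necessarily the exact model used in the study; the Gaussian similarity, the parameters `sigma` and `mu`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Dense Gaussian similarity between samples (rows of X); no self-loops."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def graph_ssl_scores(W, y, mu=1.0):
    """Solve (I + mu * L) f = y, with L = D - W the graph Laplacian.
    y holds +1 / -1 for labeled samples and 0 for unlabeled ones."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.solve(np.eye(len(y)) + mu * L, y)

# Toy stand-in for one genomic data type: 20 patients x 5 features,
# two well-separated groups, only one labeled patient per group.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (10, 5)), rng.normal(2.0, 0.5, (10, 5))])
y = np.zeros(20)
y[0], y[10] = -1.0, 1.0

f = graph_ssl_scores(rbf_similarity(X), y)
predicted = np.sign(f)  # predicted class for every patient, labeled or not
```

To compare data types in this spirit, the same procedure could be run once per data type, each with its own similarity graph, and the predictions evaluated against held-out labels.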


  • Synergistic effect of different levels of genomic data for cancer clinical outcome prediction
    There have been many attempts at cancer clinical outcome prediction using data from a number of molecular layers of the biological system. Despite these efforts, it remains difficult to elucidate cancer phenotypes, because the cancer genome is neither simple nor independent but rather complicated and dysregulated by multiple molecular mechanisms. Recently, heterogeneous types of genomic data generated from all 'omic' dimensions from genome to phenome have become available: for instance, copy number variants at the genome level, DNA methylation at the epigenome level, and gene expression and microRNA at the transcriptome level. In this study, we propose an integrated framework that uses multiple layers of heterogeneous genomic data to predict clinical outcomes in brain cancer (glioblastoma multiforme, GBM) and ovarian cancer (serous cystadenocarcinoma, OV).

    Multiple layers of genomic data in the biological system, from genome, epigenome, transcriptome, and proteome to phenome


  • Combining multi-layers of genomic data and inter-relationship
    The limitation of the previous study is that it integrated multiple layers of genomic data for cancer clinical outcome prediction without considering the inter-relationship information between them. Possible relationships exist between sample features (attributes) belonging to different layers of genomic data, such as 'miRNA-target genes,' 'copy number alteration region-genes located in the alteration region,' and 'DNA methylation site-specific genes regulated by promoter regions.' When integrating multiple genomic data, a framework should therefore be able to incorporate the inter-relationships between sample features belonging to different layers of the biological system. This study is divided into three sub-studies:
           A. miRNA – mRNA dataset
           B. Copy number alteration – mRNA dataset
           C. DNA methylation – mRNA dataset

    Inter-relationship between different levels of genomic data

    Schematic overview of combining different levels of genomic data and inter-relationship
    (miRNA - gene expression)
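
One simple way to picture how an inter-relationship could enter the feature space is sketched below for the miRNA and mRNA case. It is a hypothetical illustration only and not the study's method: the binary `targets` matrix, the averaging rule, and all data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_mirna, n_genes = 6, 3, 5

# Synthetic expression matrices (rows = patients); stand-ins for real data.
mirna = rng.normal(size=(n_samples, n_mirna))
mrna = rng.normal(size=(n_samples, n_genes))

# Hypothetical miRNA-target relation: targets[i, j] = 1 if miRNA i targets gene j.
targets = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
])

# Inter-relationship features: per gene, the mean expression of the miRNAs
# that target it. Genes with no targeting miRNA (the last gene here) get 0.
n_regulators = targets.sum(axis=0)
regulation = (mirna @ targets) / np.maximum(n_regulators, 1)

# Integrated representation: both layers plus the relation-derived block.
integrated = np.hstack([mirna, mrna, regulation])
```

The same pattern would apply to the other two sub-studies by swapping in a relation matrix between copy number alteration regions and genes, or between DNA methylation sites and genes.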

  • Kernel-based integration for survival prediction with multi-layers of heterogeneous genomic data
    Another limitation of the previous work described above is that it considered only binary class prediction problems, for example short-term vs. long-term survival prediction. Although survival data are continuous, we categorized them into two classes, short-term and long-term survival, and then built a classification model. A new integrative framework is therefore needed to predict continuous values, e.g. survival time, using multiple layers of heterogeneous genomic data.
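
A minimal sketch of kernel-based integration for a continuous outcome follows, assuming one Gaussian kernel per genomic layer, fixed equal layer weights, and kernel ridge regression as the predictor. These choices, the layer sizes, and the synthetic survival times are assumptions for illustration. Real survival data are often censored, which this sketch ignores.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian kernel between rows of A and rows of B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combined_kernel(layers_a, layers_b, weights):
    """Nonnegative weighted sum of per-layer kernels (still a valid kernel)."""
    return sum(w * rbf_kernel(a, b) for w, a, b in zip(weights, layers_a, layers_b))

rng = np.random.default_rng(2)
n_train, n_test = 30, 5
dims = {"copy_number": 8, "methylation": 6, "expression": 10}  # hypothetical sizes
train = [rng.normal(size=(n_train, d)) for d in dims.values()]
test = [rng.normal(size=(n_test, d)) for d in dims.values()]
survival_months = rng.uniform(1.0, 60.0, size=n_train)  # synthetic, uncensored

weights = [1 / 3, 1 / 3, 1 / 3]  # equal layer weights, an assumption
K_train = combined_kernel(train, train, weights)
K_test = combined_kernel(test, train, weights)

# Kernel ridge regression on centered survival times (no explicit intercept).
lam = 1.0
y_mean = survival_months.mean()
alpha = np.linalg.solve(K_train + lam * np.eye(n_train), survival_months - y_mean)
predicted_months = K_test @ alpha + y_mean
```

Summing kernels keeps each layer in its own feature space while still producing a single model, so layers with different dimensions and scales can be combined without concatenating raw features.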