Scientific Paper Sessions

S1. Multi-Omic Application

Room: Regency Ballroom
Date: Sunday, Oct. 16, 11:00 - 12:15
S1-1: N-of-1-pathways MixEnrich: advancing precision medicine via single-subject analysis in discovering dynamic changes of transcriptomes

Qike Li1-4,§, A. Grant Schissler1-4,§, Vincent Gardeux1-3, Ikbel Achour1-3, Colleen Kenost1-3, Joanne Berghout1-3, Haiquan Li1-3,*, Hao Helen Zhang4,5,*, Yves A. Lussier1-4, 6-7,*

1 Center for Biomedical Informatics and Biostatistics, The University of Arizona, Tucson, AZ, 85721, USA
2 Bio5 Institute, The University of Arizona, Tucson, AZ, 85721, USA
3 Department of Medicine, The University of Arizona, Tucson, AZ, 85721, USA
4 Graduate Interdisciplinary Program in Statistics, The University of Arizona, Tucson, AZ, 85721, USA
5 Department of Mathematics, The University of Arizona, Tucson, AZ, 85721, USA
6 University of Arizona Cancer Center, The University of Arizona, Tucson, AZ, 85721, USA
7 Institute for Genomics and Systems Biology, The University of Chicago, IL 60637, USA
§ Equal contribution


Abstract
Background: Transcriptome analytic tools are commonly used across patient cohorts to develop drugs and predict clinical outcomes. However, as precision medicine pursues more accurate and individualized treatment decisions, these methods are not designed to address single-patient transcriptome analyses. We previously developed and validated the N-of-1-pathways framework using two methods, Wilcoxon and Mahalanobis Distance (MD), for personal transcriptome analysis derived from a pair of samples of a single patient. Although both methods uncover concordantly dysregulated pathways, they are not designed to detect dysregulated pathways with up- and down-regulated genes (bidirectional dysregulation), which are ubiquitous in biological systems.
Results: We developed N-of-1-pathways MixEnrich, a mixture model followed by a gene set enrichment test, to uncover bidirectionally and concordantly dysregulated pathways one patient at a time. We assess its accuracy in a comprehensive simulation study and in an RNA-Seq data analysis of head and neck squamous cell carcinomas (HNSCCs). In the presence of bidirectionally dysregulated genes in the pathway or of high background noise, MixEnrich substantially outperforms previous single-subject transcriptome analysis methods, both in the simulation study and in the HNSCC data analysis (ROC curves; higher true positive rates; lower false positive rates). Bidirectionally and concordantly dysregulated pathways uncovered by MixEnrich in each patient largely overlapped with the quasi-gold standard compared to other single-subject and cohort-based transcriptome analyses.
Conclusion: The greater performance of MixEnrich presents an advantage over previous methods to meet the promise of providing accurate personal transcriptome analysis to support precision medicine at point of care.
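The enrichment step described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation: the mixture model is stood in for by a simple absolute log fold change cutoff (an assumption made only for illustration), and the enrichment test is a generic one-sided hypergeometric test. The function names, the threshold, and the gene identifiers are all hypothetical. Because genes are flagged on absolute change, up- and down-regulated genes both count toward a pathway.

```python
from math import comb


def flag_dysregulated(log_fold_changes, threshold=1.0):
    """Flag genes whose absolute log fold change exceeds a cutoff.

    Assumption: this cutoff replaces the mixture model of the abstract,
    which is not specified here. Direction of change is ignored.
    """
    return {g for g, lfc in log_fold_changes.items() if abs(lfc) > threshold}


def enrichment_p_value(pathway_genes, dysregulated_genes, all_genes):
    """One-sided hypergeometric p-value for over-representation.

    Probability of seeing at least the observed number of dysregulated
    genes in the pathway if pathway membership were random.
    """
    universe = set(all_genes)
    pathway = set(pathway_genes) & universe
    hits = set(dysregulated_genes) & universe
    n_universe, n_hits, n_pathway = len(universe), len(hits), len(pathway)
    observed = len(pathway & hits)
    tail = sum(
        comb(n_hits, i) * comb(n_universe - n_hits, n_pathway - i)
        for i in range(observed, min(n_hits, n_pathway) + 1)
    )
    return tail / comb(n_universe, n_pathway)


# Hypothetical single-patient log fold changes (tumor vs. paired normal).
lfc = {f"g{i}": 0.0 for i in range(10)}
lfc.update({"g0": 2.0, "g1": -1.5, "g2": 1.2})
flagged = flag_dysregulated(lfc)
p = enrichment_p_value({"g0", "g1", "g2"}, flagged, lfc)
```

In this toy case the pathway contains one up-regulated and two down- or up-regulated genes, all of which are flagged, so the pathway is scored as enriched even though its genes move in different directions.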


S1-2: An Inference Method from Multi-Layered Structure of Omics

Myungjun Kim1, Yonghyun Nam1, Hyunjung Shin1,*

1 Department of Industrial Engineering, Ajou University, Wonchun-dong, Yeongtong-gu, Suwon 443-749, South Korea

Abstract
A biological system is a multi-layered structure of omics, comprising the genome, epigenome, transcriptome, metabolome, proteome, etc., and can be further extended to clinical/medical layers such as the diseasome, drugs, and symptoms. One advantage of omics is that an unknown component or its trait can be inferred from known omics components. The component can be inferred from those in the same omics layer or from those in different layers. Implementing this inference process requires an algorithm that can be applied to a multi-layered complex system. In this study, we develop a semi-supervised learning algorithm for this purpose. To verify the validity of the inference, we applied it to the problem of predicting disease co-occurrence using a two-layered network composed of a symptom layer and a disease layer. The symptom-disease layered network achieved an AUC of 0.74, a noticeable improvement over the AUC of 0.59 obtained with the single-layered disease network. If extended to the whole layered structure of omics, the proposed method is expected to produce more promising results.
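To make the idea of inference across layers concrete, the following is a minimal sketch of generic graph-based semi-supervised learning (iterative label propagation) on a small two-layer network. It is not the algorithm proposed in the paper, whose details are not given in the abstract. The node names, edge weights, and the damping parameter `alpha` are hypothetical.

```python
def propagate_labels(adjacency, seed_labels, alpha=0.8, iterations=100):
    """Iterate f <- alpha * (neighbour average of f) + (1 - alpha) * y.

    adjacency: {node: {neighbour: weight}}, assumed symmetric.
    seed_labels: {node: 1.0} for the known (labelled) nodes.
    Returns a score per node; higher means closer to the seeds.
    """
    nodes = list(adjacency)
    y = {v: float(seed_labels.get(v, 0.0)) for v in nodes}
    f = dict(y)
    for _ in range(iterations):
        updated = {}
        for v in nodes:
            neighbours = adjacency[v]
            degree = sum(neighbours.values())
            spread = (
                sum(w * f[u] for u, w in neighbours.items()) / degree
                if degree
                else 0.0
            )
            updated[v] = alpha * spread + (1 - alpha) * y[v]
        f = updated
    return f


# Two layers: diseases (D*) and symptoms (S*), linked by inter-layer edges.
network = {
    "D1": {"S1": 1.0},
    "D2": {"S1": 1.0},
    "D3": {"S2": 1.0},
    "S1": {"D1": 1.0, "D2": 1.0},
    "S2": {"D3": 1.0},
}
scores = propagate_labels(network, {"D1": 1.0})
```

Here D2 receives a positive score because it shares symptom S1 with the seed disease D1, while D3, which has no path to D1, stays at zero. The symptom layer is what carries the signal between the two diseases.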


S1-3: Identification of interactions between miRNA and DNA methylation associated with gene expression as potential prognostic markers in bladder cancer

Manu Shivakumar1,§, Younghee Lee2,§, Lisa Bang1, Tullika Garg3, Kyung-ah Sohn4,*, Dokyoon Kim1,5,*

1 Department of Biomedical & Translational Informatics, Geisinger Health System, Danville, Pennsylvania, USA
2 Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, Utah, USA
3 Mowad Urology Department, Geisinger Health System, Danville, Pennsylvania, USA
4 Department of Software and Computer Engineering, Ajou University, Suwon, South Korea
5 The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, USA
§ Equal contribution


Abstract
One of the fundamental challenges in cancer is to detect the regulators of gene expression changes during cancer progression. Through transcriptional silencing of critical cancer-related genes, epigenetic changes such as DNA methylation play a crucial role in cancer. In addition, miRNA, another major component of the epigenome, is a regulator at the post-transcriptional level that modulates transcriptome changes. However, the mechanistic role of synergistic interactions between DNA methylation and miRNA as epigenetic regulators of transcriptomic changes, and its association with clinical outcomes such as survival, have remained largely unexplored in cancer. In this study, we propose an integrative framework to identify epigenetic interactions between methylation and miRNA associated with transcriptomic changes. To test the utility of the proposed framework, we analyzed the bladder cancer data set from The Cancer Genome Atlas (TCGA), including DNA methylation, miRNA expression, and gene expression data. First, we found 120 genes associated with interactions between the two epigenomic components. Then, 11 significant epigenetic interactions between miRNA and methylation, which target E2F3, CCND1, UTP6, CDADC1, SLC35E3, METRNL, TPCN2, NACC2, VGLL4, and PTEN, were found to be associated with survival. In summary, exploration of TCGA bladder cancer data identified epigenetic interactions that are associated with survival as potential prognostic markers in bladder cancer. Given the importance and prevalence of these interactions of epigenetic events in bladder cancer, it is timely to understand further how different epigenetic components interact and influence each other.




S2. Disease Genomics

Room: Terrace Ballroom
Date: Sunday, Oct. 16, 11:00 - 12:15
S2-1: Knowledge-driven binning approach for rare variant association analysis: Application to neuroimaging biomarkers in Alzheimer's disease

Dokyoon Kim1,2, Anna O. Basile2, Lisa Bang1, Emrin Horgusluoglu4, Seunggeun Lee3, Marylyn D. Ritchie1,2, Andrew J. Saykin4, Kwangsik Nho4,*, for the Alzheimer's Disease Neuroimaging Initiative (ADNI)

1 Department of Biomedical & Translational Informatics, Geisinger Health System, Danville, Pennsylvania, USA
2 The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania, USA
3 Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, USA
4 Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA


Abstract
Background: Rapid advancement of next generation sequencing technologies such as whole genome sequencing (WGS) has facilitated the search for genetic factors that influence disease risk in the field of human genetics. To identify rare variants associated with human diseases or traits, an efficient genome-wide binning approach is needed. In this study we developed a novel biological knowledge-based binning approach for rare-variant association analysis and then applied the approach to structural neuroimaging endophenotypes related to late-onset Alzheimer's disease (LOAD).
Methods: For rare-variant analysis, we used the knowledge-driven binning approach implemented in Bin-KAT, an automated tool that provides 1) binning/collapsing methods for multi-level variant aggregation with a flexible, biologically informed binning strategy and 2) an option of performing unified collapsing and statistical rare variant analyses in one tool. A total of 750 non-Hispanic Caucasian participants from the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort who had both WGS data and magnetic resonance imaging (MRI) scans were used in this study. Mean bilateral cortical thickness of the entorhinal cortex extracted from MRI scans was used as an AD-related neuroimaging endophenotype. SKAT was used for a genome-wide gene- and region-based association analysis of rare variants (minor allele frequency (MAF) < 0.05), and potential confounding factors for entorhinal cortex thickness (age, gender, years of education, intracranial volume (ICV), and MRI field strength) were used as covariates. Significant associations were determined using FDR adjustment for multiple comparisons.
Results: Our knowledge-driven binning approach identified 16 functional exonic rare variants in FANCC significantly associated with entorhinal cortex thickness (FDR-corrected p-value < 0.05). In addition, the approach identified 7 evolutionarily conserved regions, which were mapped to FAF1, RFX7, LYPLAL1, and GOLGA3, significantly associated with entorhinal cortex thickness (FDR-corrected p-value < 0.05). In further analysis, the functional exonic rare variants in FANCC were also significantly associated with hippocampal volume and cerebrospinal fluid (CSF) Aβ1-42 (p-value < 0.05).
Conclusions: Our novel binning approach identified rare variants in FANCC as well as 7 evolutionarily conserved regions significantly associated with a LOAD-related neuroimaging endophenotype. FANCC (Fanconi anemia complementation group C) has been shown to modulate TLR and p38 MAPK-dependent expression of IL-1β in macrophages. Our results warrant further investigation in a larger independent cohort and demonstrate that the biological knowledge-driven binning approach is a powerful strategy to identify rare variants associated with AD and other complex diseases.
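The binning/collapsing step described in the Methods can be sketched as follows. This is an illustration under stated assumptions, not Bin-KAT itself: the record layout (`id`, `maf`, `bin`), the function names, and the simple per-bin allele count are hypothetical, and the association test (SKAT in the abstract) is not implemented here.

```python
from collections import defaultdict


def bin_rare_variants(variants, maf_threshold=0.05):
    """Group rare variants (MAF below the threshold) by their assigned bin.

    variants: iterable of dicts with hypothetical keys "id", "maf", "bin",
    where "bin" is a knowledge-derived unit such as a gene or a conserved region.
    """
    bins = defaultdict(list)
    for variant in variants:
        if variant["maf"] < maf_threshold:
            bins[variant["bin"]].append(variant["id"])
    return dict(bins)


def collapse_by_bin(bins, genotypes):
    """Sum minor-allele counts per bin for every sample.

    genotypes: {sample: {variant_id: minor allele count (0, 1, or 2)}}.
    This is a plain collapsing score, used only to show the aggregation.
    """
    return {
        sample: {b: sum(calls.get(v, 0) for v in ids) for b, ids in bins.items()}
        for sample, calls in genotypes.items()
    }


variants = [
    {"id": "v1", "maf": 0.01, "bin": "GENE_A"},
    {"id": "v2", "maf": 0.20, "bin": "GENE_A"},  # common, excluded
    {"id": "v3", "maf": 0.03, "bin": "GENE_A"},
    {"id": "v4", "maf": 0.04, "bin": "REGION_B"},
]
genotypes = {"s1": {"v1": 1, "v2": 2, "v3": 1}, "s2": {"v4": 2}}
bins = bin_rare_variants(variants)
scores = collapse_by_bin(bins, genotypes)
```

Each bin then becomes one unit of testing, which reduces the number of tests compared with analyzing every rare variant on its own.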


S2-2: Genotype Based Disease Similarity Matrix from Uniqueness of Shared Genes

Matthew Carson1, Cong Liu2, Yao Lu3, Caiyan Jia4, Hui Lu2,3,5

1 Northwestern University, USA
2 Department of Bioengineering, University of Illinois at Chicago, USA
3 Center for Biomedical Informatics, Shanghai Children's Hospital, China
4 Department of Computer Science, Beijing Jiaotong University, China
5 SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, China


Abstract
Diseases can be related to each other through shared causes or symptoms. This has long been exploited to treat similar diseases with similar therapies and drugs. Researchers have been exploring disease similarities based on shared genotype or phenotype data. Here we attempt to improve the similarity search by incorporating the uniqueness of the genes shared among different diseases: we construct a disease similarity matrix based on shared genes and their uniqueness as defined in OMIM and DORIF annotation. By further investigating the resulting clusters, we identified several interesting links, such as one between cancer and malaria. Our similarity matrix can be used to identify potential disease relationships and to motivate further studies into the elucidation of causal mechanisms in diseases.
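One way to picture "uniqueness of shared genes" is an inverse-frequency weight: a gene annotated to many diseases says little about any pair that shares it, while a gene annotated to few diseases says more. The sketch below uses that idea with a weighted Jaccard score. Both the weighting formula and the score are assumptions chosen for illustration; the abstract does not state how uniqueness is actually defined. Disease and gene names are placeholders.

```python
from math import log


def gene_uniqueness(disease_genes):
    """Weight each gene by log(total diseases / diseases annotated with it).

    Assumption: an IDF-style weight stands in for the paper's definition.
    A gene annotated to every disease gets weight 0.
    """
    n_diseases = len(disease_genes)
    counts = {}
    for genes in disease_genes.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: log(n_diseases / c) for gene, c in counts.items()}


def disease_similarity(a, b, disease_genes, weights):
    """Weighted Jaccard: weight of shared genes over weight of all genes."""
    genes_a, genes_b = set(disease_genes[a]), set(disease_genes[b])
    union = sum(weights[g] for g in genes_a | genes_b)
    shared = sum(weights[g] for g in genes_a & genes_b)
    return shared / union if union else 0.0


annotations = {
    "disease_A": ["COMMON", "RARE_1"],
    "disease_B": ["COMMON", "RARE_1"],
    "disease_C": ["COMMON", "RARE_2"],
    "disease_D": ["COMMON", "RARE_3"],
}
weights = gene_uniqueness(annotations)
sim_ab = disease_similarity("disease_A", "disease_B", annotations, weights)
sim_ac = disease_similarity("disease_A", "disease_C", annotations, weights)
```

In this toy annotation, diseases A and C share only the ubiquitous gene, so their similarity is zero, whereas A and B also share a gene found in no other disease and score much higher. An unweighted count of shared genes would not separate the two cases as sharply.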


S2-3: Association analysis of rare variants near the APOE region with CSF and neuroimaging biomarkers of Alzheimer's disease

Kwangsik Nho1,3,12,*, Sungeun Kim1,3,12, Emrin Horgusluoglu2, Shannon L. Risacher1,12, Li Shen1,3,12, Dokyoon Kim13, Seunggeun Lee14, Tatiana Foroud1,2,3,12, Leslie M. Shaw4, John Q. Trojanowski4, Paul S. Aisen5, Ronald C. Petersen6, Clifford R. Jack, Jr.7, Michael W. Weiner8,9, Robert C. Green10, Arthur W. Toga11, and Andrew J. Saykin1,2,3,12,*, for the Alzheimer's Disease Neuroimaging Initiative (ADNI)

1 Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, IN, USA
2 Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
3 Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, USA
4 Department of Pathology and Laboratory Medicine, University of Pennsylvania School of Medicine, Philadelphia, PA, USA
5 Department of Neuroscience, University of California-San Diego, San Diego, CA, USA
6 Department of Neurology, Mayo Clinic Minnesota, Rochester, MN, USA
7 Department of Radiology, Mayo Clinic Minnesota, Rochester, MN, USA
8 Departments of Radiology, Medicine, and Psychiatry, University of California-San Francisco, San Francisco, CA, USA
9 Department of Veterans Affairs Medical Center, San Francisco, CA, USA
10 Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
11 The Institute for Neuroimaging and Informatics and Laboratory of Neuro Imaging, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA
12 Indiana Alzheimer's Disease Center, Indiana University School of Medicine, Indianapolis, IN, USA
13 Department of Biomedical and Translational Informatics, Geisinger Health System, Danville, PA, USA
14 Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, MI, USA


Abstract
Background: The APOE ε4 allele is the most significant common genetic risk factor for late-onset Alzheimer's disease (LOAD). The region surrounding APOE on chromosome 19 has also shown consistent association with LOAD. However, no common variants in the region remain significant after adjusting for APOE genotype. We report a rare variant association analysis of genes in the vicinity of APOE with cerebrospinal fluid (CSF) and neuroimaging biomarkers of LOAD.
Methods: Whole genome sequencing (WGS) was performed on 817 blood DNA samples from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Sequence data from 757 non-Hispanic Caucasian participants was used in the present analysis. We extracted all rare variants (MAF (minor allele frequency) < 0.05) within a 312 kb window in APOE’s vicinity encompassing 12 genes. We assessed CSF and neuroimaging (MRI and PET) biomarkers as LOAD-related quantitative endophenotypes. Gene-based analyses of rare variants were performed using the optimal Sequence Kernel Association Test (SKAT-O).
Results: A total of 3,334 rare variants (MAF < 0.05) were found within the APOE region. Among them, 72 rare non-synonymous variants were observed. Eight genes spanning the APOE region were significantly associated with CSF Aβ1-42 (p < 1.0 × 10^-3). After controlling for APOE genotype and adjusting for multiple comparisons, 4 genes (CBLC, BCAM, APOE, and RELB) remained significant. Whole-brain surface-based analysis identified highly significant clusters associated with rare variants of CBLC in the temporal lobe region including the entorhinal cortex, as well as frontal lobe regions. Whole-brain voxel-wise analysis of amyloid PET identified significant clusters in the bilateral frontal and parietal lobes showing associations of rare variants of RELB with cortical amyloid burden.
Conclusions: Rare variants within genes spanning the APOE region are significantly associated with LOAD-related CSF and neuroimaging biomarkers. These findings warrant further investigation and illustrate the role of next generation sequencing and quantitative endophenotypes in assessing rare variants which may help explain missing heritability in AD and other complex diseases.




S3. Cancer Bioinformatics

Room: Regency Ballroom
Date: Sunday, Oct. 16, 14:00 - 16:40
S3-1: Prediction of Recurrent Regulatory Mutations in Noncoding Cancer Genomes

Woojin Yang1, Hyoeun Bang1, Kiwon Jang1, Min Kyung Sung1, Jung Kyoon Choi1,*

1 Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea

Abstract
One of the greatest challenges in cancer genomics is to distinguish driver mutations from passenger mutations. Whereas recurrence is a hallmark of driver mutations, it is difficult to observe recurring noncoding mutations owing to the limited number of whole-genome sequenced samples. We therefore developed a machine learning method to predict potentially recurrent mutations. In this work, we develop a random forest classifier that aims to predict regulatory mutations that may recur by learning the features of the mutations repeatedly appearing in a given cohort. With breast cancer as a model, we profiled 35 quantitative features describing genetic and epigenetic signals at the mutation site, transcription factors affected by the mutation, and genes targeted by long-range chromatin interactions. A true set of mutations for machine learning was generated by interrogating pan-cancer genomes based on our statistical model. The performance of our random forest classifier was evaluated by cross-validation and showed an area under the curve of ~0.78. The variable importance of each feature in the classification of mutations was investigated. Chromatin accessibility at the mutation sites, the distance from the mutations to known cancer risk loci, and the role of the target genes in the regulatory or interaction network were among the most important variables in the classification. In conclusion, our method enables the characterization of recurrent regulatory mutations using a limited number of whole-genome samples and, based on this characterization, the prediction of potential driver mutations whose recurrence is not found in the given samples but is likely to be observed with additional samples.
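For readers less familiar with the reported metric, the area under the ROC curve can be read as the probability that a randomly chosen positive example (here, a recurrent mutation) receives a higher classifier score than a randomly chosen negative one. The sketch below computes it directly from that definition. The labels and scores are made up for illustration and have no connection to the study's data or classifier.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out fold: 1 = recurrent mutation, 0 = non-recurrent.
fold_labels = [1, 1, 0, 0]
fold_scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(fold_labels, fold_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. In cross-validation, a score like this is typically computed on each held-out fold and then averaged.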


S3-2: Identifying subtype-specific gene expressions explained by DNA methylation patterns in breast cancer

Garam Lee1, Lisa Bang2, So Yeon Kim1, Dokyoon Kim2,3,*, Kyung-Ah Sohn1,*

1 Department of Software and Computer Engineering, Ajou University, Suwon 16499, South Korea
2 Department of Biomedical & Translational Informatics, Geisinger Health System, Danville, PA, USA
3 The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA


Abstract
Breast cancer is a complex disease in which different genomic patterns exist depending on the subtype. Recent research shows that the multiple subtypes of breast cancer occur at different rates and that subtype plays a crucial role in treatment planning. To understand the genomic mechanisms underlying breast cancer subtypes, it is desirable to investigate the specific gene regulatory system of each subtype. In this paper, gene expression, as an intermediate phenotype, is estimated from methylation profiles to identify the impact of the epigenome on the transcriptome in breast cancer. We propose a kernel-weighted l1-regularized regression that incorporates subtype information to reveal gene regulation affected by different breast cancer subtypes. Compared with a typical method, our approach improves the prediction of gene expression levels across subtypes. We also identified subtype-specific network structures by carrying out an association study between gene expression and DNA methylation.


S3-3: Racial differences of intron retention and DNA methylation in breast cancer subtypes

Dongwook Kim1, Manu Shivakumar2, Michael Sinclair1, Youngji Lee3, Dokyoon Kim2,4,*, Younghee Lee1,*

1 Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT 84102, USA
2 Department of Biomedical & Translational Informatics, Geisinger Health System, Danville, PA, USA
3 Department of Health and Community Systems, University of Pittsburgh School of Nursing, Pittsburgh, PA 15261, USA
4 The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA


Abstract
Regulation of gene expression by DNA methylation in gene promoter regions is well-studied; however, the effects on gene expression of methylation in the gene body (i.e., exons and introns) are comparatively understudied. Recently, hyper-methylation has been implicated in the inclusion of alternatively spliced exons; moreover, exon recognition can be enhanced by recruiting the methyl-CpG-binding protein (MeCP2) to hyper-methylated sites. In this study, we examined whether the level of methylation of an intron is correlated with how frequently that intron is retained during splicing. We analyzed DNA methylation and RNA sequencing data from breast cancer tissue samples in The Cancer Genome Atlas (TCGA). In breast cancer, most novel cancer-specific mRNA isoforms are due to intron retention (IR). We found that hypo-methylation of introns is correlated with higher levels of intron expression in mRNA. In other words, the methylation level of an intron is inversely correlated with its retention in mRNA transcripts from the gene in which it is located. Furthermore, we observed a significant racial difference in the methylation level of retained introns: in samples from African-American donors, retained introns were not only less methylated than in samples from Caucasian donors, but also more highly expressed. Our findings have translational implications for improving diagnosis, prognosis, and treatment for breast cancer. Understanding racial epigenetic differences and their correlation with breast cancer is an important step toward achieving personalized cancer care. Moreover, IR is not limited to breast cancer; transcriptomes from many different types of cancer show a higher incidence of IR compared to healthy controls.


S3-4: Identification of clinically relevant genes from mRNA and splicing changes of skin cutaneous melanoma

Ji Yeon Park1, Brian Y Ryu1, Chan Hee Park1, Bin Tian2 and Ju Han Kim1,*

1 Seoul National University Biomedical Informatics (SNUBI), Division of Biomedical Informatics, Seoul National University College of Medicine, Seoul, Republic of Korea
2 Department of Microbiology, Biochemistry and Molecular Genetics, Rutgers New Jersey Medical School, Newark, New Jersey, USA


Abstract
Skin cutaneous melanoma (SKCM) is a cancer of the highest mutational load, and the DNA-level aberrations have been clarified through comprehensive genome sequencing. However, the transcriptional and posttranscriptional states affected by these numerous genetic alterations remain to be fully characterized. In this study, using genomic data provided by The Cancer Genome Atlas (TCGA), we defined RNA-level alterations at both the transcript and exon levels between primary and metastatic samples of SKCM. Many genes related to immune response and epidermis development were significantly regulated at the level of transcription. In contrast, exon-level splicing changes were shown to be marginal, at least in the number of affected genes, but a functional link to cancer cell signaling was predicted for them. The mRNA expression of Epithelial Splicing Regulatory Protein 2 (ESRP2) was shown to be useful as a predictive marker of epithelial phenotype. To evaluate the clinical value of RNA-based measurements, we also tested the influence of somatic mutations and the correlation with patient survival times according to mRNA abundance. Our RNA-level findings promote the functional interpretation of genetic variants, and our exon-level analysis provides a more complete view of altered transcriptomics in SKCM.


S4. Bio/Medical Data Mining

Room: Terrace Ballroom
Date: Sunday, Oct. 16, 14:00 - 16:40
S4-1: Disease Causality Extraction based on Lexical Semantics and Clause Frequency from Biomedical Literature

Dong-gi Lee1 and Hyunjung Shin1,*

1 Department of Industrial Engineering, Ajou University, Wonchun-dong, Yeongtong-gu, Suwon 443-749, South Korea

Abstract
Motivation: Recently, research on human disease network has succeeded and has become an aid in figuring out the relationship between various diseases. In most disease networks, however, the relationship between diseases has been simply represented as an association. This representation results in the difficulty of identifying prior diseases and their influence on posterior diseases. In this paper, we propose a causal disease network that implements disease causality through text mining on biomedical literature.
Methods: To identify the causality between diseases, the proposed method includes two schemes: the first is the lexicon-based causality term strength, which provides the causal strength on a variety of causality terms based on lexicon analysis. The second is the frequency-based causality strength, which determines the direction and strength of causality based on document and clause frequencies in the literature.
Results: We applied the proposed method to 6,617,833 PubMed records and chose 195 diseases to construct a causal disease network. From all possible pairs of disease nodes in the network, 1,011 causal pairs of 149 diseases were extracted. The resulting network was compared with that of a previous study. In terms of both coverage and quality, the proposed method outperformed the existing method; it determined 2.7 times more causalities and showed higher correlation with associated diseases.
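The two schemes in the Methods (a lexicon that assigns strengths to causality terms, and clause frequencies that decide direction) can be pictured with a toy sketch. Everything specific in it is an assumption made for illustration: the lexicon entries and their weights, the "disease, term, disease" clause pattern, and the rule that the direction with more weighted evidence wins. The actual lexicon analysis and frequency formulas of the paper are not given in the abstract.

```python
import re

# Hypothetical lexicon: causality term -> causal strength.
CAUSAL_TERMS = {"causes": 1.0, "leads to": 0.8, "is associated with": 0.0}


def causal_scores(clauses, diseases):
    """Accumulate weighted evidence for each ordered (source, target) pair.

    Only clauses of the form "<disease> <term> <disease>" are matched;
    longer names and terms are tried first so they are not cut short.
    """
    names = "|".join(re.escape(d) for d in sorted(diseases, key=len, reverse=True))
    terms = "|".join(re.escape(t) for t in sorted(CAUSAL_TERMS, key=len, reverse=True))
    pattern = re.compile(rf"\b({names})\s+({terms})\s+({names})\b", re.IGNORECASE)
    scores = {}
    for clause in clauses:
        for match in pattern.finditer(clause):
            source, term, target = (part.lower() for part in match.groups())
            key = (source, target)
            scores[key] = scores.get(key, 0.0) + CAUSAL_TERMS[term]
    return scores


def causal_direction(scores, a, b):
    """Return the ordered pair with more evidence, or None if tied."""
    forward = scores.get((a, b), 0.0)
    backward = scores.get((b, a), 0.0)
    if forward == backward:
        return None
    return (a, b) if forward > backward else (b, a)


clauses = [
    "Obesity causes diabetes",
    "In adults, obesity leads to diabetes",
    "Diabetes causes obesity",
    "Obesity is associated with asthma",
]
scores = causal_scores(clauses, ["obesity", "diabetes", "asthma"])
edge = causal_direction(scores, "obesity", "diabetes")
```

A plain association term carries zero causal strength in this toy lexicon, so the obesity and asthma pair yields no directed edge, which mirrors the abstract's distinction between association and causality.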


S4-2: ICU Event Prediction by integrating Sequential Patterns as Classification Features

Shameek Ghosh1,*, Jinyan Li1, Hung Nguyen2, Kotagiri Ramamohanarao3

1 Advanced Analytics Institute, Faculty of Engineering and IT,
2 Centre for Health Technologies, Faculty of Engineering and IT, University of Technology Sydney, NSW 2007, Australia
3 Department of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia


Abstract
Pattern mining algorithms have previously been used to extract informative rules in various clinical contexts. However, the number of generated patterns is often very large. In most cases, the extracted rules are directly investigated by clinicians for understanding disease diagnoses. As the elicitation of important patterns for clinical investigation places a significant demand on precision and interpretability, it is essential to obtain a set of interpretable patterns for building advanced learning models of a patient's physiological condition, especially in critical care units. In this study, we present a two-stage classification framework based on sequential contrast patterns, which is used to detect critical patient events such as hypotension and patient mortality. In the first stage, we obtain a set of sequential patterns by using a contrast mining algorithm. In the second stage, these sequential patterns are post-processed into binary-valued or frequency-based features for developing a classification model. Our results on six critical care hypotension datasets and one large-scale mortality prediction dataset demonstrate better predictive capabilities when sequential patterns are used as features.
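The post-processing step, in which mined sequential patterns become binary-valued or frequency-based features, can be sketched as follows. The contrast mining stage is not shown; the patterns are simply assumed to be given. The symbol names, the gapped in-order matching rule, and the greedy counting scheme are all assumptions for illustration rather than the paper's exact definitions.

```python
def count_occurrences(pattern, sequence):
    """Greedy left-to-right count of non-overlapping, in-order matches.

    Gaps are allowed: the pattern's symbols must appear in order,
    but not necessarily next to each other.
    """
    if not pattern:
        return 0
    count, position = 0, 0
    for symbol in sequence:
        if symbol == pattern[position]:
            position += 1
            if position == len(pattern):
                count += 1
                position = 0
    return count


def to_features(patterns, sequence, binary=True):
    """Turn one patient's symbol sequence into a feature vector.

    binary=True gives presence/absence per pattern;
    binary=False gives the occurrence count per pattern.
    """
    counts = [count_occurrences(p, sequence) for p in patterns]
    return [int(c > 0) for c in counts] if binary else counts


# Hypothetical discretized blood-pressure readings for one patient.
readings = ["high", "low", "low", "high", "low"]
patterns = [("high", "low"), ("low", "low", "low"), ("low", "high", "high")]
binary_features = to_features(patterns, readings)
frequency_features = to_features(patterns, readings, binary=False)
```

Each patient thus yields one fixed-length vector, which can be passed to any standard classifier while each feature stays traceable to a readable pattern.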


S4-3: Quad-phased Data Mining Modeling for Dementia Diagnosis

Sunjoo Bang1, Hyunwoong Noh2, Jihye Lee3, Sungyun Bae3, Kyungwon Lee3, Changhyung Hong2, Sangjoon Son2,*, Hyunjung Shin1,*

1 Department of Industrial Engineering, Ajou University,
2 Department of Psychiatry, Ajou University School of Medicine,
3 Department of Digital Media, Ajou University, Wonchun-dong, Yeongtong-gu, Suwon 443-749, South Korea


Abstract
The number of people with dementia is increasing as populations age worldwide. Consequently, much research in computer-aided diagnosis (CAD) aims to improve the dementia diagnosis process. The most significant issue is that evaluation by physicians, which is based on patients' medical information and questionnaires completed by their guardians, is time-consuming, subjective, and prone to error. This problem can be addressed by an overall data mining modeling approach that supports the intuitive decisions of clinicians. In this paper, we therefore propose a quad-phased data mining modeling approach consisting of four modules. In the Proposer Module, diagnostic criteria that are effective for diagnosis are selected. Then, in the Predictor Module, a model is constructed to predict and diagnose dementia based on a machine learning algorithm. To help clinicians better understand the results of the predictive model, in the Descriptor Module we interpret the causes of diagnoses by profiling patient groups. Lastly, in the Visualization Module, we provide visualizations for effectively exploring the characteristics of patient groups. The proposed model is applied to the CREDOS study, which contains clinical data collected from 37 university-affiliated hospitals in the Republic of Korea from 2005 to 2013.


S4-4: Medical Concepts Embedding

Xu Min1, Xiaolei Xie2, Haibo Wang3,4, Ning Chen1, Ting Chen1,5

1 MOE Key Lab of Bioinformatics; Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST; Department of Computer Science and Technology; State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084 China
2 Department of Industrial Engineering, Tsinghua University, Beijing 100084 China
3 Clinical Trial Unit, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510080 China
4 China Standard Medical Information Research Center, Shenzhen, Guangdong 518054 China
5 Program in Computational Biology and Bioinformatics, University of Southern California, Los Angeles, CA 90089 USA


Abstract
One challenge in healthcare analytics is the large number of medical concepts, such as clinical diagnoses and surgical operations. A proper low-dimensional representation of these medical concepts is necessary for subsequent tasks. In this paper, we propose a fast and efficient model that embeds these medical concepts into a low-dimensional Euclidean space using the Skip-gram algorithm, based on co-occurrence information, to preserve the relatedness of these concepts. To demonstrate the effectiveness of the learned embedded representation, we apply our embedding method to the patient expense prediction problem using HQMS (Hospital Quality Monitoring System) data. In experiments, we compare our model with the one-hot vector representation in terms of prediction accuracy, showing a much improved R² value. The embedded vectors are further visualized with the t-SNE technique to demonstrate the effectiveness of grouping related medical concepts. We also analyze the model sensitivity and show that our model is not sensitive to the window size. Finally, we show that the embedding quality is positively correlated with the embedding dimension.
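The co-occurrence information that Skip-gram learns from can be illustrated by generating its training pairs. The sketch below treats one patient record as an ordered list of concept codes and emits (target, context) pairs within a window. Treating a record as a sequence, the code strings, and the function name are assumptions for illustration; the training itself (the embedding matrix and its optimization) is omitted.

```python
def skipgram_pairs(record, window=2):
    """List (target, context) pairs for every concept in a record.

    record: ordered list of medical concept codes for one patient.
    window: how many neighbours on each side count as context.
    """
    pairs = []
    for i, target in enumerate(record):
        start = max(0, i - window)
        stop = min(len(record), i + window + 1)
        pairs.extend((target, record[j]) for j in range(start, stop) if j != i)
    return pairs


# Hypothetical record: two diagnoses followed by one operation.
record = ["dx:hypertension", "dx:diabetes", "op:bypass"]
pairs = skipgram_pairs(record, window=1)
```

Concepts that frequently appear in each other's windows produce many shared pairs, which is what pulls their vectors together during training. The window size studied in the abstract's sensitivity analysis corresponds to the `window` parameter here.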


S5. Network Biology and Medicine

Room: Terrace Ballroom
Date: Monday, Oct. 17, 10:00 - 11:40
S5-1: Integrative Information Theoretic Network Analysis for GWAS of Aspirin Exacerbated Respiratory Disease in Korean Population

Sehee Wang1, Hyun-hwan Jeong2,3, Dokyoon Kim4,5, Kyubum Wee1, Hae-Sim Park6, Seung-Hyun Kim6,7,*, Kyung-Ah Sohn1,*

1 Department of Software and Computer Engineering, Ajou University, Suwon 16499, South Korea
2 Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, Texas 77030, USA
3 Department of Human and Molecular Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
4 Department of Biomedical & Translational Informatics, Geisinger Health System, Danville, PA 17822, USA
5 The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA
6 Department of Allergy and Clinical Immunology, Ajou University School of Medicine, Suwon, Korea
7 Translational Research Laboratory for Inflammatory Disease, Clinical Trial Center, Ajou University Medical Center, Suwon, South Korea


Abstract
Aspirin Exacerbated Respiratory Disease (AERD) is a chronic medical condition that encompasses asthma, nasal polyposis, and hypersensitivity to aspirin and other non-steroidal anti-inflammatory drugs. Several previous studies have shown that part of the genetic effect of the disease may be induced by the interaction of multiple genetic variants. However, the heavy computational cost and the complexity of the underlying biological mechanism have prevented a thorough investigation of epistatic interactions, and thus most previous studies have considered only a small number of genetic variants at a time. In this study, we propose a gene-network-based analysis framework to identify genetic risk factors from a genome-wide association study (GWAS) dataset. We first derive multiple single nucleotide polymorphism (SNP)-based epistasis networks that consider marginal and epistatic effects by using different information theoretic measures. Each SNP epistasis network is converted into a gene-gene interaction network, and the resulting gene networks are combined into one for downstream analysis. The integrated network is validated against the existing knowledge bases DisGeNET, for known gene-disease associations, and GeneMANIA, for biological function prediction. We demonstrate our proposed method on a Korean GWAS dataset, which has genotype information on 440,094 SNPs for 188 cases and 247 controls. The topological properties of the generated networks are examined for scale-freeness, and we further perform various statistical analyses in the Allergy and Asthma Portal (AAP) using the selected genes from our integrated network. Our results reveal several gene modules in the network that are biologically significant and show evidence of controlling susceptibility to, and being related to the treatment of, AERD.
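
The abstract does not name the information theoretic measures used. As a hedged sketch, assuming mutual information for the marginal effect of a SNP and an interaction gain for the epistatic effect of a SNP pair, edge weights of a SNP epistasis network could be computed along these lines. All function names and the toy genotypes are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def interaction_gain(snp_a, snp_b, phenotype):
    """Information about the phenotype carried by the pair beyond each SNP alone."""
    pair = list(zip(snp_a, snp_b))
    return (mutual_information(pair, phenotype)
            - mutual_information(snp_a, phenotype)
            - mutual_information(snp_b, phenotype))


# Toy data (assumption): a purely epistatic, XOR-like relationship.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]

marginal = mutual_information(snp_a, phenotype)       # no marginal effect
epistatic = interaction_gain(snp_a, snp_b, phenotype)  # strong pairwise effect
```

In this toy case neither SNP is informative alone, while the pair fully determines the phenotype, which is the kind of signal that single-variant analysis misses.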

S5-2: Taking promoters out of enhancers in sequence based predictions of tissue-specific mammalian enhancers

Julia Herman-Izycka1, Michal Wlasnowolski1, Bartek Wilczynski1,*

1 University of Warsaw, Krakowskie Przedmieście 26/28, 00-927 Warszawa, Poland

Abstract
Motivation: Many genetic diseases are caused by mutations in non-coding regions of the genome. These mutations are frequently found in enhancer sequences, causing disruption to the regulatory programme of the cell. Enhancers are short regulatory sequences in the non-coding part of the genome that are essential for the proper regulation of transcription. While the experimental methods for identification of such sequences are improving every year, our understanding of the rules behind enhancer function has not progressed much in the last decade. This is especially true in the case of tissue-specific enhancers, where there are clear problems in predicting the specificity of enhancer function.
Results: We present a random-forest-based machine learning approach that matches the performance of current state-of-the-art methods for enhancer prediction. We then show that, like other published methods, it frequently cross-predicts enhancers as active in different tissues, which makes it less useful for predicting tissue-specific activity. We further show that this problem arises because enhancer-predicting models are biased towards predicting gene promoters as active enhancers. Finally, we show that a two-step classifier can lower cross-prediction between tissues.
Availability: The software needed to train the models is available at http://github.com/regulomics/enhancer prediction and the predictions themselves are available at http://regulomics.mimuw.edu.pl:8888

S5-3: Modeling Long-Term Human Activeness Using Recurrent Neural Networks for Biometric Data

Zae Myung Kim1, Chae-Gyun Lim1, Hyungrai Oh2, Kyo-Joong Oh1, Ho-Jin Choi1,*

1 School of Computing, KAIST, Daejeon 34141, South Korea
2 Samsung Seoul R&D Campus, Samsung Electronics, Seoul 06765, South Korea


Abstract
This paper explores the feasibility of modeling a person's "activeness" using biometric data retrieved from a fitness tracker. Currently, the activeness of a user over a given period of time is defined as a tuple of three types of biometric data: heart rate, consumed calories, and the number of steps taken. Four recurrent neural network architectures are proposed to investigate performance in predicting the activeness of the user under various length-related hyper-parameter settings. In addition, the learned model is tested on predicting the time period in which the user's activeness falls below a certain threshold. The dataset used in this study consists of several months of biometric time series data gathered independently from seven users. The experimental results show that forecasting the users' activeness is indeed feasible under suitable lengths of input and output sequences.
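
To make the "length-related hyper-parameters" concrete, the following minimal sketch slices a time series of activeness tuples into (input, output) training samples. It is an assumption about the data preparation, not the authors' code; the function name `make_windows` and the sample values are hypothetical.

```python
def make_windows(series, input_len, output_len):
    """Slice a time series into (input sequence, output sequence) samples."""
    samples = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        split = start + input_len
        samples.append((series[start:split], series[split:split + output_len]))
    return samples


# Hypothetical activeness tuples: (heart rate, consumed calories, steps taken).
series = [
    (62, 80, 120), (75, 110, 900), (88, 150, 1500),
    (70, 95, 400), (64, 82, 150), (91, 160, 1700),
]

# Predict the next 2 time steps from the previous 3.
samples = make_windows(series, input_len=3, output_len=2)
```

Each sample pairs an input sequence of length `input_len` with the `output_len` steps that follow it, which is the shape a sequence-to-sequence recurrent model would consume.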

S5-4: Cascade Recurrent Deep Networks for Audible Range Prediction

Yonghyun Nam1, Oak-Sung Choo2, Yu-Ri Lee2, Yun-Hoon Choung2,*, Hyunjung Shin1,*

1 Department of Industrial Engineering, Ajou University, Suwon, Korea
2 Department of Otolaryngology, Ajou University School of Medicine, Suwon, Korea


Abstract
Hearing aids amplify sounds at certain frequencies to help patients with hearing loss improve their quality of life. Variables affecting hearing improvement include the characteristics of the patients' hearing loss, the characteristics of the hearing aids, and the characteristics of the frequencies. Although the former two have been studied, only a few models reflect the characteristics of frequencies. Therefore, we propose a new machine learning algorithm that estimates the degree of hearing improvement expected from wearing hearing aids. The proposed algorithm combines a cascade structure, a recurrent structure, and a deep network structure. The cascade structure reflects correlations between frequency bands. In the recurrent structure, output variables from the network for one frequency band are reused as input variables for the other networks. The deep network structure has many hidden layers. We call such a network a cascade recurrent deep network; its training consists of two phases, a cascade phase and a tuning phase. When applied to the medical records of 2,182 patients treated for hearing loss, the proposed algorithm reduced the error rate by 58% compared with other neural networks. The proposed algorithm is novel and can be applied to signal or sequential data. Clinically, it can serve as a medical assistance tool that helps fulfill patients' satisfaction.

S6. PharmacoGenomic Applications

Room: Regency Ballroom
Date: Monday, Oct. 17, 14:00 - 15:15
S6-1: Tissue specificity of in vitro drug sensitivity

Fupan Yao1,2,§, Zhaleh Safikhani1,2,§, Seyed Ali Madani Tonekaboni1,2, Petr Smirnov3,4, Nehme El-Hachem1, Mark Freeman1,2, Venkata Satya Kumar Manem1,2, Benjamin Haibe-Kains1,2,5,6,*

1 Princess Margaret Cancer Centre, Toronto, Ontario M5G 1L7, Canada
2 Department of Medical Biophysics, University of Toronto, Toronto, Ontario M5G 1L7, Canada
3 Integrative systems biology, Institut de Recherches Cliniques de Montréal, Montreal, Quebec, Canada
4 Department of Medicine, University of Montreal, Montréal, Quebec, Canada
5 Department of Computer Science, University of Toronto, Toronto, Ontario M5T 3A1, Canada
6 Ontario Institute of Cancer Research, Toronto, Ontario M5G 1L7, Canada


Abstract
Research in oncology traditionally focuses on the specific tissue type from which the cancer develops. However, advances in high-throughput molecular profiling technologies have enabled the comprehensive characterization of molecular aberrations in multiple cancer types. It was hoped that these large-scale pharmacogenomic data would provide the foundation for a paradigm shift in oncology which would see tumors being classified by their molecular profiles rather than tissue types, but tumors with similar genomic aberrations may respond differently to targeted therapies depending on their tissue of origin. There is therefore a need to reassess the potential association between pharmacological response and tissue of origin for cytotoxic and targeted therapies, as well as how these associations translate from preclinical to clinical settings. In this paper, we investigate the tissue specificity of drug sensitivities in large-scale pharmacological studies and compare these associations to those found in clinical trial descriptions. Our meta-analysis of the four largest in vitro drug screening datasets indicates that tissue of origin is strongly associated with drug response. We identify novel tissue-drug associations, which may present exciting new avenues for drug repurposing. One caveat is that the vast majority of the significant associations found in preclinical settings do not concur with clinical observations. Accordingly, our results call for more testing to find the root cause of the discrepancies between preclinical and clinical observations.

S6-2: Genome Sequence Variability Predicts Drug Precautions and Withdrawals from the Market.

Kye Hwa Lee1, Su Youn Baik1, Soo Youn Lee1, Chan Hee Park1, Paul J. Park3, Ju Han Kim1,2,*

1 Seoul National University Biomedical Informatics (SNUBI), Division of Biomedical Informatics, Seoul National University College of Medicine, Seoul 110799, Korea
2 Biomedical Informatics Training and Education Center (BITEC), Seoul National University Hospital, Seoul 110744, Korea
3 Department of Physiology and Cell Biology, University of Nevada School of Medicine, Reno, NV, USA


Abstract
Despite substantial premarket efforts, a significant portion of approved drugs has been withdrawn from the market for safety reasons. The deleterious impact of nonsynonymous substitutions predicted by the SIFT algorithm on structure and function of drug-related proteins was evaluated for 2504 personal genomes. Both withdrawn (n=154) and precautionary (Beers criteria (n=90), and US FDA pharmacogenomic biomarkers (n=96)) drugs showed significantly lower genomic deleteriousness scores (P < 0.001) compared to others (n=752). Furthermore, the rates of drug withdrawals and precautions correlated significantly with the deleteriousness scores of the drugs (P < 0.01); this trend was confirmed for all drugs included in the withdrawal and precaution lists by the United Nations, European Medicines Agency, DrugBank, Beers criteria, and US FDA. Our findings suggest that the person-to-person genome sequence variability is a strong independent predictor of drug withdrawals and precautions. We propose novel measures of drug safety based on personal genome sequence analysis.

S6-3: Network Mirroring for Drug Repositioning

Sunghong Park1, Dong-gi Lee1, Hyunjung Shin1,*

1 Department of Industrial Engineering, Ajou University, Wonchun-dong, Yeongtong-gu, Suwon 443-749, South Korea

Abstract
Although drug discovery can provide meaningful insights and significant advances in the pharmaceutical field, the time and cost it takes can be extensive, and the success rate is low. To circumvent this problem, there has been increased interest in 'drug repositioning', in which one searches for already approved drugs that have a high potential for efficacy when applied to other diseases. To increase the success rate of drug repositioning, one considers stepwise screening and experiments based on biological reactions. Given the number of drugs and diseases, however, this one-by-one procedure may be time-consuming and expensive. In this study, we propose a machine-learning-based approach to efficiently select candidate diseases and drugs. We assume that if two diseases are similar, then a drug for one disease can be applicable to the other. We first construct two disease networks, one from disease-protein associations and the other from disease-drug information. If the two networks are dissimilar, there remains room for finding either candidate diseases for a drug or candidate drugs for a disease. The Kullback-Leibler divergence is employed to measure the difference in connections between the two constructed disease networks. Lastly, we perform drug repositioning for the top 20% ranked diseases. The F-measure of the proposed method was 0.75, compared with 0.5 for a greedy search over all diseases.
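
As a rough sketch of the ranking step (not the authors' implementation), the Kullback-Leibler divergence can be computed between a disease's connection weights in the two networks, and diseases can then be ranked by that score. The smoothing constant, the disease labels, and the toy weights below are assumptions made for illustration.

```python
import math


def normalize(weights, eps=1e-9):
    """Turn non-negative connection weights into a smoothed probability vector."""
    smoothed = [w + eps for w in weights]  # eps avoids log(0) and division by zero
    total = sum(smoothed)
    return [w / total for w in smoothed]


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two probability vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical rows: each disease's connections to three other diseases in the
# disease-protein network (first list) and the disease-drug network (second list).
rows = {
    "D1": ([0.9, 0.1, 0.0], [0.0, 0.1, 0.9]),
    "D2": ([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]),
    "D3": ([0.3, 0.3, 0.4], [0.3, 0.4, 0.3]),
    "D4": ([0.2, 0.8, 0.0], [0.3, 0.7, 0.0]),
    "D5": ([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
}

scores = {
    name: kl_divergence(normalize(protein), normalize(drug))
    for name, (protein, drug) in rows.items()
}

# Keep the top 20% most dissimilar diseases as repositioning candidates.
ranked = sorted(scores, key=scores.get, reverse=True)
top = ranked[:max(1, round(0.2 * len(ranked)))]
```

A high score means the disease is connected differently in the two networks, which the abstract treats as room for new disease-drug candidates.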

S7. Linking Phenotypes

Room: Terrace Ballroom
Date: Monday, Oct. 17, 14:00 - 15:15
S7-1: An integrative approach for analyzing host factors during tuberculosis infection

Rama Kaalia1, Indira Ghosh1,*

1 School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India

Abstract
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (MTB). Though pathogenic virulence is important, the host response to MTB is known to play an important role in the manifestation of clinical symptoms of the disease. Not everyone exposed to the bacterium gets sick with this disease. Identifying target genes in MTB is important, but completely eliminating TB requires a focus on host-pathogen interactions. The main objective of the present work is to use a context-based approach to integrate different levels of information available for the disease and to study the factors associated with host response in TB infection. We have developed a Disease Association Ontology for Tuberculosis (DAO-tb) that provides a standard ontology-driven platform for describing host genes/proteins, pathways involved in tuberculosis, and the role of host genes during infection, and for integrating functional associations from various interaction levels (gene-disease, gene-pathway, gene-function, gene-cellular component and protein-protein interactions). DAO-tb consists of 79 classes including 7 super classes. Our ontology provides a semantic-based framework for querying and analyzing the disease-associated information in the form of RDF graphs. Link analysis algorithms (PageRank, HITS (Hyperlink Induced Topic Search) and HITS with semantic weights) are used to score the host gene nodes on the basis of their functional associations during infection. The protocol developed above is used to predict novel potential host-based targets for TB from the long list of loose gene-disease associations.
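
For readers unfamiliar with link analysis, the sketch below runs plain PageRank by power iteration on a tiny directed association graph. It is a generic illustration under assumed inputs, not the DAO-tb pipeline: the node names, edges, and damping factor are hypothetical, and the semantic weighting mentioned in the abstract is not modeled.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Score nodes of a directed graph by PageRank using power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical functional associations (gene -> pathway, pathway -> gene, gene -> gene).
edges = [
    ("geneA", "pathway1"),
    ("geneB", "pathway1"),
    ("pathway1", "geneA"),
    ("geneD", "geneA"),
]
ranks = pagerank(edges)
```

Here `geneA` receives links from both a pathway and another gene, so it scores higher than `geneB`, which nothing points to.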

S7-2: A meta-analysis of gene expression profiles to discover obesity signatures in peripheral blood mononuclear cells

Haangik Park1, Yul Kim1, and Gwan-Su Yi1,*

1 Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Korea

Abstract
Obesity is typically defined as a state of abnormal body fat accumulation. Because of its association with the pathogenesis of various diseases, revealing the biological mechanisms of obesity and constructing a model of it have become popular research goals. Among studies of obesity, gene signatures in blood tissue have often been examined because of its interrelation with fat tissue. However, previous studies comparing obese patients with controls were limited by small sample sizes and high model-dependent variability. These problems made it very difficult to construct a general model of obesity in blood. To overcome this drawback, we constructed a meta-dataset by merging four blood transcriptome microarray datasets of obese patients and control subjects. Next, we introduced statistical testing and several classification tasks based on a combination of random partitioning, t-test, and SVM-RFE. We confirmed the validity of our selection method by cross-validation. As a result, we obtained 50 differential gene expression signatures among 124 obese patients. Furthermore, we demonstrated that our findings are associated with key obesity mechanisms and with some diseases caused by obesity. In conclusion, we revealed obesity signatures in blood tissue that can be applied to studies of the effect of obesity on the entire body and of obesity-related disease pathogenesis.

S7-3: SEXCMD: Development of Sex Determination Markers for next-generation sequencing data

Seongmun Jeong1, Jiwoong Kim2, Won Park1, Namshin Kim1,*

1 Personalized Genomic Medicine Research Center, Division of Strategic Research Groups, Korea Research Institute of Bioscience and Biotechnology, Daejeon 34141, Korea
2 Quantitative Biomedical Research Center, Department of Clinical Sciences, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA


Abstract
Generally, array-based single nucleotide polymorphism (SNP) genotyping technology uses a few markers or the B-allele frequency of sex chromosomes for sex determination. However, this is not applicable to the latest next-generation sequencing (NGS)-based data types because those markers must be known a priori. Also, one must align all reads onto a reference genome to obtain B-allele frequency information on sex chromosomes. We developed a novel approach to extract sex marker sequences from sex chromosomes in vertebrate and mammalian genomes. By simply counting the total number of reads mapped onto each sex marker sequence, we can easily identify sex information in a very short time without aligning all sequence reads. We successfully tested our bioinformatics pipeline and sex marker sequences on human data. Usually, we can identify human sex information from exome-sequencing data within a few minutes, and within an hour or so for high-coverage whole genome sequencing data. The human Y chromosome has pseudoautosomal regions (PAR) that are exact duplicates of regions on the X chromosome. If we include the Y chromosome for female genotyping, those sequence reads can be moved to the Y chromosome instead. Here we report an open-source and easy-to-use program, "SEXCMD", that can identify sex using user-created sex markers. It aligns reads onto the created sex marker sequences, extracted from homologous regions between sex chromosomes, and counts the number of mapped reads. SEXCMD gives putative sex information within about 10 minutes for human whole genome sequencing data.
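
The abstract does not state SEXCMD's decision rule. Purely as an illustration of counting reads per marker, a simple ratio rule over reads mapped to X-derived and Y-derived marker sequences might look like the sketch below. The function name `call_sex`, the threshold `min_y_fraction`, and the read counts are all assumptions and are not taken from SEXCMD.

```python
def call_sex(marker_read_counts, min_y_fraction=0.1):
    """Call sex from reads mapped to X- and Y-derived marker sequences.

    marker_read_counts maps (chromosome, marker_id) to a mapped-read count.
    min_y_fraction is an illustrative threshold, not a value from SEXCMD.
    """
    x_reads = sum(n for (chrom, _), n in marker_read_counts.items() if chrom == "X")
    y_reads = sum(n for (chrom, _), n in marker_read_counts.items() if chrom == "Y")
    total = x_reads + y_reads
    if total == 0:
        return "undetermined"  # no reads hit any marker
    return "male" if y_reads / total >= min_y_fraction else "female"


# Hypothetical mapped-read counts for two samples.
sample_a = {("X", "m1"): 480, ("X", "m2"): 510, ("Y", "m1"): 2, ("Y", "m2"): 0}
sample_b = {("X", "m1"): 250, ("X", "m2"): 240, ("Y", "m1"): 230, ("Y", "m2"): 260}

call_a = call_sex(sample_a)
call_b = call_sex(sample_b)
```

Because only reads that hit the short marker sequences are counted, this kind of rule avoids aligning every read to a full reference genome, which is the speed advantage the abstract describes.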
