TBC 2012 paper

S1-1

MiST: Variant-detection through whole-exome sequencing

Sailakshmi Subramanian¹, Valentina Di Pierro¹, Hardik Shah¹, Ajish George¹, Bruce Gelb¹, Ravi Sachidanandam¹

¹Mount Sinai School of Medicine, United States

Whole-exome sequencing is a promising approach to find causative mutations in human disease, especially for Mendelian disorders. It involves the capture of sequences from exons in genomic DNA using probes from exonic regions of the genome. The captured exonic sequences are deeply sequenced and analyzed for variants relative to the reference genome. There are several tools to align sequenced reads to reference genomes and call SNPs and variants. We have developed a variant-calling platform, MiST, that builds on our previously published tool, Geoseq. The tool mimics the experimental technique, computationally fishing reads from the deep-sequencing set using probes from the targeted exons. The captured reads are mapped with great sensitivity to accurately call SNPs and variants. Our pipeline carefully eliminates paralogous read mapping, which can lead to spurious SNP calls. It also tracks strand bias and clonality in the sequencing libraries, allowing for more accurate measurements of coverage and variant detection. The platform identifies variant calls that have already been seen in other samples by comparing them to a database of known variants collected from dbSNP, the 1000 Genomes Project and private variant collections. A web-based interface allows end users to explore and visualize the alignments and other raw data underlying a variant call, and to rapidly filter calls based on known and predicted functional characteristics. The pipeline is parallelizable and runs over a cluster, allowing the process to be scaled up. We used targeted re-sequencing (Sanger) to confirm the validity of a few of the variants inferred by MiST. In addition, we compare its calls to variant calls made by the GATK platform and demonstrate the benefits of our approach, as well as the commonalities between the programs.

S1-2

Improve the Nucleotide Coding Technique, Use Support Vector Machine, Get the Better Accuracy: Survey of Human Splice Site Prediction

A.T.M. Golam Bari¹, Mst. Rokeya Reaz¹, Md. Azam Hossain¹, Ho-Jin Choi², Byeong-Soo Jeong¹

¹Kyung Hee University, Dept. of Computer Engineering, 1732 Deokyoungdaero, Giheung-gu, Yongin-si, Gyeonggi-do, 446-701, Republic of Korea
²Korea Advanced Institute of Science and Technology, 335 Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea

Splice site prediction in a DNA sequence is a basic search problem: finding the exon-intron and intron-exon boundaries. Removing introns and joining the exons together forms the coding sequence, which is the input to translation and a necessary step in the central dogma of molecular biology. The main task of splice site prediction is to locate the GT and AG dinucleotides that mark intron boundaries within a sequence of A, T, C and G nucleotides, and to distinguish true splice sites from false GT and AG occurrences. In this paper, we survey recent research on splice site prediction based on support vector machines (SVM). The basic difference among these works is the nucleotide encoding technique: some methods encode sequences sparsely, whereas others encode them probabilistically. The encoded sequences serve as input to the SVM, whose task is to classify them using its learned model. We examine each encoding technique and group the techniques by similarity. Our survey provides a basic understanding of encoding approaches for splice site prediction.
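To make the two encoding families concrete, the following minimal sketch contrasts a sparse (one-hot) encoding with a simple position-specific frequency encoding. The function names and toy donor-site windows are hypothetical and are not taken from any of the surveyed papers; the frequency scheme is only one possible "probabilistic" encoding. Either vector could then be passed to an SVM.

```python
from collections import Counter

ALPHABET = "ACGT"

def sparse_encode(seq):
    """One-hot (sparse) encoding: four binary features per nucleotide."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == letter else 0.0 for letter in ALPHABET)
    return vec

def position_frequencies(training_seqs):
    """Estimate per-position nucleotide frequencies from equal-length sequences."""
    length = len(training_seqs[0])
    table = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in training_seqs)
        table.append({b: counts[b] / len(training_seqs) for b in ALPHABET})
    return table

def probabilistic_encode(seq, table):
    """Replace each nucleotide by its estimated frequency at that position."""
    return [table[i][base] for i, base in enumerate(seq.upper())]

# Toy windows around a GT dinucleotide (illustrative data only).
true_sites = ["AGGTAA", "CGGTGA", "AGGTAC", "TGGTAA"]
freqs = position_frequencies(true_sites)

print(sparse_encode("GT"))                    # 8 binary features
print(probabilistic_encode("AGGTAA", freqs))  # 6 real-valued features
```

The sparse vector grows by four features per position and carries no training-set information, whereas the frequency-based vector has one feature per position but depends on the sequences used to estimate the table.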

S1-3

New Features of MTRAP Alignment and its Advantage: All-in-one Interface for Sequence Analysis, MSA and the Support for Non-coding RNA

Toshihide Hara¹,², Keiko Sato¹,², Masanori Ohya¹,²

¹Department of Information Sciences, Tokyo University of Science, ²Quantum Bio-Informatics Research Division, Tokyo University of Science, 2641 Yamazaki, Noda City, Chiba, Japan

Alignment of protein or DNA/RNA sequences is one of the most important tasks in modern bioinformatic analysis. Studies in this field start with the comparison of target sequences, and the comparison is realized by constructing an alignment. With the rapid increase of genome data driven by Next Generation Sequencing, the need for high-quality alignment has become more apparent. Despite this obvious need, the quality of current alignments is not sufficient. We recently developed a high-quality alignment method called MTRAP, and showed that sequence alignment can be significantly improved by considering the correlation between two consecutive pairs of residues. In the first paper we showed that our method generates good results for protein sequences, but it was not known whether it works for DNA/RNA sequences. In this paper, we present our recent study on non-coding RNA sequences. In addition, we explain the new features of the recent version of MTRAP.

S2-1

Computational Methods for Cancer Subtype Classification using Integrated Data

Shinuk Kim¹,²,³, Taesung Park², Mark Kon¹,³

¹Bioinformatics program, Boston University, Boston, MA 02215, USA
²Department of Statistics, Seoul National University, Seoul 151-747 Republic of Korea
³Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA

MicroRNAs (miRNAs) are known to be strongly involved in cancer pathology through regulation of target messenger RNA (mRNA) molecules. We study a potentially useful methodology based on machine learning (ML) that integrates separate biomarker classes to improve prediction and separation of ovarian cancer survival times. We use an ML-based protocol for feature selection, integrating information from miRNA and mRNA profiles at the feature level. For prediction of survival phenotypes, we use two classifiers: a machine learning method (support vector machine, SVM) and a novel regression-based method (SVM-based Fisher feature selection together with Cox proportional hazard regression, FSCR). We compared these two methods using three types of cancer tissue features: i) miRNA expression, ii) mRNA expression, and iii) integrated miRNA and mRNA expression information, with features selected either from combined miRNA/mRNA profiles (CFS) or separately from the two feature sets (IFS). The accuracy of survival classification using the combined miRNA/mRNA profiles was 88.64% using IFS-SVM and 84.09% using IFS-FSCR in a balanced dataset. These accuracies are higher than those using miRNA alone (81.82%, SVM; 75%, FSCR) or mRNA alone (70.45%, SVM; 72.73%, FSCR). These differences indicate sometimes strong interactions between miRNA and mRNA features which are not visible in individual analyses. In addition, we focus on the most significant miRNAs obtained by SVM-based feature selection, which include hsa-miR-23b and hsa-miR-27b. By integrating sequence information with gene expression profiles, we predicted 16 target genes of hsa-miR-23b and hsa-miR-27b, which include cancer-related genes.

S2-2

A combination algorithm for 5-year survivability of breast cancer patient

Kung-Jeng Wang¹, Bunjira Makond¹, and Kun-Huang Chen¹,²

¹Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan, R.O.C. ²School of Dentistry, College of Oral Medicine, Taipei Medical University, Taipei, 110, Taiwan, R.O.C.

In this study, we propose a new algorithm to improve classification of 5-year survivability of breast cancer patients when the data set is imbalanced. The algorithm combines the synthetic minority oversampling technique (SMOTE) with a particle swarm optimization (PSO) based decision tree (C5): SMOTE+PSO+C5. G-mean is used as the metric to evaluate the proposed algorithm, which is compared with PSO+C5 and C5. The results show that the SMOTE+PSO+C5 algorithm has the highest performance for classifying 5-year survivability of breast cancer patients when the data set is imbalanced. The proposed method classifies both survival and non-survival cases well. In addition, applying the PSO+C5 method to imbalanced data does not improve classification performance over using the standard classifier alone.
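The evaluation metric named above can be illustrated with a short sketch. G-mean is commonly defined as the geometric mean of sensitivity and specificity; the helper name and the toy labels below are hypothetical and only show why the metric suits imbalanced data. They do not reproduce the study's data or results.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy imbalanced labels: 8 survivors (1) and 2 non-survivors (0).
y_true = [1] * 8 + [0] * 2

always_survive = [1] * 10                    # 80% accuracy, ignores minority class
balanced_guess = [1] * 6 + [0] * 2 + [0] * 2  # misses 2 survivors, finds both others

print(g_mean(y_true, always_survive))
print(g_mean(y_true, balanced_guess))
```

A classifier that always predicts the majority class scores zero on G-mean despite high accuracy, which is why the metric rewards methods that classify both survival and non-survival cases.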

S2-3

Gene Interaction-Level Cancer Classification using Gene Expression Profiles

Ashis Saha, Jaewoo Kang¹

¹Korea University, Seoul 136713, Korea.

Recent studies suggest that biological pathways can be stronger biomarkers for cancer than individual genes. Pathway knowledge bases contain the interactions among genes. However, not all the genes in a pathway necessarily interact with each other. Closely interacting genes are thought to have a collective effect in causing cancer or other diseases. Here we propose a novel cancer classification method that utilizes the collective effect of a set of closely interacting genes, which we call a Gene Interaction Set (GIS). We first find the possible strength levels of each gene interaction set using a clustering method. We then rank all the sets with our proposed entropy metric, which uses the proportion of samples of different classes that have the same strength level. Finally, we predict the class of a new sample by weighted voting of the top k gene interaction sets. An important feature of our method is that the process causing the disease can easily be traced. We validate our method on 7 cancer datasets by comparing it with other classification methods known to produce very high accuracy.

S3-1

Globally Inferring Targets From Phenotypic Small-Molecule Screens

S. Joshua Swamidass¹,², Michael Barratt¹, Bradley T. Calhoun¹

¹Division of Laboratory and Genomic Medicine, Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO.
²Chemical Biology/Novel Therapeutics, Broad Institute of Harvard and MIT, Cambridge, MA

A central challenge in modern drug discovery is the identification of the target proteins and pathways that can be manipulated to modulate disease. Gaps in our understanding of how targets modulate disease are evident in the high rate of Phase II clinical trial failures, when medicines are first tested for efficacy. The high reward for finding novel connections between targets and diseases is evident in several examples where known medicines have been repurposed to treat new diseases. In this study, we present and validate a new way of Globally Inferring protein Targets from Phenotypes (GIPT) by finding patterns in small-molecule screens of medically relevant cellular assays. Mining phenotypic small-molecule screens is a promising strategy because it leverages translatable experimental data and because it is biased towards druggable proteins. We demonstrate that this strategy can both recover known targets and suggest plausible novel targets for several medically relevant phenotypes (including insulin signaling, amyloid precursor protein expression, and cyclic-AMP levels), with applications in diabetes, Alzheimer's disease, and depression.

S3-2

More Reproducible Results from Small-sample Clinical Genomics Studies by Multi-Parameter Shrinkage, with Application to High-throughput RNA Interference Screening Data

Mark A. van de Wiel¹, Renee X. de Menezes², Ellen Siebring²,³, Victor W. van Beusechem²

¹Department of Epidemiology and Biostatistics, ²RNA Interference Functional Oncogenomics Laboratory (RIFOL), ³Department of Pulmonary Disease, VU University Medical Center, PO Box 7057, 1007 MB Amsterdam, The Netherlands

High-throughput (HT) RNA interference screens are increasingly used for reverse genetics and drug discovery. These experiments are laborious and costly, hence sample sizes are often very small. Powerful statistical techniques to detect siRNAs that potentially enhance treatment are currently lacking, because existing methods do not optimally use the amount of data in the other dimension, the feature dimension. We introduce ShrinkHT, a Bayesian method for shrinking multiple parameters in a statistical model, where 'shrinkage' refers to borrowing information across features. ShrinkHT is very flexible in fitting the effect size distribution for the main parameter of interest, thereby accommodating skewness that naturally occurs when siRNAs are compared with controls. In addition, it naturally down-weights the impact of nuisance parameters (e.g. assay-specific effects) when these tend to have little effect across siRNAs. We show that these properties lead to better ROC curves than with the popular limma software. Moreover, in a 3 + 3 treatment vs control experiment with 'assay' as an additional nuisance factor, ShrinkHT is able to detect three significant siRNAs with stronger enhancement effects than the positive control. In the context of gene-targeted (conjugate) treatment, these are interesting candidates for further research.

S3-3

Breast Cancer Survivability Prediction with Labeled, Unlabeled, and Pseudo-Labeled Patient Data

Juhyeon Kim¹, Hyunjung Shin¹

¹Department of Industrial Engineering, Ajou University, Wonchun-dong, Yeongtong-gu, Suwon 443-749, South Korea

Prognostic studies of breast cancer survivability have been aided by machine learning algorithms, which predict the survival of a particular patient on the basis of historical patient data. A labeled patient record, however, is not easy to collect. It takes at least five years to label a patient record as “survived” or “not survived”; meanwhile, unguided trials of numerous types of oncology therapy are costly. Moreover, obtaining a labeled patient record requires confidentiality agreements from both doctors and patients. These difficulties have drawn researchers' attention to Semi-Supervised Learning (SSL), one of the most recent machine learning approaches, since it can also utilize unlabeled patient data, which are much easier to collect, and is therefore regarded as a pertinent way to circumvent the difficulties. However, it remains true even for SSL that more labeled data lead to better prediction. To make up for the insufficiency of labeled patient data, one may consider tagging unlabeled patient data with virtual labels, namely “pseudo-labels”, and using them as if they were labeled. The proposed algorithm, “SSL Co-training”, implements this idea based on SSL. SSL Co-training was tested on the Surveillance, Epidemiology, and End Results (SEER) database for breast cancer and achieved an average accuracy of 76% and an average AUC of 0.81.

S4-1

Semantic PubMed Searches

Illhoi Yoo¹,²

¹Health Management & Informatics, School of Medicine, ²Informatics Institute, University of Missouri, Columbia, MO, USA

The Evidence-Based Medicine (EBM) Working Group has defined efficient biomedical literature searching as a core skill required for the practice of EBM. Although the information obtained from PubMed could significantly improve the quality of health care, physicians typically do not pursue their questions about patient care. This paper discusses the importance of PubMed searches for physicians, identifies the origin of the well-known obstacles to answering physicians’ clinical questions using PubMed, and introduces a novel system called Semantic-oriented MEDLINE search (SoMs) that addresses the underlying problems to enhance physicians’ information retrieval experience in PubMed. Drawing on a variety of literature in information retrieval, cognitive science, and medical science, we analyzed widely accepted obstacles to answering physicians’ clinical questions and then identified the origins of the obstacles to provide a technical solution for each obstacle category. The problem with physicians’ information-seeking behavior is two-fold: a user-side problem and a system-side problem. The user-side problem comes from the user’s emergent information needs and unfamiliarity with MeSH terms and the MeSH Tree, and the system-side problem comes from the fragmented information available from PubMed. We suggest the use of a biomedical semantic network with a concept-filtering tool to address the emergent information need problem, and the Concept-Based PubMed Archive (CBPA) to address the fragmented information problem. SoMs can concisely answer many clinical questions that PubMed cannot.

S4-2

Research Domain Grouping and Analysis in Bioinformatics Domain using Text Mining

Junbeom Kim¹, Chae-Gyun Lim¹, Sung Suk Kim¹, Dukyong Yoon², Rae-Woong Park², Ho-Jin Choi¹

¹Department of Computer Science, Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Korea
²School of Medicine & Graduate School of Medicine, Ajou University, San 5 Woncheon-dong, Yeongtong-gu, Suwon 443-721, Korea

In this paper, we propose a new text-mining approach to information extraction and analysis of research domains to assist bioinformatics research. We combine the Term Frequency-Inverse Document Frequency (TF-IDF) method with reference link aggregation to derive useful information for analyzing the structures and relations of fields of interest. From the information derived from TF-IDF and reference link aggregation, useful connections and relations can be obtained that generate and reveal new information and knowledge. The results help researchers extract and find additional knowledge about related domains and fields. To show the usefulness of the proposed method, we demonstrate research domain clustering and the results derived from the clusters.
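The term-weighting step mentioned above can be sketched as follows. This is only one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the abstract does not state which variant the authors used, and the function name and toy token lists are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy "abstracts", already tokenized (illustrative data only).
docs = [
    ["gene", "expression", "cancer"],
    ["gene", "network", "protein"],
    ["text", "mining", "gene"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "gene" here, receives weight zero, while terms specific to one document receive the highest weights and therefore dominate any clustering built on these vectors.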

S4-3

ICD-9 Tobacco Use Codes are Effective Identifiers of Smoking Status

Laura K. Wiley¹,², Anushi Shah², Hua Xu², William S. Bush¹,²

¹Center for Human Genetics Research, ²Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA

With the increased development of clinic-based biorepositories, Electronic Medical Records (EMRs) are being used for genetic epidemiology research. These studies often require identification of and adjustment for clinical covariates, such as smoking status. Unfortunately, a patient’s smoking status is often difficult to extract from clinical text. The International Classification of Disease 9th Edition (ICD-9) contains two codes designating tobacco use - one for former and one for current use - but the reliability of these codes for classifying smoking status is often questioned due to their ambiguous use in clinical environments. In this study we evaluated the utility of these codes to identify ever-smokers in general and high smoking prevalence (lung cancer) clinic populations. We assessed potential biases in documentation, and performed temporal analysis relating transitions between smoking codes to smoking cessation attempts. We also examined the suitability of these codes for use in genetic association analyses. We establish that ICD-9 tobacco use codes can precisely identify smokers in a general clinic population (specificity = 1; sensitivity = 0.32), and that there is little evidence of documentation bias. Frequency of code transitions between “current” and “former” tobacco use is significantly correlated with initial success at smoking cessation (p<0.0001). Finally, we illustrate that code-based smoking status assignment is a comparable covariate to text-based smoking status for genetic association studies. Our results support the use of ICD-9 tobacco use codes for identifying smokers in a clinical population, and justify use of this derived status in genetic studies utilizing electronic health records.

S5-1

Extracting of Coordinated Patterns of DNA Methylation and Gene Expression in Ovarian Cancer

Je-Gun Joung¹,²,³, Dokyoon Kim¹,², Kyung Hwa Kim¹,², Ju Han Kim¹,²

¹Seoul National University Biomedical Informatics (SNUBI), Div. of Biomedical Informatics, ²Systems Biomedical Informatics National Core Research Center, ³Institute of Endemic Diseases, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul 110-799, Korea

DNA methylation, a regulator of gene expression, plays an important role in diverse biological processes including development, carcinogenesis and aging. In particular, aberrant DNA methylation has been widely observed in several types of cancer. It is currently important to extract disease-specific gene sets associated with the regulation of DNA methylation. Here we propose a novel approach to find the minimum regulatory units of genes, co-Methylated and co-Expressed Gene Pairs (MEGPs): gene pairs whose DNA methylation and gene expression are highly correlated, showing a co-regulatory relationship. To evaluate whether our method can extract disease-associated genes, we applied it to a large-scale dataset from The Cancer Genome Atlas, extracted significantly associated MEGPs and analyzed their functional correlation. We observed that many of our MEGPs physically interact with each other and show high semantic similarity in Gene Ontology terms. Furthermore, we performed gene set enrichment tests to identify how they are correlated in complex biological processes. Our MEGPs were highly enriched in biological pathways associated with ovarian cancer. Our approach can be useful for discovering coordinated epigenetic markers associated with specific diseases.

S5-2

Network Models of GWAS Uncover the Topological Centrality of Protein Interactions in Complex Disease Traits

Younghee Lee¹,², Haiquan Li¹,²,³, Jianrong Li¹,²,³, Ellen Rebman¹,³, Kelly Regan³, Eric R Gamazon², James L Chen¹,⁴, Xinan Yang¹,², Nancy J Cox¹,²,⁵, Yves A Lussier¹,²,⁴,⁵,⁶

¹Center for Biomedical Informatics and ²Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL 60637
³Department of Medicine, The University of Illinois at Chicago, Chicago, IL 60612
⁴Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL 60637
⁵Institute for Genomics and Systems Biology, and ⁶Computation Institute, The University of Chicago, Chicago, IL 60637

While Genome Wide Association Studies (GWAS) of complex traits have revealed thousands of reproducible genetic associations to date, these loci collectively account for very little of the heritability of their respective diseases and, in general, have contributed little to our understanding of the underlying disease biology. Physical protein interactions have been utilized to increase our understanding of human Mendelian disease loci but have yet to be fully exploited for complex traits. Here, we hypothesized that protein interaction modeling of GWAS findings could highlight important disease-associated loci and unveil the role of their network topology in the genetic architecture of diseases with complex inheritance. Network modeling of proteins associated with the intragenic SNPs of the NHGRI catalog of complex trait GWAS revealed that complex trait associated loci are more likely to be hub and bottleneck genes in available, albeit incomplete, networks (odds ratio = 1.59, FET P-value < 2.24×10⁻¹²). Network modeling also prioritized novel Type 2 Diabetes (T2D) genetic variations from the Finland-United States Investigation of NIDDM Genetics and the Wellcome Trust GWAS data, and demonstrated the enrichment of hubs and bottlenecks in prioritized T2D GWAS genes. The potential biological relevance of the T2D hub and bottleneck genes was revealed by their increased number of first-degree protein interactions with known T2D genes according to several independent sources (P-value < 0.01, probability of being first interactors of known T2D genes). Virtually all common diseases are complex human traits, and thus the topological centrality in protein networks of complex trait genes has implications in genetics, personal genomics, and therapy.
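The enrichment statistics quoted above (an odds ratio with an "FET" P-value, read here as Fisher's exact test) can be illustrated on a 2×2 contingency table. The sketch below is a minimal pure-Python version under stated assumptions: the counts are hypothetical, the function names are invented for illustration, and a one-sided ("greater") test is assumed because the abstract does not say which alternative was used.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio of the table [[a, b], [c, d]]; undefined if b * c == 0."""
    return (a * d) / (b * c)

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) with all margins held fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = comb(n, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / total

# Hypothetical counts:
#                     hub/bottleneck   other
# trait-associated          3            1
# not associated            1            3
print(odds_ratio(3, 1, 1, 3))
print(fisher_exact_greater(3, 1, 1, 3))
```

With such small counts the odds ratio is large but the P-value is not small; a P-value as extreme as the one reported requires a table built from many genes.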

S5-3

Identification of Multiple Gene-Gene Interactions for Ordinal Phenotypes

Kyunga Kim¹, Min-Seok Kwon², Sohee Oh³, Taesung Park²,³

¹Department of Statistics, Sookmyung Women’s University, South Korea
²Interdisciplinary Program in Bioinformatics, Seoul National University, South Korea
³Department of Statistics, Seoul National University, South Korea

Multifactor dimensionality reduction (MDR) is a powerful method for the analysis of gene-gene interactions and has been successfully applied to many genetic studies of complex diseases. However, the main application of MDR has been limited to binary traits, while traits with ordinal features are commonly observed in many genetic studies (e.g., obesity classification: normal, pre-obese, mildly obese and severely obese). We propose ordinal MDR (OMDR) to facilitate gene-gene interaction analysis for ordinal traits. As an alternative to balanced accuracy, we suggest using tau-b, a common ordinal association measure, to evaluate interactions. We also generalized cross-validation consistency (GCVC) to identify multiple best interactions. GCVC can be practically useful for analyzing complex traits, especially in large-scale genetic studies. In simulations, OMDR showed fairly good performance in terms of power, predictability and selection stability, and outperformed MDR. For demonstration, we used a real body mass index (BMI) dataset and scanned 1- to 4-way interactions for the ordinal and binary obesity traits derived from BMI via OMDR and MDR, respectively. In the real data analysis, more interactions were identified for the ordinal trait than for the binary trait. On average, the commonly identified interactions showed higher predictability for the ordinal trait than for the binary trait. The proposed OMDR and GCVC were implemented in a C/C++ program, executables of which are freely available for Linux, Windows and MacOS upon request for non-commercial research institutions.
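The ordinal association measure mentioned above, Kendall's tau-b, can be sketched in a few lines. This is a plain O(n²) illustration with a hypothetical function name and toy ordinal labels; it is not the authors' C/C++ implementation, and how OMDR derives predicted categories from genotype combinations is not shown.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length ordinal sequences."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied on both variables: ignored
            elif dx == 0:
                ties_x += 1           # tied on x only
            elif dy == 0:
                ties_y += 1           # tied on y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Toy example: observed obesity class (0 = normal ... 3 = severely obese)
# versus a coarser, hypothetical predicted class.
observed = [0, 1, 2, 3]
predicted = [0, 0, 1, 1]
print(kendall_tau_b(observed, predicted))
```

Unlike balanced accuracy, tau-b gives partial credit when a prediction preserves the ordering of two samples even if the exact categories differ, which is the property that makes it suitable for ordinal traits.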

S5-4

Key genes for modulating information flow play a temporal role as breast tumor coexpression networks are dynamically rewired by letrozole

Nadia M. Penrod¹,² and Jason H. Moore²,³

¹Department of Pharmacology and Toxicology, ²Department of Genetics, ³Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth College, Hanover, NH, USA

Genes do not act in isolation but instead as part of complex regulatory networks. To understand how breast tumors react to the presence of the drug letrozole, it is necessary to understand how the entire gene network changes as it is perturbed by the drug. Using transcriptomic data generated from sequential tumor biopsy samples, taken at diagnosis and following 10-14 days and 90 days on letrozole, we build temporal gene coexpression networks. Coexpression is determined by a pairwise partial correlation statistic. We find that the breast tumor network is in a continual state of flux, maintaining few relationships between time points. This means that the genes integral for maintaining network integrity and controlling information flow are dynamically changing as the network is rewired. By understanding how gene-gene relationships change in the presence of the drug letrozole, we can begin to understand causes of drug resistance.
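To illustrate the kind of statistic named above, the sketch below computes a first-order partial correlation between two genes while controlling for a single third gene. The abstract does not specify which variables are conditioned on, so this conditioning scheme is an assumption, and the function names and toy expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy expression values: genes x and y both track gene z plus unrelated noise.
z = [-2, -1, 0, 1, 2]
x = [-1, -3, 0, 3, 1]
y = [-1, -2, 0, 0, 3]

print(pearson(x, y))                 # clearly positive
print(partial_correlation(x, y, z))  # approximately zero
```

In this toy case the two genes appear coexpressed only because both follow the third gene; the partial correlation removes that indirect link, so an edge between them would not be drawn in the network.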

S6-1

Diplotyper: Diplotype-based Association Analysis

Sunshin Kim¹, KyungChae Park², Chol Shin³, Nam H Cho⁴, Jeong-Jae Ko¹, InSong Koh⁵, KyuBum Kwack¹

¹Department of Biomedical Science, College of Life Science, CHA University, Seongnam, Korea
²Department of Family Medicine, CHA Bundang Medical Center, CHA University, Seongnam, Korea
³Division of Pulmonary and Critical Care Medicine, Department of Internal Medicine, Korea University Ansan Hospital, Ansan, Korea
⁴Department of Preventive Medicine, Ajou University School of Medicine, Suwon, Korea
⁵Department of Physiology, College of Medicine, Hanyang University, Seoul, Korea

Diplotyper is a fully automated tool for performing association analysis based on diplotypes in a population. Diplotyper combines a novel algorithm designed to cluster haplotypes of interest from a given set of haplotypes with two existing tools: Haploview, for analyses of linkage disequilibrium blocks and haplotypes (with a frequency threshold of 1%), and PLINK, to generate all possible diplotypes from a given population sample and calculate linear or logistic regression. In addition, procedures for generating all possible diplotype groups from the haplotype groups and transforming these diplotypes into PLINK formats were implemented. Diplotyper was tested through association analysis of hepatic lipase (LIPC) gene polymorphisms or diplotypes and levels of high-density lipoprotein (HDL) cholesterol. This analysis identified considerably more significant signals than single-locus tests.

S6-2

Computational Studies of Post-translational Modifications

Zexian Liu¹, Jian Ren², Yu Xue³

¹University of Science and Technology of China, China
²Sun Yat-sen University, China
³Huazhong University of Science and Technology, China

Background: By modifying proteins temporally and spatially, post-translational modifications (PTMs) greatly expand proteome diversity and play critical roles in regulating biological processes. Identification of site-specific substrates is fundamental for understanding the molecular mechanisms and biological functions of PTMs, but it remains a great challenge given current technical limitations. The accumulation of experimental discoveries to date makes it feasible to develop computational tools for the prediction of PTMs.
Methods: To predict PTM sites, a previously developed GPS (Group-based Prediction System) algorithm was adopted and improved. Weight training and k-means clustering methods were introduced for the prediction of pupylation sites in prokaryotic proteins and tyrosine nitration sites, respectively. Beyond PTMs, the GPS algorithm was extended to predict I-Ag7 and HLA-DQ8 epitopes by combining it with a Gibbs sampling approach. The CPLA database was constructed from manually collected, experimentally identified lysine acetylation sites from the literature. The protein-protein interaction (PPI) information for construction of the protein network was collected from five major PPI databases.
Results: The GPS algorithm was improved and employed to implement a series of software tools to predict PTMs, including GPS-CCD, GPS-PUP and GPS-YNO2 for the prediction of calpain cleavage, pupylation and tyrosine nitration sites, respectively. Furthermore, the GPS algorithm was extended to develop the predictors GPS-MBA and GPS-ARM for the prediction of MHC class II epitopes and APC/C recognition motifs, respectively. With the predictive tools and the pipeline, we systematically compared the functional distribution and preference of S-nitrosylation and nitration. The functional diversity of D-box and KEN-box mediated APC/C recognition and degradation was also statistically analyzed. In addition, by integrating existing protein acetylome data, the human lysine acetylation network (HLAN) was modeled and demonstrated for the first time, and the triplet relationship among HAT-substrate-HDAC was proposed as the fundamental component of HLAN.

Conclusions: Taken together, since the developed computational tools provide helpful information conveniently, we anticipate that the combination of computational predictions and experimental verification will become the foundation for systematically understanding the mechanisms and dynamics of PTMs.

S6-3

The Efficiency of Spatial model in Assigning Protein Sequences to Protein Families

Hamid Pezeshk¹,³, Vahid Rezaei²,³

¹School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Iran.
²Faculty of Mathematical Science, Tarbiat Modares University, Tehran, Iran.
³Bioinformatics Research Group, School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

In this research we introduce a spatial model on a regular lattice, based on multiple sequence alignment (MSA), for the assignment of a protein sequence to a protein family. In this model, we assume that both the top and the bottom residues of each amino acid in a profile of aligned protein sequences contain useful information due to evolutionary relationships. We use the top twenty profiles in the Pfam database to assess the performance of our spatial model in assigning proteins to protein families. We then compare our model with the profile hidden Markov model (PHMM). The results show that using the spatial model considerably increases the accuracy of protein sequence assignments.

S6-4

Computational Approach for Protein Structure Prediction

Amouda Nizam¹, G. Jeyakodi¹, C. Manimozhi¹

¹Centre for Bioinformatics, Pondicherry University, India

Genetic algorithms (GAs) are used to solve difficult optimization problems over huge search spaces about which little is known, in various domains, and biology is no exception. Many variants of the standard GA (SGA) have been applied to complex problems such as protein structure prediction (PSP), which is recognized as an NP-hard problem in molecular biology. Unfortunately, SGA requires non-domain experts to pay special attention to choosing parameter values manually in order to reach a better solution. This research proposes a novel algorithm (SOGA) that blends self-organizing concepts with GA in order to automate the choice of appropriate parameter values. The proposed algorithm is developed with full knowledge of the problem (PSP), and the selection of the different parameters is based on the problem and on the fitness value obtained in each generation. SOGAPSP is validated by comparing the native and predicted structures of a protein. The minimal energy value of the predicted protein structure indicates the stability of the molecule. The Rampage server results confirm that the psi and phi angles of the predicted protein structure are feasible for the amino acid residues in the structure. The RMSD value indicates a conformation similar to the native structure of the protein. By self-organizing the genetic operators of GA, the proposed algorithm reduces the time required to optimize the parameter values and avoids premature convergence. Applying this algorithm to protein structure prediction achieved better results by self-organizing the crossover and mutation rates. Notably, no known structure is required to predict the unknown structure.
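The RMSD comparison between native and predicted structures mentioned above can be illustrated with a minimal sketch. It assumes the two structures have already been superposed and that atoms are paired in the same order; the abstract does not say how superposition was done, and the function name `rmsd` is purely illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: both lists describe the same atoms in the same order and
    the structures are already superposed (no alignment is done here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

A smaller value means the predicted conformation lies closer to the native one; in practice an optimal superposition step would normally precede this calculation.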

S7-1

Revealing Molecular Mechanism of Rare Mental Disorders

Zhe Zhang¹,², Shawn Witham¹, Margo Petukh¹, Gautier Moroy², Maria Miteva², Yoshihiko Ikeguchi³, Emil Alexov¹

¹Computational Biophysics and Bioinformatics, Department of Physics, Clemson University, Clemson, SC 29634, USA
²Universite Paris Diderot, Sorbonne Paris Cite, Molecules Therapeutiques In Silico, Inserm UMR-S 973, 35 rue Helene Brion, 75013 Paris, France
³Faculty of Pharmaceutical Sciences, Josai University, Japan

Intellectual disability (ID) is a disease characterized by significant limitations in cognitive abilities and social/behavioral adaptive skills. It is one of the primary reasons for pediatric, neurologic, and genetic referrals. In particular, approximately 10% of the protein-encoding genes on the X chromosome have been implicated in ID, and the corresponding ID is termed X-linked ID (XLID). Although the numbers of mutations and reported families are small and XLID is a rare disease, collectively the impact of XLID is significant, because patients can almost never fully participate in society. Here we report our findings on the effects of missense mutations on the wild-type properties of proteins and protein complexes involved in XLID. Using various in silico methods, we reveal the molecular mechanism of XLID for cases involving proteins with an available 3D structure. The 3D structures were used to predict the effect of disease-causing missense mutations on the folding free energy, conformational dynamics, hydrogen bond network and, where appropriate, protein binding free energy. It is shown that the vast majority of XLID mutation sites are outside the active pocket and are accessible from the water phase, which provides the opportunity to alter their effect by binding appropriate small molecules in the vicinity of the mutation site. This observation is used to demonstrate, computationally and experimentally, that a particular case, the G56S spermine synthase mutation that causes Snyder-Robinson syndrome, can be rescued by small molecule binding.

S7-2

Comparative Genomics Revealed General Evolutionary Trends of Insulin

Elbashir Abbas¹, Junbeom Kim¹, Yan Zhang², Luonen Chen², Ho-Jin Choi¹

¹Knowledge Engineering and Collective Intelligence Lab. (KECI), Dept. of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 305-701, Korea
²Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences, Shanghai 200233, China

Since its discovery, the hormone insulin has been associated with several diseases that plague man. The most famous of these is diabetes mellitus. As of last year, 346 million people worldwide had been diagnosed with diabetes. No permanent treatment exists, and 80% of deaths are due to an inability to obtain chronic treatment. Previous studies have not thoroughly attempted to identify the origins of insulin, and with recent discoveries and advances in available data it is now possible to perform such a study and determine the evolution of this peptide. In addition, comparative studies have identified an overlooked aspect of insulin that has not been thoroughly investigated, namely the new properties attributed to C-peptide, a subunit of the insulin precursor. In this paper we present a comparative study between vertebrates and invertebrates with regard to the insulin precursor and insulin receptor. Our goal is to determine the origins and evolution of insulin across these species. Phylogenetic trees were constructed to visualize and determine the level of conservation of proinsulin and C-peptide and their respective distribution across different vertebrates. We have determined that both vertebrates and invertebrates contain insulin or insulin-like proteins; however, their number may differ, the coding patterns differ and the physical composition of C-peptide differs. In addition, of the interacting insulin and insulin receptor residues found in both groups, some are conserved in both, but the majority are different. Further work is required to expand on the results acquired and add to the insights gained.

S7-3

An Information-Gain Approach to Detecting Three-Way Epistatic Interactions in Genetic Association Studies

Ting Hu¹, Yuanzhu Chen¹,², Jeff W. Kiralis¹, Ryan L. Collins¹, Christian Wejse³, Giorgio Sirugo⁴, Scott M. Williams¹,⁵ and Jason H. Moore¹,⁵

¹Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
²Department of Computer Science, Memorial University, St. John’s, NL, Canada
³Center for Global Health, School of Public Health, Aarhus University, Skejby, Denmark
⁴Centro di Genetica, Centro di Ricerca Scientifica, Ospedale San Pietro FBF, Rome, Italy
⁵Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, USA

Epistasis has historically been used to describe the phenomenon that the effect of a given gene on a phenotype can depend on one or more other genes, and it is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging owing to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies. In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis using information gain. It is able to separate all lower-order effects from pure three-way epistasis. Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis (TB) in a West African population. In the TB data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations. Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
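The idea of separating lower-order effects from pure three-way epistasis can be sketched with empirical entropies. The formulation below is an assumption, not taken from the abstract: it subtracts the three main effects and the three pairwise information gains from the joint mutual information between three loci and the phenotype. All function names are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def combine(*columns):
    """Treat several genotype columns as one joint variable."""
    return list(zip(*columns))


def ig_pair(a, b, y):
    """Pairwise information gain: joint effect minus the two main effects."""
    return (mutual_information(combine(a, b), y)
            - mutual_information(a, y) - mutual_information(b, y))


def ig_triple(a, b, c, y):
    """Three-way gain: joint effect minus all main and pairwise effects."""
    main = (mutual_information(a, y) + mutual_information(b, y)
            + mutual_information(c, y))
    pairs = ig_pair(a, b, y) + ig_pair(a, c, y) + ig_pair(b, c, y)
    return mutual_information(combine(a, b, c), y) - pairs - main
```

On a synthetic phenotype defined as the XOR of three binary loci, no single locus or pair carries information about the phenotype, so the whole signal is attributed to the three-way term.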

S7-4

Rare Variant Analysis Using Publically Available Biological Knowledge

Carrie B. Moore¹,², John R. Wallace², Alex T. Frase², Sarah A. Pendergrass², Marylyn D. Ritchie²

¹Center for Human Genetics Research, Vanderbilt University, Nashville, TN 37232, USA
²Center for Systems Genomics, Pennsylvania State University, University Park, PA 16802, USA

With the recent flood of genome sequence data, there has been increasing interest in rare variants and in methods to detect their association with disease. We developed a flexible collapsing method inspired by biological knowledge, called BioBin. We also built the Library of Knowledge Integration (LOKI), a repository of data assembled from public databases, which contains resources such as the National Center for Biotechnology Information (NCBI) dbSNP and Entrez Gene databases, Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, Gene Ontology (GO), Protein families database (Pfam), NetPath signal transduction pathways, Molecular INTeraction database (MINT), Biological General Repository for Interaction Datasets (BioGrid), Pharmacogenomics Knowledge Base (PharmGKB), Open Regulatory Annotation Database (ORegAnno), and information from the UCSC Genome Browser about evolutionarily conserved regions (ECRs). BioBin can apply multiple levels of burden testing, including functional regions, evolutionarily conserved regions, genes, and/or pathways. We tested BioBin using simulated data as well as low-coverage data from the 1000 Genomes Project to evaluate bins with simulated causative variants, and conducted a pairwise comparison of rare variant (MAF < 0.03) burden differences between Yoruba individuals (YRI) and individuals of European descent (CEU). Lastly, we analyzed the NHLBI GO Exome Sequencing Project Kabuki dataset, with sequence data from individuals with Kabuki syndrome, a congenital disorder that affects multiple organs and often involves intellectual disability, contrasted with 1000 Genomes data as controls. BioBin is proving to be a very useful and flexible tool for analyzing sequence data and uncovering novel associations with complex disease.
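The collapsing idea can be illustrated with a minimal sketch that is not BioBin's implementation: rare variants (here MAF < 0.03, the threshold quoted above) are grouped into gene-level bins and the alternate-allele counts are summed per sample. The input layout and the name `bin_burden` are assumptions made for illustration.

```python
from collections import defaultdict


def bin_burden(genotypes, variant_gene, variant_maf, maf_threshold=0.03):
    """Sum rare alternate alleles per gene bin for every sample.

    genotypes:    {sample: {variant_id: alternate allele count (0, 1 or 2)}}
    variant_gene: {variant_id: gene name}  (hypothetical annotation lookup)
    variant_maf:  {variant_id: minor allele frequency}
    """
    burden = {}
    for sample, calls in genotypes.items():
        per_bin = defaultdict(int)
        for variant, alt_count in calls.items():
            # Variants without a known MAF are treated as common and skipped.
            is_rare = variant_maf.get(variant, 1.0) < maf_threshold
            if is_rare and variant in variant_gene:
                per_bin[variant_gene[variant]] += alt_count
        burden[sample] = dict(per_bin)
    return burden
```

The same loop would work for pathway or conserved-region bins by swapping the `variant_gene` lookup for a different annotation source; the per-bin counts can then be compared between cases and controls with a burden test of choice.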

S8-1

Personalized Chemotherapy for Ovarian Cancer by Integrating Genomic Data with Clinical Data

Youngchul Kim¹, Kian Behbakht², Jennifer R. Diamond², Dan Theodorescu², Jae K. Lee¹

¹Department of Public Health Sciences, University of Virginia, PO Box 800717, Charlottesville, VA 22908, USA
²University of Colorado Cancer Center, University of Colorado Denver, Box 8117, Aurora, CO 80045, USA

Despite multiple standard chemotherapy drugs and novel agents, the overall therapeutic response of patients with advanced Epithelial Ovarian Cancer (EOC) has been stagnant over the last two decades. Aggressive tumors such as EOC are highly heterogeneous in their therapeutic responses, so overall therapeutic responses are unlikely to improve much if drugs are used without patient selection. Previous biomarker studies of drug response were limited because it was difficult to develop single-drug predictors based on patients treated with multiple drugs. Additionally, outcomes were often confounded with factors other than the given therapies. By directly combining patients’ therapeutic outcome information with the COXEN algorithm based on each drug’s cell line activity data, we have developed integrated predictors for three standard chemotherapy drugs used to treat EOC: paclitaxel, cyclophosphamide, and topotecan. Our integrated COXEN predictors for the three drugs showed high predictive performance for both patients’ short-term therapeutic responses and long-term survival outcomes. In particular, when the three drug predictors were applied hypothetically to a historical patient cohort, overall survival and progression-free survival of the cohort would have been prolonged by more than one year and by more than five months, respectively. For patients with recurrent disease, overall survival was improved by more than 21 months. Although the current study demonstrates only analytic potential, because sample sizes were too small for rigorous evaluation of some of these predictors, it shows that overall therapeutic response and outcome could possibly be improved dramatically by optimally using these integrated predictors for individual patients with EOC.

S8-2

The Role of Genetic Heterogeneity and Epistasis in Bladder Cancer Susceptibility and Outcome: A Learning Classifier System Approach

Ryan J. Urbanowicz¹, Angeline S. Andrew¹, Margaret R. Karagas¹, Jason H. Moore¹

¹Geisel School of Medicine, Dartmouth College, 1 Medical Center Dr., Lebanon, NH 03756

Detecting complex patterns of association between genetic or environmental risk factors and disease risk has become an important target for epidemiological research. In particular, strategies that accommodate multifactor interactions or heterogeneous patterns of association can offer new insights in association studies wherein traditional analytic tools have had limited success. In an effort to concurrently address these phenomena, previous work has successfully considered the application of learning classifier systems (LCSs), a flexible class of evolutionary algorithms that distributes learned associations over a population of rules. Subsequent work addressed the inherent problems of knowledge discovery and interpretation within these algorithms, allowing for the characterization of heterogeneous patterns of association. While these previous advancements were evaluated using complex simulation studies, this study applied these collective works to a real world genetic epidemiology study of bladder cancer susceptibility. Notably, we replicated the identification of previously characterized factors that modify bladder cancer risk: i.e. single nucleotide polymorphisms (SNPs) from a DNA repair gene, and smoking. Furthermore, we identified potentially heterogeneous groups of subjects characterized by distinct patterns of association. Cox proportional hazard models comparing clinical outcome variables between the cases of the two largest groups yielded a significant, meaningful difference in survivorship. A marginally significant difference in time to recurrence was also noted. These results support the hypothesis that an LCS approach can offer greater insight into complex patterns of association. This methodology appears to be well suited to the dissection of disease heterogeneity, a key component in the advancement of personalized medicine.

S8-3

Multiclass cancer classification using gene expression comparisons

Sitan Yang¹ and Daniel Q. Naiman²

¹,²Applied Mathematics and Statistics Department, Johns Hopkins University, Baltimore, Maryland 21218, USA

As our knowledge of cancer has grown, its heterogeneous nature has become increasingly apparent, and there has been an accompanying tendency to identify and differentiate various cancer subtypes. In this situation, microarray-based cancer classification poses new methodological and computational challenges, and the identification of novel and effective approaches to multiclass classification deserves greater attention. While cancer classification has achieved considerable success in binary problems, the situation for multiclass problems is not as clear. In this paper, we introduce a new approach to multiclass cancer diagnosis based on gene expression profiles. Our method focuses on detecting a small set of genes whose expression levels have significant changes relative to each other from class to class. For a k-class problem, the decision rule only depends on the relative orderings of expression values of k genes and is transparent enough to be immediately explored for biological discoveries. We demonstrate on five cancer datasets that our method, while simple, is as powerful as many popular but complex classifiers. Furthermore, we show that the decision rules built on these datasets involve some informative genes that are known to have biological relevance for some cancer types, which may help us understand their potential mechanisms.
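A decision rule that depends only on the relative ordering of k genes can be sketched as follows. This is one simple rule of that kind and is an assumption, not necessarily the authors' exact rule: each class is paired with one marker gene, and a sample is assigned to the class whose marker gene has the highest expression among the k. Gene and class names are hypothetical.

```python
def predict_class(expression, gene_for_class):
    """Assign the class whose marker gene is the most highly expressed.

    expression:     {gene: expression value} for one sample
    gene_for_class: {class label: marker gene}, one gene per class (k genes)
    Only the ordering of the k values matters, not their magnitudes.
    """
    return max(gene_for_class, key=lambda label: expression[gene_for_class[label]])


# Hypothetical three-class example.
markers = {"subtype_A": "GENE1", "subtype_B": "GENE2", "subtype_C": "GENE3"}
sample = {"GENE1": 5.2, "GENE2": 9.7, "GENE3": 3.1}
label = predict_class(sample, markers)
```

Because the rule uses only within-sample comparisons, it is unchanged by any monotone rescaling of a sample's expression values, which is part of what makes such rules easy to inspect.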

S8-4

Curation-Free Biomodules Mechanisms in Prostate Cancer Predict Recurrent Disease

James L. Chen¹, Alexander Hsu¹,², Xinan Yang¹, Jianrong Li², Gurunadh Parinandi², Haiquan Li², Yves A. Lussier¹,²,³

¹Ctr for Biomed. Informatics and Dept. of Medicine, The University of Chicago, Chicago, IL
²Depts of Medicine & of Bioengineering, University of Illinois at Chicago, Chicago, IL
³University of Illinois Hospital and Health Science System

Motivation: Gene expression-based prostate cancer gene signatures of poor prognosis are hampered by lack of gene feature reproducibility and a lack of understandability of their function. Molecular pathway-level mechanisms are intrinsically more stable and more robust than an individual gene. The Functional Analysis of Individual Microarray Expression (FAIME) we developed allows distinctive sample-level pathway measurements with utility for correlation with continuous phenotypes (e.g. survival). Further, we and others have previously demonstrated that pathway-level classifiers can be as accurate as gene-level classifiers using curated genesets that may implicitly comprise ascertainment biases (e.g. KEGG, GO). Here, we hypothesized that transformation of individual prostate cancer patient gene expression to pathway-level mechanisms derived from automated high throughput analyses of genomic datasets may also permit personalized pathway analysis and improve prognosis of recurrent disease.

Results: Via FAIME, three independent prostate cancer gene expression arrays with both normal and tumor samples were transformed into two distinct types of molecular pathway mechanisms, which were then compared: (i) the curated Gene Ontology (GO) and (ii) dynamic expression activity networks of cancer (Cancer Modules). FAIME-derived mechanisms for tumorigenesis were then identified. Curated GO and computationally generated “Cancer Module” mechanisms overlap significantly, are enriched for known oncogenic deregulations, and highlight potential areas of investigation. We further show in two independent datasets that these pathway-level tumorigenesis mechanisms can identify men who are more likely to develop recurrent prostate cancer (log-rank p = 0.019 and 0.04, respectively).

S9-1

Comparison and Validation of Genomic Predictors for Anticancer Drug Sensitivity

Simon Papillon-Cavanagh¹, Nicolas De Jay¹, Nehme Hachem¹, Catharina Olsen², Gianluca Bontempi², Hugo Aerts³, John Quackenbush⁴, Benjamin Haibe-Kains¹

¹Bioinformatics and Computational Genomics Laboratory, Institut de recherches cliniques de Montreal, University of Montreal, Montreal, Quebec, Canada
²Machine Learning Group, Universite Libre de Bruxelles, Bruxelles, Belgium
³Department of Radiation Oncology and ⁴Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard University, Boston, MA, USA

An enduring challenge in personalized medicine lies in selecting the right drug for each individual patient. While direct testing of drugs on patients is the only way to assess their clinical efficacy and toxicity, we dramatically lack the resources to test the hundreds of drugs that are currently under development. The use of preclinical model systems has therefore been intensively investigated, because this approach makes it possible to test the response to hundreds of drugs in multiple cell lines in parallel. Recently, two large-scale pharmacogenomic studies screened multiple anticancer drugs on more than 1000 cell lines. Here we propose to combine these datasets to build and robustly validate genomic predictors of drug response. We compared five different approaches of increasing complexity for building predictors. We assessed their performance in cross-validation and in two large validation sets, one containing the same cell lines present in the training set and another composed of cell lines that were never used during the training phase. Sixteen drugs were found in common between the datasets. We were able to validate multivariate predictors for four of the sixteen tested drugs, namely Irinotecan, PD-0325901, PLX4720 and Lapatinib. Moreover, we observed that response to 17-AAG, an inhibitor of Hsp90, could be efficiently predicted by the expression level of a single gene, NQO1. Altogether, these results suggest that predictors can be robustly validated for specific drugs. If successfully validated in patients’ tumor cells, and subsequently in clinical trials, they could act as companion tests for the corresponding drugs and play an important role in personalized medicine.

S9-2

Improve Binding Affinity by Twin Adhesive Drugs Mined in-between Docking Bio-mimicry Omega-shape Nona-peptide Agretope on HLA-1 Pit

Chun-Fan Chang¹, Chen-Chieh Fan²,³, Hsueh-Ting Chu⁴ and Cheng-Yan Kao²

¹Department of Animal Science, Chinese Culture University, Taipei 11114, Taiwan;
²Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan; and ³ENT Division, National Taiwan University Hospital, Taipei 10002, Taiwan.
⁴Department of Computer Science and Information Engineering, Asia University, Taichung 41354, Taiwan.

Motivation: The oncogenesis process of nasopharyngeal carcinoma (NPC) may confer a proliferation advantage and immune evasion in overcoming efficient host immune clearance mechanisms against Epstein Barr virus (EBV). The proliferation advantage is likely to come from encoding EBV latent infection phase membrane protein 1 (LMP1), and the immune evasion is likely to come from mutating the EBV genome for poor immune reactivity at AMI-antigen epitopes and CMI-antigen epitopes/agretopes of LMP1/LMP2 and EBNA upon class I human leukocyte antigen (HLA-1). In this work, we developed a structure-based immunoinformatic tool for designing EBV-LMP1-related omega-shape nona-peptides (LMP1np) for docking to the HLA-1 pit, with the aim of mining twin adhesive drugs (TAD) with improved binding affinity (BAff).

Results: Our implemented bio-mimicry peptide design algorithm tool (bmPDA tool) designs nona-peptide structures with a bulge-side epitope and an anchor-side agretope from LMP-1 and NLMP-1 segments for docking to HLA-1 of A*0201 and A*0207. The design efficiency of bio-mimicry peptides by the bmPDA tool is demonstrated with a preliminary reference nona-peptide structure of the vasopressin protein. The binding affinity (BAff) between the putative agretope and the verified HLA-1 pit shows notable weakening, likely for immune evasion, in the cases of A*0207 and NLMP1 at initial amino acid positions 32, 35, 86, 92, 125, 147, and 166. For these cases, our algorithm mines twin adhesive drugs (TAD) from the list of FDA-approved drugs, exemplified by Nizatidine, Benzonatate, Entecavir, Famotidine, and Alprostadil, for improving BAff between the A*0207 pit and the weak agretope of NLMP1np structures.

S9-3

Altering Physiological Networks using Drugs: Steps towards Personalized Physiology

Adam D Grossman, PhD¹, Mitchell J Cohen, MD², Geoffrey T Manley, MD, PhD³, Atul J Butte, MD, PhD⁴

¹Department of Bioengineering, Stanford University, Stanford, CA, USA
²Department of Surgery, University of California San Francisco, San Francisco, CA, USA
³Department of Neurosurgery, University of California San Francisco, San Francisco, CA, USA
⁴Department of Pediatrics and the Department of Medicine, Stanford University School of Medicine, Stanford, CA, and Lucile Packard Children's Hospital, Palo Alto, CA, USA.

The rise of personalized medicine has reminded us that each patient must be treated as an individual. One factor in making treatment decisions is the physiological state of each patient, but definitions of relevant states and methods to visualize state-related physiologic changes are scarce. We constructed correlation networks from physiologic data to demonstrate changes associated with pressor use in the intensive care unit. We collected 29 physiological variables at one-minute intervals from nineteen trauma patients in the intensive care unit of an academic hospital and grouped each minute of data as receiving or not receiving pressors. For each group we constructed Spearman correlation networks of pairs of physiologic variables. To visualize drug-associated changes we split the networks into three components: an unchanging network, a network of connections with changing correlation sign, and a network of connections only present in one group. Out of a possible 406 connections between the 29 physiological measures, 64, 39, and 48 were present in each of the three component networks. The static network confirms expected physiological relationships while the network of associations with changed correlation sign suggests putative changes due to the drugs. The network of associations present only with pressors suggests new relationships that could be worthy of study. We demonstrated that visualizing physiological relationships using correlation networks provides insight into underlying physiologic states while also showing that many of these relationships change when the state is defined by the presence of drugs. This method applied to targeted experiments could change the way critical care patients are monitored and treated.
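The network construction and three-way split described above can be sketched in plain Python. The abstract does not state how an edge is judged to be present, so the absolute-correlation cutoff `threshold` is an assumption, as are all function and variable names. With 29 variables there are 29 × 28 / 2 = 406 possible pairs, matching the figure quoted above.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def network(data, threshold):
    """data: {variable: [values]} -> {(var_a, var_b): rho} with |rho| >= threshold."""
    edges = {}
    for a, b in combinations(sorted(data), 2):
        rho = spearman(data[a], data[b])
        if abs(rho) >= threshold:
            edges[(a, b)] = rho
    return edges


def split_networks(net_off, net_on):
    """Return (unchanged, sign_changed, one_group_only) edge sets."""
    shared = net_off.keys() & net_on.keys()
    unchanged = {e for e in shared if net_off[e] * net_on[e] > 0}
    sign_changed = shared - unchanged
    one_group_only = net_off.keys() ^ net_on.keys()
    return unchanged, sign_changed, one_group_only
```

Here `net_off` and `net_on` would be built from the minutes without and with pressors, respectively; a significance test on each correlation could replace the simple cutoff.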

S9-4

Compensating for Literature Annotation Bias when Predicting Novel Drug-Disease Relationships through Medical Subject Heading Over-representation Profile (MeSHOP) Similarity

Warren A. Cheung¹,², BF Francis Ouellette³,⁴, Wyeth W. Wasserman¹,⁵

¹Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, University of British Columbia, Vancouver, BC, Canada, ²Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada, ³Ontario Institute for Cancer Research, Toronto, ON, Canada, ⁴Department of Cells and Systems Biology, University of Toronto, Toronto, ON, Canada, ⁵Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada

Medical Subject Heading Overrepresentation Profiles (MeSHOPs) quantitatively summarise the literature associated with biological entities such as diseases or drugs. A profile is constructed by counting the number of times each MeSH term is assigned to an entity-related research publication in the MEDLINE/PUBMED database and calculating the significance of the count relative to a background expectation. Based on the expectation that drugs suitable for treatment of a disease (or disease symptom) will have similar annotation properties to the disease, we successfully predict drug-disease associations by comparing MeSHOPs of diseases and drugs. The MeSHOP comparison approach delivers an 11% improvement over bibliometric baselines. However, novel drug-disease associations are observed to be biased towards drugs and diseases with more publications. To account for the annotation biases, a correction procedure is introduced and evaluated. By explicitly accounting for the annotation bias, unexpectedly similar drug-disease pairs are highlighted as candidates for drug repositioning research.
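The profile construction described above (count each MeSH term for an entity, then score the count against a background expectation) can be sketched as follows. The abstract does not name the statistical test, so the one-sided hypergeometric tail used here is an assumption; the data layout and function names are also illustrative.

```python
from collections import Counter
from math import comb


def overrepresentation_p(k, n, K, N):
    """P(X >= k) when n articles are drawn from N, of which K carry the term."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def build_profile(entity_pmids, term_index):
    """Map each MeSH term seen for the entity to an over-representation p-value.

    entity_pmids: article identifiers linked to the drug or disease
    term_index:   {pmid: set of MeSH terms} for the whole background corpus
    """
    total_articles = len(term_index)
    entity_articles = len(entity_pmids)
    background = Counter(t for terms in term_index.values() for t in terms)
    observed = Counter(t for pmid in entity_pmids for t in term_index[pmid])
    return {
        term: overrepresentation_p(observed[term], entity_articles,
                                   background[term], total_articles)
        for term in observed
    }
```

Two such profiles, one for a drug and one for a disease, could then be compared with any similarity measure over their shared terms; the choice of measure is left open here.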

S10-1

Detection of Pleiotropy through a Phenome-Wide Association Study (PheWAS) in the National Health and Nutrition Examination Surveys (NHANES)

M.A. Hall¹, A. Verma¹, K.D. Brown-Gentry², R. Goodloe², J. Boston², S. Wilson², B. McClellan², C. Sutcliffe², H.H. Dilks²,³, N.B. Gillani², H. Jin², P. Mayo², M. Allen², N. Schnetz-Boutaud², D.C. Crawford²,³, M.D. Ritchie¹, S.A. Pendergrass¹

¹Center for Systems Genomics, Department of Biochemistry and Molecular Biology, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA;
²Center for Human Genetics Research, ³Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville TN, USA

Herein we describe the results of a phenome-wide association study (PheWAS) utilizing the diverse genotypic and phenotypic data that exist for multiple race-ethnicities in the National Health and Nutrition Examination Surveys (NHANES), conducted by the Centers for Disease Control and Prevention (CDC) and accessed by the Epidemiological Architecture for Genes Linked to Environment (EAGLE) study. PheWAS is a novel approach for discovering the complex mechanisms involved in human disease by testing SNPs for association with a large and diverse set of phenotypes. Comprehensive unadjusted tests of association were performed in NHANES III and NHANES 1999-2002 for 575 SNPs with 1009 phenotypes, stratified by race-ethnicity. We identified 51 PheWAS associations that were consistent between the two surveys for the same SNP, phenotype-class, direction of effect, and race-ethnicity with p < 0.01, allele frequency > 0.01, and sample size > 200. Of these, 28 replicated previously reported SNP-phenotype associations, 9 were related to previously reported associations in the literature, and 14 were novel SNP-phenotype associations. We also identified SNPs associated with multiple novel phenotypes. These results demonstrate the utility of phenome-wide association studies for exploring associations between genetic variation and phenotypic variation in a high-throughput and comprehensive manner using existing epidemiologic study data. The results of PheWAS promise to expose more of the genetic architecture underlying multiple traits and to generate hypotheses about pleiotropic interactions for future research.
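The cross-survey consistency criteria listed above (same SNP, phenotype-class, race-ethnicity and direction of effect, with p < 0.01, allele frequency > 0.01 and sample size > 200 in both surveys) can be expressed as a small filter. The record layout and field names below are hypothetical, and the sign of an effect estimate `beta` stands in for direction of effect.

```python
def passes(result, p_max=0.01, maf_min=0.01, n_min=200):
    """Per-survey thresholds quoted in the abstract."""
    return result["p"] < p_max and result["maf"] > maf_min and result["n"] > n_min


def replicated(survey_a, survey_b):
    """Pairs of results that agree across two surveys.

    Each result is a dict with the (hypothetical) keys:
    snp, phenotype_class, race_ethnicity, beta, p, maf, n.
    """
    def key(r):
        return (r["snp"], r["phenotype_class"], r["race_ethnicity"])

    index_b = {}
    for r in survey_b:
        if passes(r):
            index_b.setdefault(key(r), []).append(r)

    hits = []
    for r in survey_a:
        if not passes(r):
            continue
        for other in index_b.get(key(r), []):
            # Same direction of effect: both estimates have the same sign.
            if (r["beta"] > 0) == (other["beta"] > 0):
                hits.append((r, other))
    return hits
```

Matching on phenotype class rather than on the exact phenotype allows related measurements from the two surveys to count as the same finding, which mirrors the criteria stated in the abstract.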

S10-2

Analysis of Type 2 Diabetes GWAS Dataset using Expanded Gene Set Enrichment Analysis and Protein-Protein Interaction Network

Chiyong Kang¹, Hyeji Yu¹, Gwan-Su Yi¹

¹Department of Bio and Brain Engineering, KAIST, Daejeon 305701, Korea

Genome-wide association studies (GWAS) have identified approximately 40 type 2 diabetes (T2D) associated SNPs. However, only a small fraction of the T2D genetic risk is explained by the identified SNPs. While pathway enrichment analysis that considers multiple SNPs has been suggested as a way to reveal the mechanisms of complex diseases, pathway gene sets cover only a small portion of human genes. For a better understanding of the biological mechanisms of T2D and for T2D causal gene detection, we propose enrichment analysis with expanded gene sets and mapping of GWAS-based T2D-associated genes onto a protein-protein interaction (PPI) network. Gene set enrichment analysis (GSEA) is applied to the WTCCC T2D GWAS dataset with expanded gene sets including pathway, function, TF-target, miRNA-target and complex. From the expanded GSEA, 451 T2D-associated gene sets are detected with p-value < 0.05, and 441 of these 451 gene sets contain known T2D genes. To find novel T2D gene candidates, 64 GWAS-based T2D-associated genes, derived from 2,960 SNPs with a p-value threshold of 0.05 in the WTCCC T2D GWAS dataset, are mapped onto an integrated PPI network, and a total of 24 novel T2D gene candidates are detected. Among the detected candidates, GBR2 is the gene most strongly associated with T2D. Expanded GSEA and PPI mapping of GWAS-based T2D-associated genes showed the potential to provide insights into T2D mechanisms and to detect novel T2D gene candidates.

S10-3

Integrative Analysis of Congenital Muscular Torticollis: from Gene Expression to Clinical Indication

Shin-Young Yim, MD, PhD¹, Dukyong Yoon, MD, MS², Myong Chul Park, MD, PhD³, Il Jae Lee, MD, PhD³, Jang-Hee Kim, MD, MS⁴, Myung Ae Lee, PhD⁵, Kyu-Sung Kwack, MD, PhD⁶, Jan-Dee Lee, MD, PhD⁷, Euy-Young Soh, MD, PhD⁸, Young-In Na, MS⁹, Rae Woong Park, MD, PhD², KiYoung Lee, PhD², and Jae-Bum Jun, MD, PhD⁹

¹The Center for Torticollis, Department of Physical Medicine and Rehabilitation, Ajou University School of Medicine, Suwon, Republic of Korea
²Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
³Department of Plastic and Reconstructive Surgery, Ajou University School of Medicine, Suwon, Republic of Korea
⁴Department of Pathology, Ajou University School of Medicine, Suwon, Republic of Korea
⁵Brain Disease Research Center, Ajou University School of Medicine, Suwon, Republic of Korea
⁶Department of Radiology, Ajou University School of Medicine, Suwon, Republic of Korea
⁷Department of Surgery, Eulji General Hospital, Seoul, Republic of Korea
⁸Department of Surgery, Ajou University School of Medicine, Suwon, Republic of Korea
⁹Department of Rheumatology, The Hospital for Rheumatic Diseases, Hanyang University College of Medicine, Seoul, Republic of Korea

Congenital muscular torticollis (CMT) is characterized by thickening and/or tightness of the unilateral sternocleidomastoid muscle (SCM), resulting in torticollis. Our aim was to discover differentially expressed genes (DEGs) and novel protein interaction network modules of CMT, and to examine the relationship between gene expression and either the clinical severity of CMT or the expression of proteins encoded by the DEGs. Twenty-three SCMs from CMT patients and 5 normal SCMs were allocated to microarray, MRI, or immunohistochemical studies. We identified 269 genes as DEGs in CMT. Gene ontology enrichment analysis revealed that the DEGs function mainly in the extracellular region part during developmental processes. Five CMT-related protein network modules were identified, which showed that the key pathway is fibrosis related to collagen and elastin fibrillogenesis, with evidence of a DNA repair mechanism. The expression levels of some meaningful DEGs correlated well with the preoperative MRI color intensities of CMT, which indicate clinical severity. Moreover, the expression of proteins encoded by the DEGs confirmed the differential gene expression in CMT. We provide an integrative analysis of CMT from gene expression to clinical indication, which showed good correlation with the clinical severity of CMT. Furthermore, the CMT-related protein network modules that were identified provide a more in-depth understanding of the pathophysiology of CMT.

S10-4

Detecting early-warning signals of type 1 diabetes and its leading biomolecular networks by dynamical network biomarkers

Xiaoping Liu¹,², Rui Liu³,⁴, Xing-Ming Zhao², Luonan Chen¹,²,⁴

¹Key Laboratory of Systems Biology, SIBS-Novo Nordisk Translational Research Centre for PreDiabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China;
²Institute of Systems Biology, Shanghai University, Shanghai 200444, China;
³Department of Mathematics, South China University of Technology, Guangzhou 510640, China;
⁴Collaborative Research Center for Innovative Mathematical Modelling, Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan

Type 1 diabetes is a complex disease that is harmful to human health, and most existing biomarkers measure the disease phenotype only after disease onset (or drastic deterioration). Until now, there has been no effective biomarker that can predict the upcoming disease (or pre-disease state) before disease onset or deterioration. Further, the detailed molecular mechanism of such deterioration, e.g., the driver genes or causal network of the disease, is still unclear. In this study, we detected early-warning signals of type 1 diabetes and its leading biomolecular networks based on serial gene expression profiles of NOD mice by identifying a new type of biomarker, i.e., dynamical network biomarkers (DNBs), which form a specific module marking the time period just before the drastic deterioration of type 1 diabetes. Specifically, two dynamical network biomarkers were obtained that signal the emergence of two critical deteriorations of the disease and could be used to predict the upcoming sudden changes during disease progression. We found that the two critical transitions led to peri-insulitis and hyperglycemia in NOD mice, which is consistent with the experimental results. Hence, the identified dynamical network biomarkers can be used to detect the early-warning signals of type 1 diabetes and to predict upcoming disease onset before the drastic deterioration. In addition, we demonstrated that the leading biomolecular networks are causally related to the initiation and progression of type 1 diabetes and provide biological insight into its molecular mechanism. Experimental data and functional analysis of the DNBs validated the computational results.
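
The abstract does not state the criterion used to identify a DNB module. The sketch below assumes one commonly cited form of a composite index, in which a candidate module scores highly when its genes fluctuate strongly, correlate strongly with each other, and correlate weakly with genes outside the module. The function names, the `eps` guard, and the toy expression values are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def dnb_index(expr, module, eps=1e-9):
    """Assumed composite index SD_in * PCC_in / PCC_out for one time point.

    expr   : dict mapping gene -> expression values across samples
    module : set of candidate DNB genes (at least two, not all genes)
    """
    inside = [g for g in expr if g in module]
    outside = [g for g in expr if g not in module]
    # Average fluctuation of the module genes.
    sd_in = mean(pstdev(expr[g]) for g in inside)
    # Average absolute correlation within the module.
    pcc_in = mean(
        abs(pearson(expr[a], expr[b])) for a, b in combinations(inside, 2)
    )
    # Average absolute correlation between module and non-module genes.
    pcc_out = mean(
        abs(pearson(expr[a], expr[b])) for a in inside for b in outside
    )
    return sd_in * pcc_in / (pcc_out + eps)


if __name__ == "__main__":
    # Hypothetical toy profiles: A and B move together, C does not.
    expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [5, 1, 4, 2]}
    print(dnb_index(expr, {"A", "B"}))
    print(dnb_index(expr, {"A", "C"}))
```

Under this assumption, the index would be computed for each time point in the serial profiles, and a sharp rise for some module would be read as an early-warning signal of an approaching critical transition.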

Creating subnetworks from transcriptomic data on central nervous system conditions informed by a massive transcriptomic network.

Yaping Feng¹, Judith A. Syrkin-Nikolau², Eve S. Wurtele¹

¹Iowa State University, Department of Genetics, Development and Cell Biology, Ames, IA 50011, USA, ²Macalester College, MN, 55105

We use a human pairwise co-expression matrix derived from a large dataset (>18,000 samples) of high-quality, publicly available transcriptomic data representing relationships in gene expression across a diverse set of biological conditions (1) as a context network to explore CNS transcriptomics. In one approach, we derive a network from within the CNS samples, derive gene clusters, and compare the significance of these to the clusters derived from the larger network. In the second approach, we identify genes that characterize individual subsets of samples from within a disease condition. Specifically, differences in gene expression within and between two designations of glial cancer, astrocytoma and glioblastoma, are evaluated in the context of the broader network. Such related groups of genes, termed outlier-networks, tease out abnormally expressed genes and the particular samples they are associated with. This study identifies a set of 48 subnetworks of outlier genes belonging to astrocytoma and glioblastoma. As a case study, we investigate the relationships among the genes of a small astrocytoma-only subnetwork.