Evaluation methods for semantic similarity measures

SS : semantic similarity between genes or gene sets

LordPW :

Semantic similarity measures as tools for exploring the gene ontology.
Lord PW, Stevens RD, Brass A, Goble CA
Pac Symp Biocomput p601-12 (2003)

SS vs Sequence Similarity (BLAST result)




Correlation coefficients between BLAST bit scores and semantic similarity.

Aspect              Resnik  Lin    Jiang
Molecular Function  0.577   0.541  -0.483
Biological Process  0.280   0.303  -0.312
Cellular Component  0.368   0.452  -0.414

Correlation coefficients for semantic similarity scores over different aspects of GO.

Aspect                                   Resnik  Lin    Jiang
Molecular Function - Cellular Component  0.290   0.318  0.087
Molecular Function - Biological Process  0.219   0.244  0.269
Biological Process - Cellular Component  0.202   0.175  0.166

Against BLAST bit scores, the Resnik measure shows the highest correlation for molecular function 
(0.577) and the lowest absolute correlation for the other two aspects (0.280 and 0.368), so it 
may be the most discriminatory.
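
The three measures compared above are usually defined in terms of information content (IC). The following is a minimal sketch of those standard definitions, not the code used by Lord et al.; the function names and the annotation counts are hypothetical. It also suggests why the Jiang column is negative: Jiang is a distance, so it falls as similarity rises.

```python
import math

def information_content(term_count, total_count):
    """IC(t) = -log p(t), where p(t) is the annotation frequency of term t."""
    return -math.log(term_count / total_count)

def resnik(ic_mica):
    """Resnik similarity: IC of the most informative common ancestor (MICA)."""
    return ic_mica

def lin(ic_a, ic_b, ic_mica):
    """Lin similarity: shared IC normalised by the IC of both terms."""
    denom = ic_a + ic_b
    return 2.0 * ic_mica / denom if denom > 0 else 0.0

def jiang_distance(ic_a, ic_b, ic_mica):
    """Jiang-Conrath distance: 0 for identical terms, larger when less related."""
    return ic_a + ic_b - 2.0 * ic_mica

# Hypothetical counts: terms a and b and their MICA, out of 1000 annotations.
ic_a = information_content(10, 1000)
ic_b = information_content(20, 1000)
ic_mica = information_content(100, 1000)

print(resnik(ic_mica), lin(ic_a, ic_b, ic_mica), jiang_distance(ic_a, ic_b, ic_mica))
```

Resnik is unbounded above, Lin is scaled to the range 0 to 1, and the Jiang distance must be inverted or negated before it can be compared directly with the other two.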

RubioA :

Correlation between gene expression and GO semantic similarity.
Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martínez-Cruz LA, Corrales FJ, Rubio A
IEEE/ACM Trans Comput Biol Bioinform 2(4) p330-8 (2005 Oct-Dec) 10.1109/TCBB.2005.50

SS vs Gene Expression

  1. Marsha dataset (5 samples, 907600 expression levels in total)
  2. RAD dataset (89 samples, 893827 expression levels in total)



Correlation coefficients between Gene Expression Correlation and Semantic Similarity.

Dataset  Aspect  Resnik  Jiang  Lin
Marsha   MF      0.04    -0.05  0.04
Marsha   CC      0.05    -0.06  0.05
Marsha   BP      0.06    -0.03  0.05
RAD      MF      0.12    0.00   0.10
RAD      CC      0.14    -0.06  0.10
RAD      BP      0.14    -0.05  0.12

Correlation Coefficients between Gene Expression Correlation and Semantic Similarity When Average Correlations Are Computed over 100 Semantic Similarity Intervals.

Dataset  Aspect  Resnik  Jiang  Lin
Marsha   MF      0.63    -0.59  0.24
Marsha   CC      0.72    -0.32  0.12
Marsha   BP      0.77    -0.22  0.39
RAD      MF      0.47    0.16   0.28
RAD      CC      0.51    -0.23  0.34
RAD      BP      0.59    -0.14  0.41
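
The difference between the two tables above comes from averaging: per-pair correlations are noisy, while means taken over similarity intervals are much smoother. Below is a minimal sketch of that binning step, assuming equal-width intervals over the observed similarity range (the notes do not say how the 100 intervals were chosen). The function names and the synthetic data are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def binned_correlation(similarity, expr_corr, n_bins=100):
    """Average expr_corr within equal-width similarity intervals, then
    correlate the interval centres with the interval means."""
    lo, hi = min(similarity), max(similarity)
    if hi == lo:
        raise ValueError("similarity values must not all be equal")
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, c in zip(similarity, expr_corr):
        idx = min(int((s - lo) / width), n_bins - 1)  # top edge goes in last bin
        sums[idx] += c
        counts[idx] += 1
    centres, means = [], []
    for i in range(n_bins):
        if counts[i]:  # skip empty intervals
            centres.append(lo + (i + 0.5) * width)
            means.append(sums[i] / counts[i])
    return pearson(centres, means)

# Synthetic gene pairs: a weak linear signal buried in noise.
rng = random.Random(0)
sim = [rng.random() for _ in range(5000)]
expr = [0.5 * s + rng.gauss(0.0, 1.0) for s in sim]
raw = pearson(sim, expr)
binned = binned_correlation(sim, expr)
print(raw, binned)
```

Because averaging removes most of the per-pair noise, the binned coefficient describes the trend of the mean rather than the predictability of individual gene pairs.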



SS vs (Random permutation of GO annotation, Gene Expression)




Correlation between Gene Expression Correlation and Semantic Similarity for Resnik Distance
in Two Randomized Experiments.

Dataset  Aspect  Resnik  GO Random  Exp Random
Marsha   MF      0.63    -0.13      0.10
Marsha   CC      0.72    0.09       0.05
Marsha   BP      0.77    -0.08      0.20
RAD      MF      0.47    0.16       -0.03
RAD      CC      0.53    -0.23      -0.15
RAD      BP      0.61    -0.14      -0.16
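
One plausible reading of the "GO Random" control is that annotation sets are reassigned to genes at random, which keeps the overall annotation distribution but breaks the link between a gene and its function. The following is a sketch under that assumption; the function name and the toy gene labels are hypothetical, and the notes do not describe the exact procedure used in the paper.

```python
import random

def permute_annotations(annotations, seed=None):
    """Return a copy in which annotation sets are reassigned to genes at random."""
    rng = random.Random(seed)
    genes = list(annotations)
    shuffled = [annotations[g] for g in genes]
    rng.shuffle(shuffled)
    return dict(zip(genes, shuffled))

# Toy annotations (gene -> set of GO term IDs), for illustration only.
annotations = {
    "geneA": {"GO:0003824"},
    "geneB": {"GO:0005515"},
    "geneC": {"GO:0003824", "GO:0005515"},
    "geneD": {"GO:0016301"},
}
randomized = permute_annotations(annotations, seed=1)
print(randomized)
```

Semantic similarity recomputed on the permuted annotations, correlated against the unchanged expression data, gives a baseline of the kind shown in the "GO Random" column. Shuffling the expression profiles instead would correspond to "Exp Random".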


These results suggest that there is an underlying relationship between gene expression and GO  
annotation. They also validate the use of Resnik semantic similarity as a measure that is well 
correlated to gene expression and can be used to augment the biological knowledge achieved 
from other sources. For instance, in the same way that we have tools that characterize genes   
according to their expression profiles or similar criteria, tools could be developed that take 
advantage of semantic similarity to enhance existing information. Semantic similarity could 
also be used to improve current clustering algorithms as well as in the development of 
a "semantic search" tool.

LiebmanMN :

Assessing semantic similarity measures for the characterization of human regulatory pathways.
Guo X, Liu R, Shriver CD, Hu H, Liebman MN
Bioinformatics 22(8) p967-73 (2006 Apr 15) 10.1093/bioinformatics/btl042

ROC curve analysis

  • positive dataset :
It comprises pairwise interactions among proteins of the same complex and interactions of 
neighboring proteins within KEGG human regulatory pathways. After discarding proteins with 
indirect interaction effect, the interaction nature of neighboring proteins includes 
activation, inhibition, binding/association, dissociation, state change, phosphorylation,
dephosphorylation, glycosylation, ubiquitination and methylation.
  • negative dataset :
We randomly choose two distinct human proteins from Entrez Gene database as a non-interacting 
protein pair. This is valid since the chance of identifying protein–protein interactions at 
random is very small (0.024% based on the two-hybrid data by Uetz et al., 2000).
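
Given similarity scores for the positive and negative pairs, a ROC curve is traced by sweeping a threshold and recording the true and false positive rates. The sketch below illustrates the general technique only, not the authors' implementation; the function names and the example scores are made up.

```python
def roc_points(pos_scores, neg_scores):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over all scores."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # A pair is predicted "interacting" when its similarity is >= t.
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up similarity scores for interacting and random protein pairs.
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2, 0.1]
print(auc(roc_points(pos, neg)))
```

An area close to 1 means the measure ranks interacting pairs above random pairs almost always, and 0.5 corresponds to chance. That makes the area a convenient single number for comparing similarity measures on the same datasets.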

funsim :

A new measure for functional similarity of gene products based on Gene Ontology.
Schlicker A, Domingues FS, Rahnenführer J, Lengauer T
BMC Bioinformatics 7 p302 (2006 Jun 15) 10.1186/1471-2105-7-302

SS vs Sequence similarity

  • 4 different categories of protein pairs (NSS, LSS, HSS, IO)
    • NSS : no sequence similarity
    • LSS : low sequence similarity
    • HSS : high sequence similarity
    • IO : orthology according to InParanoid, a database of eukaryotic orthologs (http://InParanoid.sbc.su.se/)
In summary, these results confirm that functionally related proteins tend to have higher 
sequence similarity. This is more evident for the MFscore. Nevertheless, a considerable
percentage of protein pairs that are orthologous and that have a high sequence similarity show 
no functional similarity. The comparison with Lord's approach to combine semantic similarity 
scores shows significantly different results. In particular, the proposed approach is expected 
to provide a better discrimination between nonhomologous and orthologous proteins.


Finding functionally related proteins

  • They compared the 7,356 yeast proteins in UniProt to the 70,447 human proteins in UniProt.
  • These functionally related protein pairs are either non-homologous and evolved independently to a similar function or are remote homologs that cannot be identified by standard sequence based methods.



MDS for yeast-yeast comparison

  • Normalized stress (NS) : how well the pairwise distances are preserved in the lower dimensional space.
  • Change rate of NS (CR) : The highest CR indicates the optimal number of dimensions to represent the original dataset.



<latex>NS = {{\sum_{ij}{(d_{ij}^{\prime} - d_{ij})}^2} \over {\sum_{ij} d_{ij}^2}}</latex>


<latex>d_{ij}^{\prime}</latex> is the distance of proteins i and j in the low-dimensional space.


<latex>d_{ij}</latex> is the respective distance in the original space.


<latex>CR_k = {{(NS_k - NS_{k-1})}\over{(NS_{k+1} - NS_k )}}</latex>


<latex>k</latex> is the number of dimensions.
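
The two quantities above can be computed directly from the pairwise distances. This is a minimal sketch that follows the NS and CR definitions as written in these notes; the function names and the stress values in the example are hypothetical.

```python
def normalized_stress(d_orig, d_low):
    """NS: squared distance error, normalised by the squared original distances.
    d_orig and d_low list the same protein pairs in the same order."""
    num = sum((dl - do) ** 2 for do, dl in zip(d_orig, d_low))
    den = sum(do ** 2 for do in d_orig)
    return num / den

def change_rates(ns_by_dim):
    """CR_k = (NS_k - NS_{k-1}) / (NS_{k+1} - NS_k), for every k whose
    neighbours k-1 and k+1 are both available."""
    rates = {}
    for k in sorted(ns_by_dim):
        if k - 1 in ns_by_dim and k + 1 in ns_by_dim:
            denom = ns_by_dim[k + 1] - ns_by_dim[k]
            if denom != 0:  # CR is undefined when NS stops changing
                rates[k] = (ns_by_dim[k] - ns_by_dim[k - 1]) / denom
    return rates

# Hypothetical NS values for embeddings in 1 to 4 dimensions.
ns = {1: 0.40, 2: 0.20, 3: 0.15, 4: 0.14}
rates = change_rates(ns)
best_k = max(rates, key=rates.get)
print(rates, best_k)
```

A high CR at k means that adding the k-th dimension reduced the stress far more than adding the next one would, which is why the highest CR is taken as the optimal number of dimensions.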

LussierYA :

Evaluation of high-throughput functional categorization of human disease genes.
Chen JL, Liu Y, Sam LT, Li J, Lussier YA
BMC Bioinformatics 8 Suppl 3 pS7 (2007 May 9) 10.1186/1471-2105-8-S3-S7

Classification of HDG using GO categories

  • Dataset : Valle's human disease gene (HDG) lists
    • functional categories of a 923-disease subset of OMIM (Enzyme, Cell signaling, etc.)
  • GO mapping
    • assigning GO category for each HDG category
    • 923 HDGs of OMIM ⇒ LocusLink ⇒ 787 HDGs had been mapped to GO
    • ⇒ 728 HDGs were assigned to the 72 selected GO terms ⇒ Valle's class
  • GO clustering
    • calculation of semantic similarity for 787 HDGs
    • clustering of 787×787 distance matrix into 14 clusters
    • mapping to Valle's class

G-SESAME :

A new method to measure the semantic similarity of GO terms.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF
Bioinformatics 23(10) p1274-81 (2007 May 15) 10.1093/bioinformatics/btm087

SS vs pathway

  • Dataset : manually curated at the Saccharomyces Genome Database (SGD)
  • manually clustered by their molecular functions
  • Similarity values and clustering results obtained by G-SESAME are consistent with human perception, whereas those from Resnik's method are often inconsistent with it.

LussierYA :

  • Dataset
    • Comparison of ITSS to published predictive algorithms for the SGD and FlyBase datasets
      • used 165 GO terms in both the SGD and FlyBase datasets.
    • Predictions in the H.sapiens dataset
      • 2072 and 1390 distinct GO terms from the GOAr and GOAh files.
      • 13,509 and 11,076 such genes from the GOAr and GOAh files
  • CV : predict if a gene in the testing set is associated with a certain GO term, using the known annotations in the corresponding GOA files as a gold standard.

BurgunA :

SS vs pathway

  • Dataset : A relation between a third level term and a gene product in the KEGG pathway database is considered as a KEGG annotation.
  • Among the 18 transversal networks, 10 can be evaluated.
    • They are composed of at least two gene products that are present (annotated) in KEGG.

Number of transversal networks

Total  Count  Description
6      4      KEGG annotations are identical or correspond to sibling KEGG pathways
       1      KEGG annotations correspond to closely related two-level terms
       1      Annotations are different but reflect the composition of the networks into subnetworks
4             Heterogeneous. However, from a biological point of view, the KEGG annotations are complementary