
Semantic similarity measures

$$p(t_1) = \frac{\mathrm{freq}(t_1)}{N}, \qquad IC(t_1) = -\log p(t_1)$$

N : total number of annotations
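
As a small illustration of the two definitions above, the following sketch computes p(t) and IC(t) from annotation counts. The term names and the annotation list are hypothetical, and the counts are assumed to be already propagated (in practice freq(t) usually also includes annotations to the descendants of t).

```python
import math
from collections import Counter

# Hypothetical annotation corpus: one term per annotation record.
annotations = ["GO:A", "GO:A", "GO:B", "GO:C", "GO:A", "GO:B", "GO:A", "GO:C"]

freq = Counter(annotations)   # freq(t): number of annotations with term t
N = len(annotations)          # N: total number of annotations

def p(term):
    """Occurrence probability p(t) = freq(t) / N."""
    return freq[term] / N

def ic(term):
    """Information content IC(t) = -log p(t)."""
    return -math.log(p(term))

for term in sorted(freq):
    print(term, p(term), round(ic(term), 3))
```

Frequent terms get a low IC and rare terms a high IC, which is what the measures below rely on.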

Resnik, Lin, Jiang&Conrath :

P. Resnik
Using Information Content to Evaluate Semantic Similarity in a Taxonomy 
Proc. 14th Int’l Joint Conf. Artificial Intelligence, pp. 448-453, 1995.



$$S_{\mathrm{Resnik}}(t_1,t_2) = \max_{c \in S(t_1,t_2)} \bigl(-\log p(c)\bigr)$$

S(t1,t2) : set of common subsumers (shared ancestors) of t1 and t2


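A drawback of the Resnik measure is that it cannot distinguish between term pairs that share the same most informative common subsumer: all such pairs receive the same score, however specific the terms themselves are.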
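
The drawback can be seen on a toy taxonomy. The sketch below is only illustrative: the taxonomy, the term names and the probabilities are made up, and the set of common subsumers is assumed to include each term itself.

```python
import math

# Hypothetical toy taxonomy: term -> list of parents.
parents = {"root": [], "a": ["root"], "b": ["a"], "c": ["a"], "d": ["b"]}
# Hypothetical occurrence probabilities (the root has p = 1).
prob = {"root": 1.0, "a": 0.5, "b": 0.2, "c": 0.2, "d": 0.05}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def sim_resnik(t1, t2):
    """Maximum information content over the common subsumers of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(prob[c]) for c in common)

# Both pairs have "a" as their most informative common subsumer,
# so they get the same score although "d" is more specific than "b".
print(sim_resnik("b", "c") == sim_resnik("d", "c"))
```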

D. Lin
An Information-Theoretic Definition of Similarity
Proc. 15th Int’l Conf. Machine Learning, pp. 296-304, 1998.



$$S_{\mathrm{Lin}}(t_1,t_2) = \max_{c \in S(t_1,t_2)} \frac{2\log p(c)}{\log p(t_1)+\log p(t_2)}$$

J.J. Jiang and D.W. Conrath
Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy 
Proc. Int’l Conf. Research in Computational Linguistics, ROCLING X, 1997.



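$$d_{\mathrm{Jiang\&Conrath}}(t_1,t_2) = \min_{c \in S(t_1,t_2)} \bigl(2\log p(c) - \log p(t_1) - \log p(t_2)\bigr)$$

This expression is a distance rather than a similarity: it is zero for identical terms and grows as the terms diverge, so the minimum over the common subsumers is taken.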


The Jiang & Conrath and Lin measures share an important drawback when both gene products are only shallowly annotated: the similarity values appear deceptively high. For example, if two genes are each annotated only with the term “intracellular” (GO:0005622) and no additional annotation is available, the Jiang distance is zero and the Lin similarity is one. Both measures therefore report maximal similarity, although in reality the two gene products are likely to be quite different.
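
The example can be checked directly from the formulas. The short derivation below assumes both genes carry the same single term t with p(t) < 1, and takes c = t as the common subsumer.

```latex
% Case t_1 = t_2 = t, with subsumer c = t
S_{\mathrm{Lin}}(t,t) = \frac{2\log p(t)}{\log p(t) + \log p(t)} = 1
\qquad
2\log p(t) - \log p(t) - \log p(t) = 0
```

Neither value depends on p(t), so a shallow term such as “intracellular” scores exactly like a highly specific one.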

Schlicker :

A new measure for functional similarity of gene products based on Gene Ontology.
Schlicker A, Domingues FS, Rahnenführer J, Lengauer T
BMC Bioinformatics 7:302 (2006 Jun 15). doi:10.1186/1471-2105-7-302

$$S_{\mathrm{Rel}}(t_1,t_2) = \max_{c \in S(t_1,t_2)} \left(\frac{2\log p(c)}{\log p(t_1)+\log p(t_2)} \cdot \bigl(1-p(c)\bigr)\right)$$
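
The factor (1 - p(c)) down-weights common subsumers that are very frequent. A minimal sketch of this effect follows. For brevity it evaluates the expression for one given subsumer c instead of maximising over all common subsumers, and the probabilities are hypothetical.

```python
import math

def sim_lin(p_t1, p_t2, p_c):
    """Lin term for one common subsumer c."""
    return 2 * math.log(p_c) / (math.log(p_t1) + math.log(p_t2))

def sim_rel(p_t1, p_t2, p_c):
    """Lin term weighted by (1 - p(c)), as in the S_Rel formula above."""
    return sim_lin(p_t1, p_t2, p_c) * (1 - p_c)

# Two genes annotated with the same shallow term (hypothetical p = 0.8).
shallow = sim_rel(0.8, 0.8, 0.8)
# Two genes annotated with the same specific term (hypothetical p = 0.001).
specific = sim_rel(0.001, 0.001, 0.001)

print(round(shallow, 3), round(specific, 3))
```

The Lin term equals one in both cases, but the weighted score stays low for the shallow term and close to one for the specific term.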

GOTax: investigating biological processes and biochemical activities along the taxonomic tree.
Schlicker A, Rahnenführer J, Albrecht M, Lengauer T, Domingues FS
Genome Biol 8(3):R33 (2007). doi:10.1186/gb-2007-8-3-r33

FunSimMat: a comprehensive functional similarity database.
Schlicker A, Albrecht M
Nucleic Acids Res 36(Database issue):D434-9 (2008 Jan). doi:10.1093/nar/gkm806

FunSimMat update: new features for exploring functional similarity.
Schlicker A, Albrecht M
Nucleic Acids Res 38(Database issue):D244-8 (2010 Jan). doi:10.1093/nar/gkp979

Lin K :

$$S(\mathrm{term}_i, \mathrm{term}_j) = \frac{\mathrm{maxDepth}^{GO}}{\mathrm{maxDepth}^{GO} + \gamma} \cdot \frac{\alpha}{\alpha + \beta}$$

$$\alpha = \max_{\mathrm{path}_m \in \mathrm{Paths}(\mathrm{term}_i),\ \mathrm{path}_n \in \mathrm{Paths}(\mathrm{term}_j)} \bigl\{\text{number of common terms between } \mathrm{path}_m \text{ and } \mathrm{path}_n\bigr\} - 1$$

$$\beta = \max\Bigl\{ \min_{u \in U} \mathrm{dist}(\mathrm{term}_i, u),\ \min_{v \in V} \mathrm{dist}(\mathrm{term}_j, v) \Bigr\}$$

where U = {all leaf nodes descending from term_i} and V = {all leaf nodes descending from term_j}

$$\gamma = \mathrm{dist}(\mathrm{MRCA}, \mathrm{term}_i) + \mathrm{dist}(\mathrm{MRCA}, \mathrm{term}_j)$$

MRCA : Most Recent Common Ancestor
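
To make the roles of the three quantities concrete, the sketch below simply combines given values of alpha, beta, gamma and the maximum GO depth. The function name, the example values and the handling of the undefined case alpha + beta = 0 are assumptions made for illustration; computing alpha, beta and gamma from the GO graph is not shown.

```python
def link_similarity(max_depth_go, alpha, beta, gamma):
    """Combine the quantities defined above into one similarity score."""
    if alpha + beta == 0:
        # Assumption: the formula is undefined here (only the root is shared
        # and both terms are leaves), so 0.0 is returned by convention.
        return 0.0
    depth_factor = max_depth_go / (max_depth_go + gamma)  # penalises distance to the MRCA
    specificity_factor = alpha / (alpha + beta)           # rewards deep common paths and terms near the leaves
    return depth_factor * specificity_factor

# Hypothetical values: maximum depth 10, 4 shared levels below the root,
# one step to the nearest leaf, two steps in total to the MRCA.
score = link_similarity(max_depth_go=10, alpha=4, beta=1, gamma=2)
print(round(score, 4))
```

With gamma = 0 and beta = 0 (the same leaf term compared with itself) both factors equal one, so the score reaches its maximum of 1.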

Coutinho P :

Measuring semantic similarity between Gene Ontology terms

Couto F, Silva M, Coutinho P
Data & Knowledge Engineering 61:137-152 (2007 Apr)

[Figure: example GO tree]




Edge weights $\omega_e$ : is_a = 0.8, part_of = 0.6


S-values for the GO terms in the DAG of the term intracellular membrane-bound organelle (GO:0043231)

GO terms S-value
43231 1.0
43229 0.8
43227 0.8
5622 0.48
5623 0.288
43226 0.64
5575 0.512

S-values for the GO terms in the DAG of the term intracellular organelle (GO:0043229)

GO terms S-value
43229 1.0
5622 0.6
5623 0.36
43226 0.8
5575 0.64
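
The tabulated S-values can be reproduced by propagating the edge weights upward from the term of interest and keeping the largest product at each ancestor. In the sketch below the edge list is an assumption: the original figure is not available, so the edges were reconstructed to be consistent with the two tables. The function name is also hypothetical.

```python
# Assumed edges (child, parent, relation), reconstructed from the tables;
# GO IDs are abbreviated as in the tables (43231 stands for GO:0043231).
edges = [
    ("43231", "43229", "is_a"),
    ("43231", "43227", "is_a"),
    ("43229", "5622", "part_of"),
    ("43229", "43226", "is_a"),
    ("43227", "43226", "is_a"),
    ("5622", "5623", "part_of"),
    ("5623", "5575", "is_a"),
    ("43226", "5575", "is_a"),
]
weight = {"is_a": 0.8, "part_of": 0.6}

def s_values(term):
    """S(term) = 1; each ancestor gets the best weight product over all paths."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for c, parent, relation in edges:
            if c != child:
                continue
            candidate = weight[relation] * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate      # better path found: update and revisit
                frontier.append(parent)
    return s

for go_id, value in s_values("43231").items():
    print(go_id, round(value, 3))
```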



Semantic similarity between GO terms A and B:


$$S_{GO}(A,B) = \frac{\sum_{t \in T_A \cap T_B} \bigl(S_A(t) + S_B(t)\bigr)}{SV(A) + SV(B)}$$
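
As a worked example, the sketch below applies this formula to the two S-value tables above (A = 0043231, B = 0043229). It assumes that SV(X) is the sum of the S-values of all terms in the DAG of X, which is not stated explicitly in these notes.

```python
# S-values copied from the two tables above (GO IDs abbreviated as there).
s_a = {"43231": 1.0, "43229": 0.8, "43227": 0.8, "5622": 0.48,
       "5623": 0.288, "43226": 0.64, "5575": 0.512}
s_b = {"43229": 1.0, "5622": 0.6, "5623": 0.36, "43226": 0.8, "5575": 0.64}

def s_go(s_a, s_b):
    """Similarity of two GO terms from their S-value tables."""
    common = s_a.keys() & s_b.keys()                   # T_A intersected with T_B
    numerator = sum(s_a[t] + s_b[t] for t in common)
    # Assumption: SV(X) is the sum of all S-values in the DAG of X.
    sv_a = sum(s_a.values())
    sv_b = sum(s_b.values())
    return numerator / (sv_a + sv_b)

print(round(s_go(s_a, s_b), 4))  # 0.7727
```

Here the numerator is 2.72 + 3.40 = 6.12 and the denominator is 4.52 + 3.40 = 7.92.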

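Lussier YA :

Uses Lin's semantic similarity, but derives the occurrence probability of a term from the ontology structure rather than from annotation frequency:

$$p(c) = \frac{1 + \text{number of all descendants of } c}{\text{total number of concepts in the ontology}}$$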
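
A minimal sketch of this structure-based probability on a made-up ontology follows. The concept names and the graph are hypothetical, and a concept reachable through several parents is counted once.

```python
# Hypothetical toy ontology: concept -> list of children.
children = {"root": ["a", "b"], "a": ["c", "d"], "b": ["d"], "c": [], "d": []}

def descendants(concept):
    """All concepts reachable below the given concept (each counted once)."""
    seen = set()
    stack = list(children[concept])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen

def p(concept):
    """(1 + number of descendants) / total number of concepts."""
    return (1 + len(descendants(concept))) / len(children)

for concept in children:
    print(concept, p(concept))
```

The root gets p = 1 and leaves get the smallest value, so no annotation corpus is needed to obtain an information content.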

Yang & Kim :

$$sim_{DF}(A,B) = sim_{Wang}(A,B) \cdot DF$$

Meeta Mistry :

TO : Term Overlap


annot_g : for each gene g, the set of all its direct annotations together with all of their associated parent terms (excluding the root)


$$sim_{TO}(g_1,g_2) = |annot_{g_1} \cap annot_{g_2}|$$
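
Term overlap is a plain set intersection once the annotation sets have been expanded. The sketch below uses a hypothetical term hierarchy and hypothetical gene annotations, and it interprets "parent terms" as all ancestors up to, but excluding, the root.

```python
# Hypothetical toy hierarchy: term -> list of parents.
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["a", "b"]}
ROOT = "root"

def annot(direct_terms):
    """Direct annotations plus all of their ancestors, excluding the root."""
    result = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term != ROOT and term not in result:
            result.add(term)
            stack.extend(parents[term])
    return result

def sim_to(g1_terms, g2_terms):
    """Number of terms shared by the two expanded annotation sets."""
    return len(annot(g1_terms) & annot(g2_terms))

# Gene 1 is annotated with "c", gene 2 with "d"; they share only "a".
print(sim_to({"c"}, {"d"}))
```

The score is an unnormalised count, so genes with many or deep annotations can reach higher values than sparsely annotated ones.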

Sidahmed Benabderrahmane :

IntelliGO: a new vector-based semantic similarity measure including annotation origin.
Benabderrahmane S, Smail-Tabbone M, Poch O, Napoli A, Devignes MD
BMC Bioinformatics 11(1):588 (2010 Dec 1). doi:10.1186/1471-2105-11-588

Jain S :

Wang H :

Ontology- and graph-based similarity assessment in biological networks.
Wang H, Zheng H, Azuaje F
Bioinformatics 26(20):2643-4 (2010 Oct 15). doi:10.1093/bioinformatics/btq477

Zhu W :

Semantic and layered protein function prediction from PPI networks.
Zhu W, Hou J, Chen YP
J Theor Biol 267(2):129-36 (2010 Nov 21). doi:10.1016/j.jtbi.2010.08.005

Tedder PM :

Gene function prediction using semantic similarity clustering and enrichment analysis in the malaria parasite Plasmodium falciparum.
Tedder PM, Bradford JR, Needham CJ, McConkey GA, Bulpitt AJ, Westhead DR
Bioinformatics 26(19):2431-7 (2010 Oct 1). doi:10.1093/bioinformatics/btq450

Chen Z :

Wang Z :

Revealing and avoiding bias in semantic similarity scores for protein pairs.
Wang J, Zhou X, Zhu J, Zhou C, Guo Z
BMC Bioinformatics 11:290 (2010 May 28). doi:10.1186/1471-2105-11-290

Sheehan B :

A relation based measure of semantic similarity for Gene Ontology annotations.
Sheehan B, Quigley A, Gaudin B, Dobson S
BMC Bioinformatics 9:468 (2008 Nov 4). doi:10.1186/1471-2105-9-468