Semantic similarity measures

$p(t1)={freq(t1)}/{N}~~~~IC(t1)=-logp(t1)$

freq(t1) : number of annotations to term t1 or to any of its descendants

N : total number of annotations
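
A minimal sketch of how p(t) and IC(t) could be computed, assuming that freq(t) counts annotations to t and to all of its descendants. The toy ontology `PARENTS`, the counts in `DIRECT_COUNTS`, and all function names are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical toy ontology: term -> list of parents (a DAG; the root has none).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
}

# Hypothetical number of annotations made directly to each term.
DIRECT_COUNTS = {"root": 0, "A": 1, "B": 4, "A1": 2, "A2": 1}


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_probabilities(parents, direct_counts):
    """p(t) = freq(t) / N, where freq(t) includes annotations to descendants."""
    freq = {term: 0 for term in parents}
    for term, count in direct_counts.items():
        # Each annotation also counts for every ancestor of the annotated term.
        for anc in ancestors(term, parents):
            freq[anc] += count
    total = sum(direct_counts.values())  # N
    return {term: freq[term] / total for term in parents}


P = term_probabilities(PARENTS, DIRECT_COUNTS)
IC = {term: -math.log(p) for term, p in P.items()}  # IC(t) = -log p(t)
```

With these counts the root has p = 1 and IC = 0, and IC grows toward the leaves.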

Resnik, Lin, Jiang&Conrath :

P. Resnik
Using Information Content to Evaluate Semantic Similarity in a Taxonomy 
Proc. 14th Int’l Joint Conf. Artificial Intelligence, pp. 448-453, 1995.

$S_Resnik (t1,t2)={max}under{c in S(t1,t2)}(-logp(c))$

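S(t1,t2) : set of common subsumers (shared ancestors) of t1 and t2

A drawback of the Resnik measure is that it cannot distinguish between pairs of terms that share the same most informative common subsumer: every such pair receives the same score, however specific the two terms themselves are.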
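
The drawback can be seen on a small example. The sketch below assumes a hypothetical toy ontology (`PARENTS`) with made-up term probabilities (`P`); it is an illustration only, not a reference implementation.

```python
import math

# Hypothetical toy ontology: term -> list of parents.
PARENTS = {"root": [], "A": ["root"], "A1": ["A"], "A2": ["A"], "A11": ["A1"]}

# Hypothetical term probabilities p(t); they shrink toward the leaves.
P = {"root": 1.0, "A": 0.5, "A1": 0.25, "A2": 0.125, "A11": 0.0625}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_subsumers(t1, t2):
    """S(t1, t2): terms that subsume both t1 and t2."""
    return ancestors(t1) & ancestors(t2)


def resnik(t1, t2):
    """Information content of the most informative common subsumer."""
    return max(-math.log(P[c]) for c in common_subsumers(t1, t2))


# Both pairs share "A" as most informative common subsumer,
# so they get the same score although A11 is more specific than A1.
score_siblings = resnik("A1", "A2")
score_deeper = resnik("A11", "A2")
```

Here `score_siblings` and `score_deeper` are equal, because only the common subsumer enters the formula.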

D. Lin
An Information-Theoretic Definition of Similarity
Proc. 15th Int’l Conf. Machine Learning, pp. 296-304, 1998.

$S_Lin (t1,t2)={max}under{c in S(t1,t2)}({2*logp(c)}/{logp(t1)+logp(t2)})$

J.J. Jiang and D.W. Conrath
Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy 
Proc. Int’l Conf. Research in Computational Linguistics, ROCLING X, 1997.

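$D_{Jiang&Conrath} (t1,t2)={min}under{c in S(t1,t2)}({2*logp(c)}-logp(t1)-logp(t2))$

Unlike the two measures above, this is a distance: it is zero for identical terms and grows as the terms become less similar, so it is minimized over the common subsumers.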

The Lin and Jiang & Conrath measures share an important drawback when both gene products are only shallowly annotated: the reported similarity is deceptively high. For example, if two genes are each annotated only with the term “intracellular” (GO:0005622), the Jiang & Conrath distance is zero and the Lin similarity is one. Both measures therefore indicate maximal similarity, although in reality the two gene products are likely to be quite different.
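
The shallow-annotation case can be reproduced with a short sketch. It reads the Jiang & Conrath measure as a distance minimized over the common subsumers, and it uses a hypothetical three-term ontology with made-up probabilities; none of the values come from real GO annotation data.

```python
import math

# Hypothetical toy ontology: term -> list of parents.
PARENTS = {"root": [], "intracellular": ["root"], "nucleus": ["intracellular"]}

# Hypothetical term probabilities: "intracellular" is a very general term.
P = {"root": 1.0, "intracellular": 0.6, "nucleus": 0.1}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lin(t1, t2):
    """Lin similarity; assumes neither term is the root (p < 1)."""
    denom = math.log(P[t1]) + math.log(P[t2])
    common = ancestors(t1) & ancestors(t2)
    return max(2 * math.log(P[c]) / denom for c in common)


def jiang_conrath_distance(t1, t2):
    """Jiang & Conrath distance, minimized over the common subsumers."""
    common = ancestors(t1) & ancestors(t2)
    return min(2 * math.log(P[c]) - math.log(P[t1]) - math.log(P[t2])
               for c in common)


# Two genes annotated only with the general term "intracellular".
shallow_lin = lin("intracellular", "intracellular")
shallow_jc = jiang_conrath_distance("intracellular", "intracellular")
```

`shallow_lin` is 1 and `shallow_jc` is 0, exactly as for two genes sharing a highly specific term, which is why the result is misleading.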

Schlicker :

A. Schlicker, F.S. Domingues, J. Rahnenführer, and T. Lengauer
A new measure for functional similarity of gene products based on Gene Ontology
BMC Bioinformatics 7, 302 (2006 Jun 15). doi:10.1186/1471-2105-7-302

$S_Rel (t1,t2)={max}under{c in S(t1,t2)}({{2*logp(c)}/{logp(t1)+logp(t2)}}*(1-p(c)))$
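
The factor (1-p(c)) lowers the score when the common subsumer is a frequent, general term. A minimal sketch under the same kind of assumptions as before (hypothetical toy ontology `PARENTS`, made-up probabilities `P`):

```python
import math

# Hypothetical toy ontology: term -> list of parents.
PARENTS = {"root": [], "intracellular": ["root"], "nucleus": ["intracellular"]}

# Hypothetical term probabilities.
P = {"root": 1.0, "intracellular": 0.6, "nucleus": 0.1}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def sim_rel(t1, t2):
    """Lin term weighted by (1 - p(c)); assumes neither term is the root."""
    denom = math.log(P[t1]) + math.log(P[t2])
    common = ancestors(t1) & ancestors(t2)
    return max((2 * math.log(P[c]) / denom) * (1 - P[c]) for c in common)


general_pair = sim_rel("intracellular", "intracellular")
specific_pair = sim_rel("nucleus", "nucleus")
```

With these made-up probabilities, `general_pair` is about 0.4 and `specific_pair` about 0.9, whereas the Lin measure would give 1 in both cases.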

A. Schlicker, J. Rahnenführer, M. Albrecht, T. Lengauer, and F.S. Domingues
GOTax: investigating biological processes and biochemical activities along the taxonomic tree
Genome Biol. 8(3), R33 (2007). doi:10.1186/gb-2007-8-3-r33

A. Schlicker and M. Albrecht
FunSimMat: a comprehensive functional similarity database
Nucleic Acids Res. 36(Database issue), pp. D434-D439 (2008 Jan). doi:10.1093/nar/gkm806

A. Schlicker and M. Albrecht
FunSimMat update: new features for exploring functional similarity
Nucleic Acids Res. 38(Database issue), pp. D244-D248 (2010 Jan). doi:10.1093/nar/gkp979

A. Schlicker, T. Lengauer, and M. Albrecht
Improving disease gene prioritization using the semantic similarity of Gene Ontology terms
Bioinformatics 26(18), pp. i561-i567 (2010 Sep 15). doi:10.1093/bioinformatics/btq384

LinK :

X. Wu, L. Zhu, J. Guo, D.Y. Zhang, and K. Lin
Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations
Nucleic Acids Res. 34(7), pp. 2137-2150 (2006). doi:10.1093/nar/gkl219

$S(term_i ,term_j ) = {{maxDepth^GO }/{maxDepth^GO + gamma}}*{{alpha}/{alpha + beta}}$

$alpha={max}under{path_m in Paths(term_i ), path_n in Paths(term_j )} delim{lbrace}{{the number of common terms}under{between path_m and path_n }}{rbrace}-1$

$beta=max delim{lbrace}{{min}under{u in U} delim{lbrace}{dist(term_i , u)}{rbrace}, {min}under{v in V} delim{lbrace} {dist(term_j , v)} {rbrace} }{rbrace}$

$where U = delim{lbrace}{all leaf nodes descending from term_i }{rbrace} ~and V = delim{lbrace}{all leaf nodes descending from term_j }{rbrace}$

$gamma=dist(MRCA, term_i )+dist(MRCA, term_j )$

MRCA : Most Recent Common Ancestor
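
A rough sketch of how alpha, beta and gamma could be computed on a toy graph. Several points are assumptions rather than details taken from the paper: dist is read as the shortest downward path length in edges, a leaf is at distance 0 from itself, Paths(term) is the set of root-to-term paths, and when several common ancestors qualify as MRCA the one giving the smallest gamma is used. The ontology `CHILDREN` and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy ontology: term -> list of children (rooted at "root").
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": [],
    "A1": ["A11"],
    "A2": [],
    "A11": [],
}
MAX_DEPTH = 3  # maxDepth^GO: longest root-to-leaf path of the toy graph, in edges


def paths_from(source, target):
    """All downward paths from source to target, each as a list of terms."""
    if source == target:
        return [[target]]
    return [[source] + rest
            for child in CHILDREN[source]
            for rest in paths_from(child, target)]


def dist(ancestor, descendant):
    """Shortest downward path length in edges (assumed meaning of dist)."""
    return min(len(path) for path in paths_from(ancestor, descendant)) - 1


def nearest_leaf_distance(term):
    """Distance from term to its closest leaf descendant (0 for a leaf)."""
    if not CHILDREN[term]:
        return 0
    return 1 + min(nearest_leaf_distance(child) for child in CHILDREN[term])


def alpha(ti, tj):
    """Largest number of terms shared by a pair of root paths, minus one."""
    pairs = product(paths_from("root", ti), paths_from("root", tj))
    return max(len(set(pm) & set(pn)) for pm, pn in pairs) - 1


def beta(ti, tj):
    """The larger of the two nearest-leaf distances."""
    return max(nearest_leaf_distance(ti), nearest_leaf_distance(tj))


def gamma(ti, tj):
    """Summed distance from the MRCA to both terms (smallest over candidates)."""
    anc_i = {t for path in paths_from("root", ti) for t in path}
    anc_j = {t for path in paths_from("root", tj) for t in path}
    return min(dist(c, ti) + dist(c, tj) for c in anc_i & anc_j)


def similarity(ti, tj):
    a, b, g = alpha(ti, tj), beta(ti, tj), gamma(ti, tj)
    if a + b == 0:  # guard against 0/0; returning 0 here is an assumption
        return 0.0
    return (MAX_DEPTH / (MAX_DEPTH + g)) * (a / (a + b))
```

On this toy graph, `similarity("A11", "A11")` is 1 for the deep leaf, while `similarity("A", "A")` is only 0.5: a general term compared with itself no longer receives the maximal score, in contrast to the Lin measure.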