
P. Resnik, "Using Information Content to Evaluate Semantic Similarity in a Taxonomy," Proc. 14th Int’l Joint Conf. Artificial Intelligence, pp. 448-453, 1995.

A drawback of the Resnik measure is that it cannot differentiate between pairs of terms that share the same most informative subsumer: every such pair receives the same score.
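
To make this drawback concrete, here is a minimal Python sketch. It assumes the usual definition of the Resnik score as the information content, IC(t) = -log p(t), of the most informative common ancestor. The toy taxonomy `PARENT` and the probabilities `P` are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a tree, for simplicity.
PARENT = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "c"}
# Hypothetical occurrence probabilities p(t); a parent is at least as
# probable as any of its children.
P = {"a": 1.0, "b": 0.4, "c": 0.3, "d": 0.1, "e": 0.25, "f": 0.05}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    out = [term]
    while term in PARENT:
        term = PARENT[term]
        out.append(term)
    return out


def ic(term):
    """Information content: IC(t) = -log p(t)."""
    return -math.log(P[term])


def resnik(t1, t2):
    """IC of the most informative common ancestor of t1 and t2."""
    common = set(ancestors(t1)) & set(ancestors(t2))
    return max(ic(t) for t in common)


# All three pairs share the subsumer "b", so their scores are identical,
# even though the pairs differ in how specific the terms are.
print(resnik("d", "e"), resnik("b", "e"), resnik("b", "b"))
```

The score depends only on the subsumer, so the specific pair (d, e) is rated exactly like (b, b).
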
D. Lin, "An Information-Theoretic Definition of Similarity," Proc. 15th Int’l Conf. Machine Learning, pp. 296-304, 1998.

J.J. Jiang and D.W. Conrath, "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy," Proc. Int’l Conf. Research in Computational Linguistics, ROCLING X, 1997.

As we argued with respect to Jiang, an important drawback arises when both gene products are only shallowly annotated: the similarity values are deceptively high. For example, if two genes are annotated only with the term “intracellular” (GO:0005622), the Jiang distance is zero and the Lin similarity is one. Although both measures report maximal similarity, the two gene products are in reality likely to be quite different.
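
The arithmetic behind this example can be sketched as follows. The code assumes the standard forms of the two measures: Lin similarity 2·IC(c) / (IC(t1) + IC(t2)) and Jiang distance IC(t1) + IC(t2) − 2·IC(c), where c is the most informative common ancestor. The probabilities used are hypothetical, not values taken from GO annotation data.

```python
import math


def lin_similarity(ic1, ic2, ic_mica):
    """Lin: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    return 2.0 * ic_mica / (ic1 + ic2)


def jiang_distance(ic1, ic2, ic_mica):
    """Jiang-Conrath: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return ic1 + ic2 - 2.0 * ic_mica


# Hypothetical probabilities: a shallow, very common term and a deep, rare one.
ic_shallow = -math.log(0.6)
ic_deep = -math.log(0.001)

# When both genes carry the same single term, the MICA is that term itself,
# so the result no longer depends on how informative the term is.
for label, ic_term in [("shallow", ic_shallow), ("deep", ic_deep)]:
    print(label,
          lin_similarity(ic_term, ic_term, ic_term),
          jiang_distance(ic_term, ic_term, ic_term))
```

Both rows report a Lin similarity of one and a Jiang distance of zero, so the shallow shared annotation is scored exactly like a deep, specific one.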




Example GO tree
Edge weights: is_a = 0.8, part_of = 0.6

S-values for GO terms in the DAG for intracellular membrane-bound organelle (GO:0043231)

| GO terms | S-value |
|---|---|
| 43231 | 1.0 |
| 43229 | 0.8 |
| 43227 | 0.8 |
| 5622 | 0.48 |
| 5623 | 0.288 |
| 43226 | 0.64 |
| 5575 | 0.512 |

S-values for GO terms in the DAG for intracellular organelle (GO:0043229)

| GO terms | S-value |
|---|---|
| 43229 | 1.0 |
| 5622 | 0.6 |
| 5623 | 0.36 |
| 43226 | 0.8 |
| 5575 | 0.64 |
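
The two tables can be reproduced with a short propagation sketch: the target term gets S = 1.0, and each ancestor gets the maximum over all paths of the product of edge weights. The edge list `EDGES` below is an assumption, since the tree figure itself is not reproduced here; it was chosen so that the output agrees with the tabulated values.

```python
# Edge weights from the notes: is_a = 0.8, part_of = 0.6.
W = {"is_a": 0.8, "part_of": 0.6}

# Assumed DAG (child -> [(parent, relation)]), using the short IDs of the tables.
EDGES = {
    43231: [(43229, "is_a"), (43227, "is_a")],
    43229: [(43226, "is_a"), (5622, "part_of")],
    43227: [(43226, "is_a")],
    43226: [(5575, "is_a")],
    5622: [(5623, "part_of")],
    5623: [(5575, "is_a")],
}


def s_values(term):
    """S-value of every term in the DAG of `term` (best path product)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in EDGES.get(child, []):
            value = s[child] * W[relation]
            # Keep only the largest contribution; revisit the parent if improved.
            if value > s.get(parent, 0.0):
                s[parent] = value
                stack.append(parent)
    return s


for target in (43231, 43229):
    print(target, {t: round(v, 3) for t, v in sorted(s_values(target).items())})
```

Note that 5575 is reachable through both 43226 and 5623; the larger product (via 43226) is the one that appears in the tables.
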

Semantic similarity between GO terms A and B:
use Lin's semantic similarity, but use occurrence probability of a term

TO: Term Overlap
: set of all direct annotations for each gene and all of their associated parent terms (excluding root)
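
A minimal sketch of how such an annotation set can be built and compared. It assumes that the term overlap score is the number of terms shared by the two genes' annotation sets, which the notes above do not spell out. The parent map `PARENTS`, the root ID, and the two example genes are hypothetical.

```python
# Assumed parent map (child -> parents), using short GO IDs.
PARENTS = {
    43231: [43229, 43227],
    43229: [43226, 5622],
    43227: [43226],
    43226: [5575],
    5622: [5623],
    5623: [5575],
}
ROOT = 5575  # root of the ontology, excluded from annotation sets


def annotation_set(direct_terms):
    """Direct annotations plus all of their ancestors, excluding the root."""
    seen = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term == ROOT or term in seen:
            continue
        seen.add(term)
        stack.extend(PARENTS.get(term, []))
    return seen


def term_overlap(direct1, direct2):
    """Assumed TO score: size of the intersection of the two annotation sets."""
    return len(annotation_set(direct1) & annotation_set(direct2))


gene1 = [43231]  # hypothetical gene annotated with 43231
gene2 = [43229]  # hypothetical gene annotated with 43229
print(sorted(annotation_set(gene1)))
print(term_overlap(gene1, gene2))
```

Because the count is not normalised, genes with many or deep annotations can reach larger scores than shallowly annotated ones.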