Tuesday, July 7, 2009

similarity scoring

After doing a lot of reading, I've decided to use cosine similarity to quantify the similarity between lists of tokens.

Cosine similarity is the cosine of the angle between two vectors that represent the token lists. For count vectors like the ones below, it ranges from 0 (no words in common) to 1 (identical word proportions).

For example, if area and volume are defined as:
area: length multiplied by width
volume: length multiplied by width multiplied by height

we can create a vector for each description, with one cell per unique word across both descriptions (length, multiplied, by, width, height). The value in each cell is the number of times that word occurs in the description.

So we would get something like this:
area = [1, 1, 1, 1, 0]
volume = [1, 2, 2, 1, 1]
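
Here is a minimal sketch of how those two vectors could be built in Python. The function name count_vectors and the whitespace tokenization are assumptions for illustration, not part of the project code.

```python
def count_vectors(tokens_a, tokens_b):
    """Build word-count vectors over the combined vocabulary of two token lists.

    Hypothetical helper: the vocabulary keeps words in order of first
    appearance, so the cells line up with the example above.
    """
    vocabulary = []
    for token in tokens_a + tokens_b:
        if token not in vocabulary:
            vocabulary.append(token)

    # One cell per vocabulary word, holding that word's count in each list.
    vec_a = [tokens_a.count(word) for word in vocabulary]
    vec_b = [tokens_b.count(word) for word in vocabulary]
    return vocabulary, vec_a, vec_b


# Assumes simple whitespace tokenization of the two descriptions.
area = "length multiplied by width".split()
volume = "length multiplied by width multiplied by height".split()

vocabulary, area_vec, volume_vec = count_vectors(area, volume)
print(vocabulary)   # ['length', 'multiplied', 'by', 'width', 'height']
print(area_vec)     # [1, 1, 1, 1, 0]
print(volume_vec)   # [1, 2, 2, 1, 1]
```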

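From here, the calculation of the cosine is straightforward: divide the dot product of the two vectors by the product of their lengths, i.e. cos(theta) = (A . B) / (|A| |B|) (http://upload.wikimedia.org/math/a/f/3/af3da3b8fcced86a3e7c8cf4075ed94c.png).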

For the current example, the dot product is 6, the vector lengths are 2 and sqrt(11), and the result is 6 / (2 * sqrt(11)), or about 0.90 (i.e. the descriptions are extremely similar).
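
The calculation can be sketched in a few lines of Python. This is one possible version, assuming the inputs are plain lists of string tokens; the name cosine_similarity and the choice to return 0.0 for an empty list are my own assumptions.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    counts_a = Counter(tokens_a)
    counts_b = Counter(tokens_b)

    # Dot product: only words present in both lists contribute
    # (a Counter returns 0 for a missing word).
    dot = sum(count * counts_b[word] for word, count in counts_a.items())

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # Assumption: an empty token list is treated as similarity 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


area = "length multiplied by width".split()
volume = "length multiplied by width multiplied by height".split()

score = cosine_similarity(area, volume)
print(round(score, 2))  # 0.9
```

Using Counter avoids building the full vocabulary explicitly, since words that appear in only one list add nothing to the dot product.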

Advantages of using Cosine Similarity
  • Widely used in linguistics
  • Easy to implement
  • Flexible: Vectors can be defined in multiple ways, e.g. by deleting stop words, adding synonyms, weighting rare words, or keeping only lexical chains.
  • Similarity is based on content, not on order. (If it becomes clear that word order matters to the similarity of descriptions, we can weight the similarity value with an order-dependent coefficient.)
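
To illustrate the flexibility point, here is a rough sketch of one variation: dropping stop words before the vectors are formed. The STOP_WORDS set and both function names are hypothetical, chosen only for this example; a real stop list would be much longer.

```python
import math
from collections import Counter

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"by", "the", "of", "a"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the word-count vectors of two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * counts_b[word] for word, count in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: empty input counts as no similarity
    return dot / (norm_a * norm_b)


area = remove_stop_words("length multiplied by width".split())
volume = remove_stop_words(
    "length multiplied by width multiplied by height".split()
)

filtered_score = cosine_similarity(area, volume)
print(round(filtered_score, 3))  # 0.873
```

Only the way the token lists are prepared changes; the similarity function itself stays the same. Synonym expansion or rare-word weighting could be slotted in at the same step.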
Limitations
  • Does not take word sense into account. But this is true of most, if not all, single-equation measures. If it becomes an issue, I think that word sense can be dealt with at the time of vector formation.
UPDATE:
1. Wrote a function to calculate cosine similarity given two token lists.
2. Eclipse is working a whole lot better than IDLE. No lagging at all yet.
