Friday, June 26, 2009

In terms of the overall project, my plan is to analyze the term descriptions using natural language tools, with the output of this process being two lists of one or more tokens. A comparison function would quantify the similarity between these lists.

There seem to be two main approaches: measuring the longest subsequence, and quantifying the overlap based on shared elements.

Since the token lists can be thought of as sequences made up of tokens, analogous to the way that strings are sequences of characters, I've been looking at a few string similarity measures. The most famous of those appear to be:

1. The Levenshtein (or Edit) Distance
Characterized by the the minimum number of operations needed to transform one sequence into the other
operations include: insertion, deletion and substitution of an element (a modified version also allows transposition)

2. Jaro-Winkler Distance
Characterized by the number of elements from one sequence match elements in another sequence- in this case, the concept of matching includes not only the content of the element, but its position within the sequence. Sequences that match from the beginning are considered more similar than those that do not.
Works best on short sequences.

3. Dice Coefficient
Compares two sets. When strings are compared, the set elements are typically bigrams, rather than characters.

4. Jaccard Index:
Compares two sets by dividing their intersection by their union.

None of these jumps out as an ideal measure:
The Levenshtein Distance and Jaro-Winkler Distance emphasize the position of elements in the sequence- whereas what we're interested is more the syntactic relatedness of elements within the sequences. In addition, calculation of the Levenshtein distance appears to be somewhat time-consuming.

The Dice Coeffiecient and Jaccard Index both rely on sets, which by nature ignore duplicate elements- but the duplication of the same word in two descriptons may be a good indicator of similarity.

Perhaps it would make sense to modify one or more of these to create an applicable similarity measure or measures?

I also had a look at the Lucene similarity class- it looks good, but it'd take a lot of reworking, because it's intended to match pre-indexed documents to a set of keywords, rather than to compare two lists of tokens.

Any ideas? What am I overlooking?

1 comment:

  1. OK. Got attached to this and an RSS Feed set up in Outlook (the assimilation scars have mostly healed :-) ).