How do we quantify similarity between descriptions? A lot of the usual measures rely heavily on sequence order, but not on the meaning of sequence elements- which makes sense with DNA sequences and character strings, but not so much with lists of words.
Some Different Approaches From Various Fields
Translation (Between natural languages)
Uses documents with the same content in different languages in training sets. Since different languages have different word orders, there might be some similarity measures defined in these papers that are primarily content-based.
Look into ACL conference proceedings (ACLweb.org)
Uses word counts, semantic overlap to quantify similarity. This is done by stemming the words in each document, and throwing out the stop words (i.e. common words with little or no semantic content). The results are usually compared by cosine similarity.
caveat: each document should be large (whereas term descriptions are typically only a sentence or two.)
Description size could be increased by pulling in the synonyms of every word. But we'd have to be careful of what sense the original word is used in, and whether the synonyms are used in the same sense.
Latent Semantic Indexing
Looks for the main dimensions of variation, similar to PCA, eigenvalue methods. Uses a name-by-description matrix.
Look into how this works and if it's possible to implement in a short-term project.
More on Stop Words
Several lists of common stop words already exist. Typically, people subtract/add words as needed for the particular application (e.g. remove 'across' because it contains information about spatial relationships, add 'data' because all descriptions are about data).
WordNet is a large thesaurus compiled by hand (NOT automated). It's structured in a hierarchy tree such that the synonyms, meronyms, hyponyms, meronyms, etc. of a word can be located via tree traversals.
Can be used to build something similar to a latent semantic index. This would probably be error-prone though- e.g. the most common sense of a word isn't necessarily the one we want.
Is there a way to look at ontological distances between any two words in the hierarchy tree?
Look into ontology alignment.
- Good parsers/taggers/chunkers are slow, especially when used on long sentences. Is that a problem here?
- It's conceivable that term descriptions may not contain complete sentences- which will likely cause problems for most parsers. Perhaps there are resources that help determine whether a sentence is well-formed.
- Given the length of this project, it would be easier to stick with unsupervised rather than semi-supervised learning algorithms.
Short Term Goals
Work on functions to handle stemming and word counts.
Look into ACL documents, latent semantic indexing, ontology alignment and the capabilities of WordNet.