Tuesday, June 16, 2009

Notes From Meeting with Bruce

  1. It looks like Namrata and I have slightly different parser output formats. Also, terms and descriptions are associated by context only: they're on adjacent lines. A three-column format might be better; XML would also be an improvement, since XML has existing parsers and the FGDC bio hierarchy is already in XML.
  2. Add more explicit documentation to the code, as well as commit comments.
  3. Similarity scores.
Similarity scores are numerical values that will be used to describe the similarity between either two descriptions or two term/description pairs. They'll be based on the number of tokens that appear in both descriptions.
There are a few issues involved in computing them:
  1. Longer descriptions are more likely to share more tokens.
  2. Descriptions may not be the same length.
  3. Stemming can affect the number of matches.
  4. So can stop words and stock words.
As far as description length is concerned, the score can be normalized by the arithmetic or geometric mean of the two description lengths (but which one is better suited to this purpose?). A first-pass approach might be to sum the percentage of tokens from description A that appear in description B and the percentage of tokens from description B that appear in description A.
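
To make the options above concrete, here is a rough Python sketch of the three candidate scores. It is only an illustration under a few assumptions of mine, not the parser's actual code: tokens are lowercased whitespace-separated words with surrounding punctuation stripped, each description is treated as a set of distinct tokens (so its "length" is the number of distinct tokens), no stemming is applied, and the names `tokenize` and `similarity_scores` are made up for this example.

```python
import math


def tokenize(text, stop_words=frozenset()):
    """Return the set of distinct lowercase tokens in text.

    Assumption: tokens are whitespace-separated words with surrounding
    punctuation stripped. No stemming is done here.
    """
    words = (word.strip(".,;:!?()\"'").lower() for word in text.split())
    return {w for w in words if w and w not in stop_words}


def similarity_scores(desc_a, desc_b, stop_words=frozenset()):
    """Compute three candidate similarity scores for two descriptions.

    "Length" here means the number of distinct tokens (an assumption).
    """
    a = tokenize(desc_a, stop_words)
    b = tokenize(desc_b, stop_words)
    if not a or not b:
        # Nothing to compare if either description has no usable tokens.
        return {"arithmetic": 0.0, "geometric": 0.0, "summed_fraction": 0.0}

    shared = len(a & b)  # tokens that appear in both descriptions
    return {
        # Shared count divided by the arithmetic mean of the two lengths.
        "arithmetic": shared / ((len(a) + len(b)) / 2),
        # Shared count divided by the geometric mean of the two lengths.
        "geometric": shared / math.sqrt(len(a) * len(b)),
        # First-pass idea: fraction of A found in B plus fraction of B found in A.
        "summed_fraction": shared / len(a) + shared / len(b),
    }


scores = similarity_scores("The quick brown fox", "the brown dog")
no_stop = similarity_scores("The quick brown fox", "the brown dog", stop_words={"the"})
```

With set-based tokens, the summed-fraction score works out to twice the shared count divided by the harmonic mean of the two lengths, so all three candidates are the same shared count normalized by a different mean. The arithmetic and geometric versions stay between 0 and 1, while the summed version ranges from 0 to 2.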

TO DO:
  1. Read about XML.
  2. Read about SVN keywords.
  3. Similarity scores.
