- Looks like Namrata and I have slightly different parser output formats. Also, terms and descriptions are associated by context only: they're on adjacent lines. A three-column format might be better; XML would also be an improvement, since XML has existing parsers and the FGDC bio hierarchy is already in XML.
- Add more explicit documentation to the code, as well as commit comments.
- Similarity scores.
There are a couple of issues involved in computing them:
- Longer descriptions are more likely to share more tokens.
- Descriptions may not be the same length.
- Stemming can affect the number of matches.
- So can stop words and stock words.
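To make these issues concrete, here is a rough Python sketch of one possible score, not necessarily the one we'll end up using. It uses Jaccard similarity (shared tokens divided by all distinct tokens), which is one way to keep longer descriptions from scoring higher just because they are long. The stop-word list and the `crude_stem` helper are made up for illustration; a real version would use a fuller list and a proper stemmer such as Porter's.

```python
import re

# Hypothetical stop-word list, for illustration only; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "or", "for"}

def crude_stem(token):
    """Strip a trailing plural 's'. A stand-in for a real stemmer, not a replacement."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def tokens(description, stem=True):
    """Lowercase, split into alphabetic words, drop stop words, optionally stem."""
    words = re.findall(r"[a-z]+", description.lower())
    words = [w for w in words if w not in STOP_WORDS]
    if stem:
        words = [crude_stem(w) for w in words]
    return set(words)

def jaccard(desc_a, desc_b, stem=True):
    """Shared tokens divided by all distinct tokens in either description."""
    a, b = tokens(desc_a, stem), tokens(desc_b, stem)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up descriptions: stemming changes the number of matches.
with_stemming = jaccard("The names of the species", "Species name in the survey")
without_stemming = jaccard("The names of the species", "Species name in the survey", stem=False)
```

With stemming, "names" and "name" count as a match; without it they don't, so the same pair of descriptions gets a lower score.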
TO DO:
- Read about XML.
- Read about SVN keywords.
- Similarity scores.
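As a starting point for the XML item, here is a small Python sketch of what an XML parser output could look like, with each term tied to its description explicitly instead of by adjacent lines. The element names (`glossary`, `entry`, `term`, `description`) and the sample pairs are my own placeholders, not the FGDC schema or either of our current formats.

```python
import xml.etree.ElementTree as ET

def to_xml(pairs):
    """Serialize (term, description) pairs; element names are hypothetical."""
    root = ET.Element("glossary")
    for term, description in pairs:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "term").text = term
        ET.SubElement(entry, "description").text = description
    return ET.tostring(root, encoding="unicode")

def from_xml(xml_text):
    """Read the pairs back using the standard-library parser."""
    root = ET.fromstring(xml_text)
    return [(e.findtext("term"), e.findtext("description"))
            for e in root.findall("entry")]

# Made-up sample data, including characters that XML needs to escape.
sample = [("Taxon", "A named group of organisms."),
          ("Range & habitat", "Where it lives <roughly>.")]
round_trip = from_xml(to_xml(sample))
```

The point of the round trip is that the standard parser handles escaping and pairing for us, so neither parser has to rely on line order.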