Wednesday, July 8, 2009

Added to the SVN trunk:
  1. nltk_stopwords.txt- the list of stop words taken from the NLTK web site. A lot of them aren't applicable to our purposes (e.g. yourselves, alone, greetings) but some of them we may want to treat as useful words (e.g. contains, value).
  2. vdc_stopwords.txt- a more basic list, containing mostly obvious stop words (e.g. the, as, or) and FGDC-specific words (e.g. repeat, which only appears in "repeat as needed").
Links to the stop word lists can be found here:

Modifications to
  1. Added readList(), a function that reads the lines of a file into a list. Primarily for use in reading files of stop words.
  2. Modified readTerms() to lowercase all input. This facilitates comparisons between tokens.
  1. Encapsulate some WordNet functions to use in processing term descriptions.
  2. Write a function that filters and compares descriptions, and returns a similarity ranking for the entire list (with respect to a single term).

No comments:

Post a Comment