Wednesday, July 8, 2009

Added to the SVN trunk:
  1. nltk_stopwords.txt: the list of stop words taken from the NLTK web site. Many of them aren't applicable to our purposes (e.g. yourselves, alone, greetings), and some we may want to treat as useful words rather than stop words (e.g. contains, value).
  2. vdc_stopwords.txt: a more basic list, containing mostly obvious stop words (e.g. the, as, or) and FGDC-specific words (e.g. repeat, which appears only in "repeat as needed").
Links to the stop word lists can be found here:
https://code.ecoinformatics.org/code/vdc/projects/machlearn/trunk/

Modifications to prep_fns.py:
  1. Added readList(), a function that reads the lines of a file into a list, primarily for reading files of stop words.
  2. Modified readTerms() to lowercase all input. This facilitates comparisons between tokens.
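
For illustration, the two changes above might look roughly like the sketch below. This is not the code in prep_fns.py: the function names read_list_sketch and lowercase_terms are hypothetical stand-ins, and skipping blank lines is an assumption of the sketch rather than something readList() is known to do.

```python
def read_list_sketch(path):
    """Return the non-blank lines of a text file as a list of strings.

    Hypothetical stand-in for readList(); skipping blank lines and
    stripping whitespace are assumptions made for this sketch.
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def lowercase_terms(terms):
    """Lowercase every term so later token comparisons ignore case.

    Mirrors the idea behind the readTerms() change, not its actual code.
    """
    return [term.lower() for term in terms]
```

With the input lowercased, a check such as `token in stop_words` behaves the same for "The" and "the", provided the stop word file is lowercase as well.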

TO DO:
  1. Encapsulate some WordNet functions to use in processing term descriptions.
  2. Write a function that filters and compares descriptions, and returns a similarity ranking for the entire list (with respect to a single term).
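
One possible shape for the second item is sketched below. It is only a sketch under stated assumptions: descriptions are assumed to be held in a dict mapping each term to its description text, filtering is assumed to mean dropping stop words, and Jaccard overlap of the remaining tokens is used as a simple placeholder for the similarity measure (a WordNet-based measure could replace it later). All function names here are hypothetical.

```python
def tokenize(description, stop_words):
    """Lowercase a description, strip edge punctuation, drop stop words."""
    words = (word.strip(".,;:()\"'") for word in description.lower().split())
    return {word for word in words if word and word not in stop_words}


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(target, descriptions, stop_words):
    """Rank every other term by how similar its description is to target's.

    descriptions: dict mapping term -> description text (assumed layout).
    Returns a list of (term, score) pairs, highest score first.
    """
    target_tokens = tokenize(descriptions[target], stop_words)
    scores = [
        (term, jaccard(target_tokens, tokenize(text, stop_words)))
        for term, text in descriptions.items()
        if term != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Returning the full ranked list, rather than only the best match, keeps the choice of threshold outside the function.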
