- nltk_stopwords.txt - the list of stop words taken from the NLTK web site. Many of them don't apply to our purposes (e.g. yourselves, alone, greetings), while others we may want to treat as useful words rather than stop words (e.g. contains, value).
- vdc_stopwords.txt - a more basic list, containing mostly obvious stop words (e.g. the, as, or) and FGDC-specific words (e.g. repeat, which only appears in the phrase "repeat as needed").
https://code.ecoinformatics.org/code/vdc/projects/machlearn/trunk/
Modifications to prep_fns.py:
- Added readList(), a function that reads the lines of a file into a list. Primarily for use in reading files of stop words.
- Modified readTerms() to lowercase all input. This facilitates comparisons between tokens.
- Encapsulate some WordNet functions to use in processing term descriptions.
- Write a function that filters and compares descriptions, and returns a similarity ranking for the entire list (with respect to a single term).
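To make the items above concrete, here is a rough sketch of how reading a stop-word list, lowercasing, filtering, and ranking descriptions might fit together. It is not the code in prep_fns.py: read_list, filter_tokens, and rank_descriptions are hypothetical names, the sample descriptions are made up, and plain Jaccard token overlap stands in for the WordNet-based comparison, which is not shown here.

```python
import io

def read_list(handle):
    """Return the non-empty lines of an open text file, stripped and lowercased."""
    return [line.strip().lower() for line in handle if line.strip()]

def filter_tokens(description, stop_words):
    """Lowercase a description, split on whitespace, and drop stop words."""
    tokens = (tok.strip('.,;:()"\'') for tok in description.lower().split())
    return {tok for tok in tokens if tok and tok not in stop_words}

def rank_descriptions(target, descriptions, stop_words):
    """Rank every other term by Jaccard overlap with the target's description.

    Assumption: token overlap is a stand-in for a WordNet-based measure.
    """
    target_tokens = filter_tokens(descriptions[target], stop_words)
    scores = []
    for term, text in descriptions.items():
        if term == target:
            continue
        tokens = filter_tokens(text, stop_words)
        union = target_tokens | tokens
        score = len(target_tokens & tokens) / len(union) if union else 0.0
        scores.append((term, score))
    # Highest similarity first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# A small in-memory stand-in for a stop-word file such as vdc_stopwords.txt.
stop_words = set(read_list(io.StringIO("the\nas\nor\nof\nrepeat\n")))

# Made-up term descriptions, for illustration only.
descriptions = {
    "west_bound": "The western-most coordinate of the data set",
    "east_bound": "The eastern-most coordinate of the data set",
    "title": "Name by which the data set is known",
}

ranking = rank_descriptions("west_bound", descriptions, stop_words)
for term, score in ranking:
    print(term, round(score, 3))
```

With a real file, read_list would be called on the result of open("vdc_stopwords.txt"). Lowercasing on read keeps the stop-word comparison consistent with lowercased term input.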