This morning, I started thinking about how to generate a list of stop words. Looks like it can't be just by frequency though. Here are the 100 most common tokens appearing in the FGDC term descriptions:
['the', '.', 'of', ',', 'data', 'and', 'a', 'set', 'or', 'to', 'for', 'in', '(', 'as', 'is', '"', 'used', 'which', 'information', 'needed', 'Repeat', 'name', 'that', 'projection', '-', 'an', 'by', 'be', ')', 'on', ':', 'from', 'description', 'values', 'system', 'are', 'parameters', 'about', 'point', 'reference', 'time', 'metadata', 'contains', 'expressed', 'The', 'identification', 'coordinate', 'attribute', 'value', 'with', 'Rank', 'Taxon', 'date', 'measure', 'Information', 's', 'objects', 'map', 'any', 'Earth', 'surface', 'include', 'number', 'distance', 'accuracy', 'provided', 'file', 'field', 'longitude', 'can', 'latitude', 'event', 'means', 'geologic', 'restrictions', 'organization', 'other', 'Taxonomic', 'two', 'this', 'This', 'individual', 'Name', 'Value', 'For', 'between', 'Classification', 'use', 'example', 'not', ').', 'assigned', 'source', 'spatial', 'standard', 'measured', 'computer', 'period', 'using', 'line']
Some of these obviously are more useful to keep than others, and there's no clear break between frequent, useless words and rarer useful ones (e.g. 'time' appears early in the list but 'using' is in position 99).
Is it worth trying to automatically generate a list of stop words? Do such lists already exist? Or would it make sense to start with an ad hoc list and refine it as time goes on?
-----
Recently, IDLE (the development environment packaged with python) has been maxing out my CPU, even right after a restart. I upgraded to the newest version of Python, discovered that NLTK isn't supported on it, and installed the last version that works with NLTK.
Unfortunately, re-installing NLTK puts it where only the older version (the one I was using prior to this mess) can access it. Not sure how to fix this problem- I can't find any uninstall option for any of these.
I'm also looking for a different development environment that won't tax the processor so badly, but haven't gotten anything up and running yet.
UPDATE: Got IDLE so it sees NLTK again. I can get back to work again! Except that IDLE still maxes out the CPU.