Tuesday, June 30, 2009
Still having trouble with IDLE. It shouldn't constantly stop responding and take up 70% of the CPU just because I click on the window and start typing.
Looked into using Eclipse with PyDev; either it can't properly find the Python interpreter, or I can't.
Looked into using Eric, but couldn't install one of the prerequisites because it appears that a piece of it is missing.
It's possible to use the Unix terminal, but it doesn't like having multi-line scripts pasted into it.
Monday, June 29, 2009
A couple of things
This morning, I started thinking about how to generate a list of stop words. Looks like it can't be just by frequency though. Here are the 100 most common tokens appearing in the FGDC term descriptions:
['the', '.', 'of', ',', 'data', 'and', 'a', 'set', 'or', 'to', 'for', 'in', '(', 'as', 'is', '"', 'used', 'which', 'information', 'needed', 'Repeat', 'name', 'that', 'projection', '-', 'an', 'by', 'be', ')', 'on', ':', 'from', 'description', 'values', 'system', 'are', 'parameters', 'about', 'point', 'reference', 'time', 'metadata', 'contains', 'expressed', 'The', 'identification', 'coordinate', 'attribute', 'value', 'with', 'Rank', 'Taxon', 'date', 'measure', 'Information', 's', 'objects', 'map', 'any', 'Earth', 'surface', 'include', 'number', 'distance', 'accuracy', 'provided', 'file', 'field', 'longitude', 'can', 'latitude', 'event', 'means', 'geologic', 'restrictions', 'organization', 'other', 'Taxonomic', 'two', 'this', 'This', 'individual', 'Name', 'Value', 'For', 'between', 'Classification', 'use', 'example', 'not', ').', 'assigned', 'source', 'spatial', 'standard', 'measured', 'computer', 'period', 'using', 'line']
Some of these obviously are more useful to keep than others, and there's no clear break between frequent, useless words and rarer useful ones (e.g. 'time' appears early in the list but 'using' is in position 99).
Is it worth trying to automatically generate a list of stop words? Do such lists already exist? Or would it make sense to start with an ad hoc list and refine it as time goes on?
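Such lists do exist: NLTK ships an English stop word list in its stopwords corpus (it has to be downloaded once via nltk.download()). As a rough sketch, assuming the descriptions sit in a dictionary like the one termReader() builds, the frequency list above could be generated and compared against that built-in list. The function names here are placeholders, not code in the repository.

# Rough sketch: rank tokens by frequency across all term descriptions and
# see which frequent ones aren't already in NLTK's English stop word list.
# Assumes `terms` maps term names to description strings.
from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords

def frequent_tokens(terms, n=100):
    counts = {}
    tokenizer = WordPunctTokenizer()
    for description in terms.values():
        for token in tokenizer.tokenize(description):
            counts[token] = counts.get(token, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n]

def not_standard_stop_words(frequent):
    english = set(stopwords.words('english'))
    return [token for token in frequent if token.lower() not in english]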
-----
Recently, IDLE (the development environment packaged with Python) has been maxing out my CPU, even right after a restart. I upgraded to the newest version of Python, discovered that NLTK isn't supported on it, and installed the last version that works with NLTK.
Unfortunately, re-installing NLTK puts it where only the older version (the one I was using prior to this mess) can access it. Not sure how to fix this problem; I can't find an uninstall option for any of these.
I'm also looking for a different development environment that won't tax the processor so badly, but haven't gotten anything up and running yet.
UPDATE: Got IDLE to see NLTK again, so I can get back to work. Except that IDLE still maxes out the CPU.
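One thing that helps untangle which copy of NLTK a given interpreter sees (nothing project-specific here, just standard library checks):

# Check which interpreter is running and where it actually loads NLTK from;
# if nltk.__file__ points into another Python version's site-packages, this
# interpreter won't see that copy.
import sys
print(sys.executable)   # the Python binary in use
print(sys.path)         # the directories it searches for modules
import nltk
print(nltk.__file__)    # the location NLTK was imported from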
Friday, June 26, 2009
In terms of the overall project, my plan is to analyze the term descriptions using natural language tools, with the output of this process being two lists of one or more tokens. A comparison function would quantify the similarity between these lists.
There seem to be two main approaches: measuring the longest common subsequence, and quantifying the overlap based on shared elements.
Since the token lists can be thought of as sequences made up of tokens, analogous to the way that strings are sequences of characters, I've been looking at a few string similarity measures. The most famous of those appear to be the following (minimal sketches of a few of them come after the list):
1. The Levenshtein (or Edit) Distance
Characterized by the minimum number of operations needed to transform one sequence into the other.
Operations include insertion, deletion, and substitution of an element (a modified version also allows transposition).
2. Jaro-Winkler Distance
Characterized by the number of elements in one sequence that match elements in the other. In this case, the concept of matching includes not only the content of an element but also its position within the sequence; sequences that match from the beginning are considered more similar than those that do not.
Works best on short sequences.
3. Dice Coefficient
Compares two sets. When strings are compared, the set elements are typically bigrams, rather than characters.
4. Jaccard Index:
Compares two sets by dividing their intersection by their union.
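For reference, here are minimal sketches of the Levenshtein distance, Dice coefficient, and Jaccard index applied to lists of tokens rather than strings of characters. These are illustrations only, not the measure I'd necessarily use.

def levenshtein(a, b):
    # Dynamic-programming edit distance over two token sequences.
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, 1):
        current = [i]
        for j, token_b in enumerate(b, 1):
            cost = 0 if token_a == token_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def dice(a, b):
    # 2 * |intersection| / (|A| + |B|), treating each token list as a set.
    set_a, set_b = set(a), set(b)
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))

def jaccard(a, b):
    # |intersection| / |union|
    set_a, set_b = set(a), set(b)
    return float(len(set_a & set_b)) / len(set_a | set_b)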
None of these jumps out as an ideal measure:
The Levenshtein Distance and Jaro-Winkler Distance emphasize the position of elements in the sequence, whereas what we're interested in is more the syntactic relatedness of elements within the sequences. In addition, calculating the Levenshtein distance appears to be somewhat time-consuming.
The Dice Coefficient and Jaccard Index both rely on sets, which by nature ignore duplicate elements, but the duplication of the same word in two descriptions may be a good indicator of similarity.
Perhaps it would make sense to modify one or more of these to create an applicable similarity measure or measures?
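One possible modification along those lines, sketched here as a guess rather than a settled design: a Dice-style overlap computed on token counts (multisets), so that a word repeated in both descriptions counts more than once.

# Dice-style coefficient over multisets: duplicate tokens contribute to the
# overlap instead of being collapsed away by set().
def multiset_dice(a, b):
    counts_a, counts_b = {}, {}
    for token in a:
        counts_a[token] = counts_a.get(token, 0) + 1
    for token in b:
        counts_b[token] = counts_b.get(token, 0) + 1
    overlap = sum(min(count, counts_b.get(token, 0))
                  for token, count in counts_a.items())
    return 2.0 * overlap / (len(a) + len(b))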
I also had a look at the Lucene similarity class. It looks good, but it'd take a lot of reworking, because it's intended to match pre-indexed documents to a set of keywords, rather than to compare two lists of tokens.
Any ideas? What am I overlooking?
Wednesday, June 24, 2009
New function added:
readTerms()
reads terms from a .txt version of the XML format for terms, and returns a Python dictionary. The keys are always the term names, but the user may specify whether the values are the terms' description, hierarchy path, type, domain, or short name. It is also possible to specify that all term information be included in the value field.
The format of the XML file read by readTerms() is pasted below, from an e-mail from Dave (the tag markup itself didn't survive the paste; a hypothetical reconstruction follows the sample).
NOTE: I've added extra tags for the FGDC-specific fields of type, domain and short name.
/simple/xpath/to/term
Description of the term.
type of term
domain over which the term is relevant
short name of term
...
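Since the markup was stripped when I pasted it in, here is a hypothetical reconstruction together with a sketch of how readTerms() might parse it with the standard library. Every tag name below is a guess for illustration; the real tags are the ones from Dave's e-mail.

# Hypothetical reconstruction and reader sketch. All tag names here
# (<terms>, <term>, <name>, <path>, <description>, <type>, <domain>,
# <shortname>) are placeholders, not the actual format.
#
# <terms>
#   <term>
#     <name>Term Name</name>
#     <path>/simple/xpath/to/term</path>
#     <description>Description of the term.</description>
#     <type>type of term</type>
#     <domain>domain over which the term is relevant</domain>
#     <shortname>short name of term</shortname>
#   </term>
#   ...
# </terms>
import xml.etree.ElementTree as ET

def read_terms(term_file, value_field='description'):
    # Return {term name: requested field}; value_field='all' keeps everything.
    terms = {}
    for term in ET.parse(term_file).getroot().findall('term'):
        name = term.findtext('name')
        if value_field == 'all':
            terms[name] = dict((child.tag, child.text) for child in term)
        else:
            terms[name] = term.findtext(value_field)
    return terms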
Tuesday, June 23, 2009
Finished the extract() function.
Its output can be found in my SVN trunk, in fgdc_files/fgdc_terms.txt
Steps to get this output:
1. Save the FGDC standard as a .txt document.
2. Run the paths() function to create a dictionary of each term and its hierarchy path.
3. Pass the dictionary to dictWriter(), to save it.
Or put it directly into extract(), as a parameter.
4. Run extract(). It will ask for the following file paths:
- FGDC Standard
- FGDC hierarchy
- Where to put the output
- URL of FGDC Standard
NOTE: Not all of the terms in fgdc_terms.txt have hierarchy paths, due to inconsistencies between the FGDC standard and the hierarchy XML file.
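For my own reference, the steps chain together roughly like this; the signatures are approximate, since extract() actually prompts for its file paths interactively and the file names here are placeholders.

# Approximate usage sketch; not the exact calls in the repository.
hierarchy_paths = paths('fgdc_hierarchy.txt')    # step 2: term -> hierarchy path
dictWriter(hierarchy_paths, 'fgdc_paths.txt')    # step 3: save it (tab-delimited by default)
extract()                                        # step 4: prompts for the FGDC standard,
                                                 # hierarchy, output location, and standard URL
# Alternatively, the dictionary can be passed straight into extract() as a parameter.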
Saturday, June 20, 2009
Friday, June 19, 2009
New Additions
Wrote a couple of new functions for dealing with the FGDC hierarchy:
paths()
reads a .txt version of the XML hierarchy for the FGDC bio standard. Returns a dictionary containing the short name of each term as a key, and its path as a value.
The algorithm takes advantage of the fact that parent elements are defined before their children, and that each parent term contains references to each of its children. The file is read line by line; the root element is written into the dictionary with identical key and value. The paths of its children are created by appending each child's name to the parent's path. As each subsequent element is encountered, its path is looked up in the dictionary, and its children's hierarchy paths are added to the dictionary.
dictWriter()
writes an existing dictionary (e.g. the pathway dictionary) to a text file, separating keys and values with a user-defined delimiter (default is tab).
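A rough sketch of the idea, with the input format simplified to something hypothetical (one line per parent listing its children, root first); the real paths() parses the .txt dump of the hierarchy XML, so this is an illustration of the algorithm rather than the committed code.

# Sketch only: builds the short-name -> hierarchy-path dictionary, relying on
# parents being defined before their children. The "parent child1 child2 ..."
# line format is assumed for illustration and is NOT the actual file format.
def paths_sketch(hierarchy_file):
    hierarchy = {}
    for line in open(hierarchy_file):
        names = line.split()
        if not names:
            continue
        parent = names[0]
        if not hierarchy:
            hierarchy[parent] = parent            # root: identical key and value
        parent_path = hierarchy[parent]
        for child in names[1:]:
            hierarchy[child] = parent_path + '/' + child
    return hierarchy

def dict_writer_sketch(dictionary, out_file, delimiter='\t'):
    # One key/value pair per line, separated by the delimiter (tab by default).
    out = open(out_file, 'w')
    for key, value in dictionary.items():
        out.write(key + delimiter + str(value) + '\n')
    out.close()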
UPDATE: Added a directory to the SVN repository (in trunk) to hold the FGDC bio standard and related text files (including the output of path() and dictWriter())
Tuesday, June 16, 2009
Notes From Meeting with Bruce
- Looks like Namrata and I have slightly different parser output formats. And the terms and descriptions are associated by context only; they're on adjacent lines. A three-column format might be better; XML would also be an improvement. XML has existing parsers, and the FGDC bio hierarchy is already in XML.
- Add more explicit documentation to the code, as well as commit comments.
- Similarity scores.
There are a couple of issues involved in computing them (a rough preprocessing sketch follows the list):
- Longer descriptions are more likely to share more tokens.
- Descriptions may not be the same length.
- Stemming can affect the number of matches.
- So can stop words and stock words.
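A rough sketch of the preprocessing these points suggest, using NLTK's Porter stemmer and its English stop word list; length differences are left to the similarity measure itself (e.g. one that divides by the list lengths). The function name is a placeholder.

# Normalize a token list before scoring: drop punctuation-only tokens and
# stop words, lowercase, and stem what's left.
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

def normalize(tokens):
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    return [stemmer.stem(token.lower())
            for token in tokens
            if token.isalpha() and token.lower() not in stop]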
TO DO:
- Read about XML.
- Read about SVN keywords.
- Similarity scores.
Monday, June 15, 2009
Progress...
Uploaded two new Python functions:
1. termReader(term_file, terms = {}) reads a text file with term/description pairs (format described below) and writes them into a dictionary. The terms parameter allows for the addition of terms to an existing dictionary.
2. tokens(terms) takes a dictionary and tokenizes the values using NLTK's WordPunctTokenizer(). It currently writes the tokenized lists back into the same dictionary; I'm planning to change that (a rough sketch of an alternative follows below). I'll also try to think of a more apt name for this function.
(Question: Is context information lost when the strings are tokenized?)
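A rough sketch of the alternative I have in mind, returning a new dictionary instead of overwriting the descriptions (the name is a placeholder, not what's committed):

# Tokenize each description with NLTK's WordPunctTokenizer, leaving the
# original term -> description dictionary untouched.
from nltk.tokenize import WordPunctTokenizer

def tokenize_terms(terms):
    tokenizer = WordPunctTokenizer()
    return dict((term, tokenizer.tokenize(description))
                for term, description in terms.items())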
All in all, this is moving along a little slowly so far, because I haven't used Python in a long time. There's a lot of stopping to look things up. As I get a better feel for it, chances are I'll come back to this early code and take advantage of more of the language's functionality, instead of looping through everything.
Friday, June 12, 2009
This week, I've been putting together some Python code to extract the fields and descriptions from the FGDC bio standard, which can be found here:
http://www.nbii.gov/images/uploaded/151871_1162508205217_BDP-workbook.doc
It takes terms in this format:
1.2.3 Term -- description blah blah blah
Type: type
Short Name: shortname
and writes a text file in this format:
term = "Term"
description = " description blah blah blah
Type: type
Short Name: shortname
"
This is the same output format as Namrata uses, where the term name and description are enclosed in quotes.
The bio elements in FGDC begin with BDP1.2.3 instead of 1.2.3, so I'll take that into account in the code before committing it to SVN.
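The matching boils down to a line pattern roughly like the one below; this regular expression is an approximation based on the examples above, not the exact code committed to SVN.

import re

# Matches lines such as "1.2.3 Term -- description" and
# "BDP1.2.3 Term -- description", capturing the term name and the start
# of its description.
TERM_LINE = re.compile(r'^(?:BDP)?\d+(?:\.\d+)*\s+(.+?)\s+--\s+(.*)$')

def parse_term_line(line):
    match = TERM_LINE.match(line)
    if match:
        term, description = match.groups()
        return term, description
    return None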
--------- HOWEVER --------
So far, I haven't written in a way to preserve the hierarchy. The numbering 1.2.3 alludes to it, but the actual names of the levels are given in diagrams of the FGDC standard. Not sure if there's a(n easy) way to automatically pull out that information. Ideas are welcome.
Otherwise, I'll probably store that information in the code itself and access it as needed, for the sake of having it working soon with the current standard at least.
UPDATE: Committed to SVN.