Tuesday, June 30, 2009

Still having trouble with IDLE. It shouldn't constantly stop responding and take up 70% of the CPU just because I click on the window and start typing.

Looked into using Eclipse with PyDev; either it can't properly find the Python interpreter, or I can't figure out how to point it at one.

Looked into using Eric, but couldn't install one of the prerequisites because it appears that there's a piece of it missing.

It's possible to use the Unix terminal, but it doesn't like having multi-line scripts pasted into it.

Monday, June 29, 2009

A couple of things

This morning, I started thinking about how to generate a list of stop words. It looks like frequency alone won't be enough, though. Here are the 100 most common tokens appearing in the FGDC term descriptions:

['the', '.', 'of', ',', 'data', 'and', 'a', 'set', 'or', 'to', 'for', 'in', '(', 'as', 'is', '"', 'used', 'which', 'information', 'needed', 'Repeat', 'name', 'that', 'projection', '-', 'an', 'by', 'be', ')', 'on', ':', 'from', 'description', 'values', 'system', 'are', 'parameters', 'about', 'point', 'reference', 'time', 'metadata', 'contains', 'expressed', 'The', 'identification', 'coordinate', 'attribute', 'value', 'with', 'Rank', 'Taxon', 'date', 'measure', 'Information', 's', 'objects', 'map', 'any', 'Earth', 'surface', 'include', 'number', 'distance', 'accuracy', 'provided', 'file', 'field', 'longitude', 'can', 'latitude', 'event', 'means', 'geologic', 'restrictions', 'organization', 'other', 'Taxonomic', 'two', 'this', 'This', 'individual', 'Name', 'Value', 'For', 'between', 'Classification', 'use', 'example', 'not', ').', 'assigned', 'source', 'spatial', 'standard', 'measured', 'computer', 'period', 'using', 'line']

Some of these are obviously more useful to keep than others, and there's no clear break between frequent, useless words and rarer, useful ones (e.g. 'time' appears early in the list, but 'using' is in position 99).

Is it worth trying to automatically generate a list of stop words? Do such lists already exist? Or would it make sense to start with an ad hoc list and refine it as time goes on?
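In the meantime, here's a rough sketch of how I might generate candidates: count token frequencies over all the descriptions, then subtract NLTK's built-in English stop word list (nltk.corpus.stopwords, which at least partly answers the "do such lists already exist?" question). The descriptions variable and the candidate_stopwords name are just placeholders for this sketch.

# Sketch only: 'descriptions' stands in for a dict of term name -> description
# string, like the one termReader() builds.
from nltk.corpus import stopwords            # may need nltk.download() first
from nltk.tokenize import WordPunctTokenizer

def candidate_stopwords(descriptions, n=100):
    tokenizer = WordPunctTokenizer()
    counts = {}
    for text in descriptions.values():
        for token in tokenizer.tokenize(text):
            token = token.lower()
            counts[token] = counts.get(token, 0) + 1
    most_common = sorted(counts, key=counts.get, reverse=True)[:n]
    standard = set(stopwords.words('english'))
    # frequent tokens that are NOT already standard stop words are the ones
    # that need a judgment call ('data', 'set', 'information', ...)
    return [t for t in most_common if t not in standard]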

-----

Recently, IDLE (the development environment packaged with Python) has been maxing out my CPU, even right after a restart. I upgraded to the newest version of Python, discovered that NLTK isn't supported on it, and installed the most recent version of Python that NLTK does work with.

Unfortunately, re-installing NLTK puts it where only the older version (the one I was using prior to this mess) can access it. Not sure how to fix this problem; I can't find an uninstall option for any of these.

I'm also looking for a different development environment that won't tax the processor so badly, but haven't gotten anything up and running yet.

UPDATE: Got IDLE to see NLTK again, so I can get back to work. Except that IDLE still maxes out the CPU.

Friday, June 26, 2009

In terms of the overall project, my plan is to analyze the term descriptions using natural language tools, reducing each description to a list of one or more tokens. A comparison function would then quantify the similarity between two such lists.

There seem to be two main approaches: measuring the longest common subsequence (or, more generally, how much editing it takes to turn one sequence into the other), and quantifying the overlap based on shared elements.

Since the token lists can be thought of as sequences made up of tokens, analogous to the way that strings are sequences of characters, I've been looking at a few string similarity measures. The most famous of those appear to be the following (rough sketches of a few of them come after the list):

1. The Levenshtein (or Edit) Distance
Characterized by the minimum number of operations needed to transform one sequence into the other.
Operations include insertion, deletion, and substitution of an element (a modified version, the Damerau-Levenshtein distance, also allows transposition of adjacent elements).

2. Jaro-Winkler Distance
Characterized by the number of elements in one sequence that match elements in another; here, matching takes into account not only the content of an element but also its position within the sequence. Sequences that agree from the beginning (i.e. share a prefix) are considered more similar than those that do not.
Works best on short sequences.

3. Dice Coefficient
Compares two sets: twice the size of their intersection, divided by the sum of their sizes. When strings are compared, the set elements are typically character bigrams rather than individual characters.

4. Jaccard Index
Compares two sets by dividing the size of their intersection by the size of their union.
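To get a feel for these, here are quick sketches of the Levenshtein distance and the two set-based measures, written so they work directly on lists of tokens rather than only on strings. These are just for experimenting, not tuned implementations.

# Textbook dynamic-programming edit distance; works on any two sequences.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a):
        curr = [i + 1]
        for j, y in enumerate(b):
            cost = 0 if x == y else 1
            curr.append(min(prev[j + 1] + 1,   # deletion
                            curr[j] + 1,       # insertion
                            prev[j] + cost))   # substitution
        prev = curr
    return prev[len(b)]

# Dice coefficient: twice the shared elements over the total set sizes.
def dice(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

# Jaccard index: shared elements over the union of both sets.
def jaccard(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / float(len(a | b))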

None of these jumps out as an ideal measure:
The Levenshtein and Jaro-Winkler distances emphasize the position of elements in the sequence, whereas what we're interested in is more the syntactic relatedness of elements within the sequences. In addition, calculating the Levenshtein distance appears to be somewhat time-consuming.

The Dice Coefficient and Jaccard Index both rely on sets, which by nature ignore duplicate elements, but the duplication of the same word in two descriptions may be a good indicator of similarity.

Perhaps it would make sense to modify one or more of these to create an applicable similarity measure or measures?
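One modification I might try (just a sketch, assuming that keeping duplicates really is useful) is a "multiset" version of the Jaccard index that counts repeated tokens instead of discarding them:

def multiset_jaccard(tokens_a, tokens_b):
    counts_a, counts_b = {}, {}
    for t in tokens_a:
        counts_a[t] = counts_a.get(t, 0) + 1
    for t in tokens_b:
        counts_b[t] = counts_b.get(t, 0) + 1
    vocab = set(counts_a) | set(counts_b)
    # intersection/union generalized to counts: min and max per token
    shared = sum(min(counts_a.get(t, 0), counts_b.get(t, 0)) for t in vocab)
    total = sum(max(counts_a.get(t, 0), counts_b.get(t, 0)) for t in vocab)
    return shared / float(total) if total else 1.0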

I also had a look at the Lucene similarity class; it looks good, but it would take a lot of reworking, because it's intended to match pre-indexed documents to a set of keywords, rather than to compare two lists of tokens.

Any ideas? What am I overlooking?

Wednesday, June 24, 2009

New function added:

readTerms()
reads terms from a .txt version of the XML format for terms and returns a Python dictionary. The keys are always the term names, but the user may specify whether the values are the terms' descriptions, hierarchy paths, types, domains, or short names. It is also possible to specify that all term information be included in the value field.

The XML file read by readTerms() follows a format from an e-mail from Dave; each term entry contains:
  • an XPath-style path to the term (e.g. /simple/xpath/to/term)
  • the description of the term
  • the type of the term
  • the domain over which the term is relevant
  • the short name of the term

NOTE: I've added extra tags for the FGDC-specific fields of type, domain, and short name.
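For what it's worth, here's a hypothetical sketch of how a function like readTerms() could pull a chosen field out of such a file using xml.etree.ElementTree. The tag names (term, path, description, type, domain, shortName) are stand-ins, not the real ones from Dave's format, and the sketch assumes the term name can be taken from the last segment of the path.

import xml.etree.ElementTree as ET

def read_terms_sketch(xml_file, field='description'):
    # hypothetical tags: <term> elements with <path>, <description>,
    # <type>, <domain>, and <shortName> children
    terms = {}
    for term in ET.parse(xml_file).getroot().findall('term'):
        path = term.findtext('path', default='')
        name = path.rsplit('/', 1)[-1]       # assume term name = last path segment
        terms[name] = term.findtext(field)
    return terms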

Tuesday, June 23, 2009

Finished the extract() function.
Its output can be found in my SVN trunk, in fgdc_files/fgdc_terms.txt

Steps to get this output:
1. Save the FGDC standard as a .txt document.
2. Run the paths() function to create a dictionary of each term and its hierarchy path.
3. Pass the dictionary to dictWriter() to save it, or pass it directly to extract() as a parameter.
4. Run extract(). It will ask for the following file paths:
  • FGDC Standard
  • FGDC hierarchy
  • Where to put the output
  • URL of FGDC Standard

NOTE: Not all of the terms in fgdc_terms.txt have hierarchy paths, due to inconsistencies between the FGDC standard and the hierarchy XML file.

Saturday, June 20, 2009

Lately, when I start IDLE (the Python editor), it can't find the NLTK files to import. When I install the NLTK toolkit, IDLE can find it the first time or two, and then loses it.

I've installed NLTK twice already and am about to do it a third time. Any ideas?

Friday, June 19, 2009

New Additions

Wrote a couple of new functions for dealing with the FGDC hierarchy:

paths()
reads a .txt version of the XML hierarchy for the FGDC bio standard. Returns a dictionary containing the short name of each term as a key, and its hierarchy path as a value.

The algorithm takes advantage of the fact that parent elements are defined before their children, and that each parent term contains references to all of its children. The file is read line by line: the root element is written into the dictionary with identical key and value, and the paths of its children are created by appending each child's name to the parent's path. As each subsequent element is encountered, its path is looked up in the dictionary and its children's hierarchy paths are added in the same way.
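Here's a rough sketch of just the path-building idea, using a made-up one-line-per-parent format ("parent: child1, child2") in place of the real hierarchy file; the real paths() has to dig the names out of the XML dump, but the dictionary trick is the same.

def build_paths(lines, root):
    paths = {root: root}                        # root's key and value are identical
    for line in lines:
        parent, _, rest = line.partition(':')
        parent = parent.strip()
        children = [c.strip() for c in rest.split(',') if c.strip()]
        parent_path = paths[parent]             # parents always appear before children
        for child in children:
            paths[child] = parent_path + '/' + child
    return paths

# e.g. build_paths(['metadata: idinfo, spdom', 'idinfo: citation'], 'metadata')
# -> {'metadata': 'metadata', 'idinfo': 'metadata/idinfo',
#     'spdom': 'metadata/spdom', 'citation': 'metadata/idinfo/citation'}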

dictWriter()
writes an existing dictionary (e.g. the path dictionary produced by paths()) to a text file, separating keys and values with a user-defined delimiter (the default is a tab).
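A minimal sketch of what this boils down to, assuming one key/value pair per line:

def dict_writer(d, filename, delimiter='\t'):
    out = open(filename, 'w')
    try:
        for key, value in d.items():
            out.write('%s%s%s\n' % (key, delimiter, value))
    finally:
        out.close()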

UPDATE: Added a directory to the SVN repository (in trunk) to hold the FGDC bio standard and related text files (including the output of paths() and dictWriter()).

Tuesday, June 16, 2009

Notes From Meeting with Bruce

  1. Looks like Namrata and I have slightly different parser output formats. Also, the terms and descriptions are associated by context only: they're on adjacent lines. A three-column format might be better; XML would also be an improvement, since XML has existing parsers and the FGDC bio hierarchy is already in XML.
  2. Add more explicit documentation to the code, as well as commit comments.
  3. Similarity scores.
Similarity scores are numerical values that will be used to describe the similarity between either two descriptions or two term/description pairs. They'll be based on the number of tokens that appear in both descriptions.
There are a couple of issues involved in computing them:
  1. Longer descriptions are more likely to share more tokens.
  2. Descriptions may not be the same length.
  3. Stemming can affect the number of matches.
  4. So can stop words and stock words.
As far as description length is concerned, the score can be normalized to the arithmetic or geometric mean of the two description lengths (but which one is better suited to this purpose?). A first-pass approach might be to sum the percentage of tokens from description A appearing in description B and the percentage of tokens from description B appearing in description A.
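A quick sketch of that first-pass idea, along with the two normalization options, so I can play with them side by side (nothing here is settled):

import math

# fraction of A's tokens found in B, plus fraction of B's tokens found in A
def first_pass_score(tokens_a, tokens_b):
    if not tokens_a or not tokens_b:
        return 0.0
    set_a, set_b = set(tokens_a), set(tokens_b)
    a_in_b = sum(1 for t in tokens_a if t in set_b) / float(len(tokens_a))
    b_in_a = sum(1 for t in tokens_b if t in set_a) / float(len(tokens_b))
    return a_in_b + b_in_a              # ranges from 0.0 to 2.0

# shared-token count normalized by the arithmetic or geometric mean length
def normalized_overlap(tokens_a, tokens_b, geometric=False):
    set_b = set(tokens_b)
    shared = sum(1 for t in tokens_a if t in set_b)
    if geometric:
        norm = math.sqrt(len(tokens_a) * len(tokens_b))
    else:
        norm = (len(tokens_a) + len(tokens_b)) / 2.0
    return shared / norm if norm else 0.0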

TO DO:
  1. Read about XML.
  2. Read about SVN keywords.
  3. Similarity scores.

Monday, June 15, 2009

Progress...

Uploaded two new Python functions:

1. termReader(term_file, terms = {}) reads a text file with term/description pairs (format described below) and writes them into a dictionary. The terms parameter allows for the addition of terms to an existing dictionary.

2. tokens(terms) takes a dictionary and tokenizes the values using NLTK's WordPunctTokenizer(). It currently writes the tokenized lists back into the same dictionary; I'm planning to change that. I'll also try to think of a more apt name for this function.
(Question: Is context information lost when the strings are tokenized?)
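For reference, here's what WordPunctTokenizer() does to a made-up description string. Punctuation comes out as separate tokens, which is part of why I'm wondering about lost context:

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
print(tokenizer.tokenize('Description of the term, e.g. "data set".'))
# ['Description', 'of', 'the', 'term', ',', 'e', '.', 'g', '.', '"', 'data', 'set', '".']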

All in all, this is moving along a little slowly so far, because I haven't used Python in a long time. There's a lot of stopping to look things up. As I get a better feel for it, chances are I'll come back to this early code and take advantage of more of the language's functionality, instead of looping through everything.

Friday, June 12, 2009

This week, I've been putting together some Python code to extract the fields and descriptions from the FGDC bio standard, which can be found here:

http://www.nbii.gov/images/uploaded/151871_1162508205217_BDP-workbook.doc

It takes terms in this format:

1.2.3 Term -- description blah blah blah
Type: type
Short Name: shortname

and writes a text file in this format:

term = "Term"
description = " description blah blah blah
Type: type
Short Name: shortname
"

This is the same output format as Namrata uses, where the term name and description are enclosed in quotes.

The bio elements in FGDC begin with BDP1.2.3 instead of 1.2.3, so I'll take that into account in the code before committing it to SVN.
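Not the committed code, but a sketch of the line format being matched, with the BDP prefix made optional:

import re

# matches e.g. 'BDP1.2.3 Term -- description blah blah blah'
TERM_LINE = re.compile(r'^(BDP)?(\d+(?:\.\d+)*)\s+(.+?)\s+--\s+(.*)$')

match = TERM_LINE.match('BDP1.2.3 Term -- description blah blah blah')
if match:
    numbering, name, description = match.group(2, 3, 4)
    # numbering == '1.2.3', name == 'Term', description == 'description blah blah blah'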


--------- HOWEVER --------
So far, I haven't built in a way to preserve the hierarchy. The numbering 1.2.3 alludes to it, but the actual names of the levels are given in diagrams in the FGDC standard. Not sure if there's a(n easy) way to automatically pull out that information. Ideas are welcome.

Otherwise, I'll probably store that information in the code itself and access it as needed, for the sake of having it working soon with the current standard at least.


UPDATE: Committed to SVN.