Wednesday, August 19, 2009

some final updates

A couple of new functions added over the weekend:
  • rankDist() calculates the similarity distance between the first and second hits of one or more rankings. NOTE: This measure is only useful when the first hit is known to be accurate.
  • compareTerms() applies a filter algorithm to two terms, and calculates their similarity based on the result. Analogous to termRanker(), which compares a reference term against an entire term dictionary.
Other than that, the documentation is done and up; it's the README document under the 'documentation repository' link in the sidebar. It's in .doc format so that it can be modified.

And that's about my stopping point; it's mid-August and time to go back to my thesis. What a summer!

Friday, August 14, 2009

A RESULT!

Ran a comparison of two Darwin Core terms, MaximumElevationInMeters and MinimumElevationInMeters, against the FGDC standard. I chose these terms because they appear to be the only ones with obvious exact matches in FGDC: Altitude Maximum and Altitude Minimum.

The filter algorithm I used was:
  1. Augment each description with the WordNet similar-tos of the synsets of every word in that description.
  2. Add every word's synonyms to the description.
  3. Throw away all words with length < 6.
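The three filter steps above can be sketched in stand-alone form. The real filters use NLTK's WordNet interface; here a tiny hand-made lexicon stands in for WordNet, so the `SIMILAR` and `SYNONYMS` maps are purely illustrative:

```python
# Sketch of the three-step filter, with a toy lexicon standing in for WordNet.
# SIMILAR and SYNONYMS are illustrative stand-ins, not real WordNet data.
SIMILAR = {"maximum": ["greatest", "peak"], "elevation": ["altitude"]}
SYNONYMS = {"elevation": ["height"], "meters": ["metres"]}

def augment(description, lexicon):
    """Steps 1/2: append related words from the lexicon to the description."""
    out = list(description)
    for word in description:
        out.extend(lexicon.get(word, []))
    return out

def length_filter(description, min_len=6):
    """Step 3: throw away all words with length < min_len."""
    return [w for w in description if len(w) >= min_len]

desc = ["maximum", "elevation", "in", "meters"]
desc = augment(desc, SIMILAR)    # add similar-tos
desc = augment(desc, SYNONYMS)   # add synonyms
desc = length_filter(desc)       # discard words with length < 6
```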

This is the output that comes from a call to correspRank() (with the two rankings as input):
Top Rank: 1
Bottom Rank: 1
Range: 0
Median Rank: 1.0

Top Score: 0.369239797651
Bottom Score: 0.360379846875
Range: 0.00885995077593
Median Score: 0.364809822263

Note that mean and variance are not reported- I haven't written them in yet because my sample sizes so far are too small.

Let's look at the top 5 hits in both rankings:
MaximumElevationInMeters
  1. Altitude Maximum
  2. Altitude System Definition
  3. Altitude System Definition
  4. False Easting
  5. False Easting

MinimumElevationInMeters
  1. Altitude Minimum
  2. Altitude System Definition
  3. Altitude System Definition
  4. Altitude Maximum
  5. False Easting
Interestingly, Altitude Maximum makes it into the top five for MinimumElevationInMeters, but the reverse is not true.

One thing that's not so good about this result is that the filter/score/rank/assess process took about 141 seconds of processor time. Ouch!

what's new

Another list of updates!
  • Changed termRanker() and keyConvert() to recognize the fact that term-to-term similarity scores are not necessarily unique. Removed the option for a score-keyed dictionary as the output of termRanker(). However, keyConvert() can be used to change any ranking dictionary's keys to scores. In that case, the dictionary values are lists of information about one to several terms. These dictionaries can be converted back to rank- and path-keyed dictionaries using keyConvert().
  • Separated the calculation and reporting of statistics from the other functions in rankComp.py. These are now handled by getStats().
  • Modified cossim() to return 0 in the case that the filter functions remove all of the words of the description.
  • Added future directions and recommendations to the documentation file.
  • Changed callFilters() to call the functions directly out of a dictionary, rather than using an if...elif...elif... setup. This shortens the code considerably, and reduces the number of functions that have to be modified when a new filter is added. Wrote a function, buildFnDict(), that builds the dictionary. Its output is also used by filterHelp(), which is also much shorter now.
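The dictionary-dispatch pattern behind that change can be sketched as follows. The filter names and functions here are placeholders, not the real filters.py code:

```python
# Sketch of dictionary dispatch replacing an if...elif...elif... chain.
# null_filter/drop_short and the name strings are illustrative placeholders.
def null_filter(tokens):
    return tokens

def drop_short(tokens):
    return [t for t in tokens if len(t) >= 6]

def build_fn_dict():
    """Analogous to buildFnDict(): map user-visible names to functions."""
    return {"null": null_filter, "dropShort": drop_short}

def call_filters(tokens, filter_names):
    """Analogous to callFilters(): look each name up instead of branching."""
    fns = build_fn_dict()
    for name in filter_names:
        tokens = fns[name](tokens)
    return tokens
```

Adding a new filter then means adding one entry to the dictionary, and a help function can list the available filters by iterating over the same dictionary's keys.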

Tuesday, August 11, 2009

a couple of updates

Practiced the final presentation yesterday- I've made a few changes to the slides and documentation for tomorrow. They'll be available in the new 'Documentation Repository' link in the Related Links section soon.

Other than that, I'm in the middle of debugging some older stuff and some newer stuff.
  1. fixed termReader() so that it reads the last term in the XML file
  2. improved the efficiency of xmlEsc()
  3. IN PROGRESS- working on some problems in the ranking comparison functions.

Friday, August 7, 2009

Uh oh, the official Python documentation seems to be down!

What I did Today:
  • Practiced next week's presentation using Adobe Connect. Had some interesting adventures with the sound, and got some good feedback.
  • Removed special characters (I think they're regex special??) from the FGDC output, for Namrata's program, which uses jdom.
  • Started working on functions to help users compare the quality of the outputs of different algorithms, based on known sub-crosswalks. There will be one that looks at the clustering of related terms, one that looks at the positions of known matches, and a supporting function that corrects dictionary formatting when necessary.

Wednesday, August 5, 2009

some other changes

I've made a few other changes in the last couple of days:

Modifications to the FGDC Bio Standard (fgdcbio.txt)

A number of the terms had short names that didn't match up between the hierarchy definition and the standard definition. Because the names appeared multiple times in the hierarchy and once in the standard, I changed them in the standard. I've tried to make minimal changes to the standard (e.g. by coding around typos) because my copy probably isn't going to end up in wide distribution. But for matching terms between the two files, it is necessary that each term have the same short name in both. Here is a list of changes:
  • placek to placekt
  • taxonpr to taxonpro
  • orien to orienta
In readTerms(), removed the option to allow term names as keys in the dictionary- because this field is unique in neither FGDC nor EML. Currently, the only allowed keys are xpaths.

Deprecated paths(), dictReader() and dictWriter(). paths() constructs hierarchy paths from fgdc_hierarchy.xml, but writes them into a dictionary. If a term appears in multiple places in the hierarchy, the duplicates are overwritten- a major flaw in the function. dictReader() and dictWriter() only exist as helpers to paths().

Added getPaths(), a replacement for paths(), which constructs paths from fgdc_hierarchy and writes them into a list. Have not added support functions analogous to dictReader() and dictWriter() yet.

Added the following functions to produce output compatible with Namrata's program:
  • pathDesc(): Given a path and a dictionary (where path is a key in the dictionary), adds the values of every term in the path into the value of the path's final term.
  • expandDesc(): Applies pathDesc() to an entire dictionary. This is preferable to applying pathDesc() multiple times because the results are order-dependent.
  • dictToXML(): Writes the contents of a dictionary (with paths as keys) into an XML file in the same format as extract(). Requires an XML file in the same format as input, to supply the information missing from the dictionary (which can only hold two types of information at once).
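A minimal sketch of how pathDesc() and expandDesc() might work, based on the descriptions above. It assumes dictionary keys are xpath strings like "/a/b/c" and values are token lists; the real code may differ:

```python
# Hypothetical sketch of pathDesc()/expandDesc(); key format and helper
# names are assumptions, not the actual implementation.
def path_desc(path, d):
    """Merge the descriptions of every term on `path` into the final term's."""
    parts = path.strip("/").split("/")
    merged = []
    for i in range(len(parts)):
        prefix = "/" + "/".join(parts[: i + 1])
        merged.extend(d.get(prefix, []))
    return merged

def expand_desc(d):
    """Apply path_desc to every path, reading from a snapshot so that the
    (order-dependent) results don't feed into one another."""
    snapshot = dict(d)
    return {path: path_desc(path, snapshot) for path in d}
```

Working from a snapshot is what makes applying the expansion to the whole dictionary safe, whereas calling pathDesc() repeatedly on a dictionary that is being updated in place would make the results order-dependent.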
Moved filePrompt() to prep_fns.py.
Finally finished with the numerous small things that were plaguing the FGDC extractor. The code to write a new XML file with descriptions containing everything from the xpath is working.

I've added the output to SVN, will commit after organizing the code...

Tuesday, August 4, 2009

Sidetracks and Bug Fixes

Since Friday, I've been working on a new option for formatting the XML term file. The file I have right now associates each term with its description. But for purposes of compatibility with Namrata's output, I'm writing code to produce an alternate output, where each term contains its own description, as well as the description of every term above it in its hierarchy path.

Here are the (somewhat convoluted) steps to getting that output:
1. Extract the hierarchy paths from the hierarchy XML file (more on this in a minute).
2. Extract the FGDC terms as before.
3. Read them into a dictionary with key = 'PATH' and value = 'DESC'
4. Call expandDesc() to expand the descriptions to include everything from the path
5. Write to a new XML file with dictToXML()

This is how it would work, anyway, if there wasn't a giant problem with the first step.

As it turns out, this unearthed a bug in the original code. Originally, the hierarchy paths were read into a dictionary, whose keys were the short names of the terms. But of course, this caused multiple paths to be overwritten. So I've deprecated the paths() function and written getPaths(), which returns a list of ALL paths.

Now I just have to figure out how to associate the right path to the right term when writing to the XML file. And then it'll be back to doing ranking comparisons.

UPDATE: Since a term that appears in multiple hierarchy paths is really just multiple instances of the same term (i.e. each term is defined only once in the hierarchy file), the extract() function will make as many entries for that term as it has paths. This preserves the hierarchy structure, with xpath as a unique identifier.

UPDATE: The FGDC Bio Standard Definition and hierarchy files that I have don't match up. It looks like there are a number of valid paths in the hierarchy file that are not addressed in the standard definition.

Tuesday, July 28, 2009

Comparison of Rankings (Preliminary)

Just to expand on the ad hoc method I described yesterday- Here's a list of some term clusters I'll be checking out:

G-Ring Information
  • Data Set G-Polygon
  • Data Set G-Polygon Outer G-Ring
  • Data Set G-Polygon Exclusion G-Ring
  • G-Ring Point
  • G-Ring Latitude
  • G-Ring Longitude
  • G-Ring
Bounding Coordinates
  • Bounding Coordinates
  • West Bounding Coordinate
  • East Bounding Coordinate
  • North Bounding Coordinate
  • South Bounding Coordinate
  • Bounding Altitudes
  • Altitude Minimum
  • Altitude Maximum
  • Altitude Distance Units

Taxonomic Information
  • General Taxonomic Coverage
  • Taxonomic Classification
  • Taxon Rank Name
  • Taxon Rank Value
  • Applicable Common Name

All of these terms/concepts exist in both the FGDC Bio and EML standards.

Monday, July 27, 2009

Debugging Note
On Friday, I discovered Python's weirdness with mutable default arguments. Default arguments are given space in memory when their function is defined, not when it is called- so Python sets them once and never checks up on them again. Usually, that's fine... but when the default is mutable and its value is changed, it retains the changed value until such time as it's updated again. It acts like a secret global variable, and it's a very, very special thing to debug.
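A minimal demonstration of the gotcha, and the standard fix:

```python
# The classic symptom: the default list is created once, at def time,
# and every call that relies on the default shares the same object.
def remember(item, seen=[]):
    seen.append(item)
    return seen

first = remember("a")
second = remember("b")   # same list as `first` -- now ["a", "b"]

# The usual fix: use None as a sentinel and build a fresh list per call.
def remember_fixed(item, seen=None):
    if seen is None:
        seen = []
    seen.append(item)
    return seen
```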

Comparison of Filter Functions
I'm working on a preliminary/exploratory assessment of the rankings that are produced by filter functions. Since the metadata standards tend to be pretty long, I'm looking at clusters of related terms that have direct crosswalks between FGDC and EML. A better ranking will be characterized by the appearance of cluster members (a) close together and (b) near the top of the list.

Results to follow....

Thursday, July 23, 2009

a couple of new and important things

The other day, I finally got around to trying out my termReader() function on Namrata's file of EML terms. That prompted a re-write of the entire function, which is now much improved.

New termReader() capabilities:
  1. Reads XML files, regardless of whether the tags open on a new line. The old version of this function assumed that the formatting was much more strict.
  2. Gives the user the option of what information to use as keys in the returned dictionary. Previously, only the term names were used, but EML appears not to have unique term names. So the new version allows the use of hierarchy paths as keys- this option is set as the default.
In addition, because the original aim of this project was to take a combinatoric approach to developing a good ranking algorithm, I've added callFilters(), a new function called by termRanker(). It takes a list of filter functions passed into termRanker() as a parameter, and applies them to the descriptions in the term dictionary. I've also created a help function to go along with it, which lists the available filter functions and how to pass them into termRanker().

Tuesday, July 21, 2009

Modified tokens() in prep_fns.py so that it will correctly remove punctuation and XML escapes from the tokenized term descriptions.

Friday, July 17, 2009

I think the output of termRanker() is pretty flexible (see previous post). But it depends on what a user would do with the rankings. Here are a few things that someone might do with a term ranking:
  1. View it, or parts of it.
  2. Find a particular term to see its rank.
  3. Look at relative scores between terms.
  4. Refine it? This gets us into new and interesting territory.

Exciting News

I've written a function called termRanker() that takes a tokenized description and a dictionary of term names and their corresponding (tokenized) descriptions. It returns a dictionary whose keys are ranks and whose values are tuples consisting of (term name, term description, score). This tuple can be unpacked easily.

An Overview of termRanker's Inner Workings:
IN: description, dictionary of terms: descriptions
OUT: A dictionary of rank: (name, description, score)

1. Filter descriptions
  a. decide on filter(s) and how to compound them if necessary
  b. create a new dictionary with key = term name, value = filtered description
2. Compute similarity scores
  a. call the similarity() function
  b. create a new dictionary with key = score, value = term name
3. Produce a ranking
  a. pull up terms in order of decreasing score
  b. assign ranks
  c. store them in a dictionary with key = rank, value = (name, description, score)
4. Return the results

At the moment, termRanker() employs nullFilter() from filters.py, and cossim() from similarity.py. The null filter returns the descriptions without modification, so all rankings are based solely on word frequencies.
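The pipeline above can be sketched in simplified form. The real function uses filters.nullFilter() and similarity.cossim(); this stand-in approximates them with an identity filter and a cosine score over token counts:

```python
# Simplified sketch of the termRanker() pipeline; not the actual code.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists, via word counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values())) *
            math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def term_ranker(description, term_dict):
    """Return {rank: (name, description, score)}, best score first."""
    filtered = {name: desc for name, desc in term_dict.items()}  # null filter
    scored = sorted(filtered.items(),
                    key=lambda kv: cosine(description, kv[1]),
                    reverse=True)
    return {rank: (name, desc, cosine(description, desc))
            for rank, (name, desc) in enumerate(scored, start=1)}
```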

termRanker() works fairly well on descriptions that contain similar but distinctive words. For example, ranking the FGDC term 'access constraints' returns:
  1. access constraints
  2. metadata access constraints
  3. use constraints
  4. metadata use constraints
  5. security information
  6. taxonomic completeness
  7. ordering information
  8. etc....
The first five results were the same when I passed a paraphrase of the description of access constraints into termRanker(). It should be noted, however, that the descriptions of the first four hits are very similar in structure. Using other terms, for example 'metadata citation' as the reference term returns nonsense.

TO DO:
  • Add a function to prep_fns.py to undo escapes for XML
  • test termRanker() with different filter functions. This requires figuring out how to call functions using some sort of variable- since termRanker() will call different functions, they can't be stuck straight into the code.
  • Maybe look into using chunkers as filter functions.
  • Is there any good way to compare the quality of the rankings produced by different filters and filter combinations?

Wednesday, July 15, 2009

XML Update

Got the XML issue cleared up. When I was writing the function, I forgot to replace things like '<' and '>' with escape characters. I added a function to fgdc_extractor.py that takes a string and returns the same string, with all necessary characters escaped for XML. This is called by the extract() function before it writes anything to file.
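The post doesn't show the escaping function itself, but Python's standard library already provides the core of it, so it could be as thin as this wrapper (the name xml_esc is from a later post; the real code may differ):

```python
# Sketch of the escaping step using the standard library.
from xml.sax.saxutils import escape

def xml_esc(text):
    """Escape &, < and > (and both quote characters) for XML output."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})
```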

I sent the output to Namrata and she told me already that she can open it (The new XML file is in the SVN trunk as well, by the way).

More Filter Functions

I added six functions to filters.py. They retrieve the following related words from lemmas and/or synsets of the words in a term description:
  • hypernym distances
  • hypernyms
  • instance hypernyms
  • member holonyms
  • part holonyms
  • substance holonyms
As with yesterday's functions, these can be applied to either lemmas (e.g. words) or synsets. The lemma option performs very poorly most of the time, which is odd because synsets contain lemmas (I think). But rather than returning a subset of the synset option's output, the lemma option typically adds nothing at all to the description.

It's interesting that the WordNet online browser returns direct hypernyms and hyponyms, full hypernyms and hyponyms, inherited hypernyms and sister terms, etc. but the NLTK functions don't make those distinctions. Is there another Python library that has a better representation of WordNet than NLTK?

Tuesday, July 14, 2009

Today's commits

Here are the changes I committed to the SVN trunk today:

filters.py and similarity.py
  • Fixed the keywords.
fgdc_extractor.py
Minor bug fixes:
  • Bug in dictReader() that didn't open the file if the path was passed in as a parameter.
  • Bug in extract() that asked for path to hierarchy file even if a hierarchy dictionary is provided as a parameter.
filters.py
Added several WordNet-based functions, to retrieve the following types of related words for every word in a term description:
  • hyponyms
  • instance hyponyms
  • member meronyms
  • part meronyms
  • substance meronyms
  • entailments
  • attributes
  • causes
  • also-sees
  • verb groups
  • similar-tos
  • pertainyms
  • derivationally related forms
In each case, the results are appended to the description and the full list is returned. There is also an option to include the hyponyms/meronyms/etc for each word AND its synonyms. I've included this because there are often no results returned when only the word is considered. I'm not sure why that is- either WordNet is less rich than expected, or there's a very subtle error in the code.

TO DO
  • Have a look at the WordNet hierarchy.
  • Figure out why the XML output of fgdc_extractor.extract() is producing errors. Namrata has suggested that this may be due to the presence of '<' in the text.
  • Write filter functions to get hypernyms and holonyms.
  • Write a function to filter term descriptions and rank them relative to a given description.

Wednesday, July 8, 2009

Added to the SVN trunk:
  1. nltk_stopwords.txt- the list of stop words taken from the NLTK web site. A lot of them aren't applicable to our purposes (e.g. yourselves, alone, greetings) but some of them we may want to treat as useful words (e.g. contains, value).
  2. vdc_stopwords.txt- a more basic list, containing mostly obvious stop words (e.g. the, as, or) and FGDC-specific words (e.g. repeat, which only appears in "repeat as needed").
Links to the stop word lists can be found here:
https://code.ecoinformatics.org/code/vdc/projects/machlearn/trunk/

Modifications to prep_fns.py:
  1. Added readList(), a function that reads the lines of a file into a list. Primarily for use in reading files of stop words.
  2. Modified readTerms() to lowercase all input. This facilitates comparisons between tokens.
TO DO:
  1. Encapsulate some WordNet functions to use in processing term descriptions.
  2. Write a function that filters and compares descriptions, and returns a similarity ranking for the entire list (with respect to a single term).

Tuesday, July 7, 2009

some new code

I committed a couple of things to the svn repository:

1. similarity.py
Contains a function to calculate the cosine similarity between two token lists. These lists can
either be tokenized descriptions, or processed versions thereof.

2. filters.py
Contains a number of functions to process and modify tokenized descriptions- including two
stemming functions, a function to throw out stop words, etc.

similarity scoring

After doing a lot of reading, I've decided to use cosine similarity to quantify the similarity between lists of tokens.

Cosine similarity is characterized by the cosine of the angle between two vectors that represent the token lists.

e.g. If area and volume are defined as:
area: length multiplied by width
volume: length multiplied by width multiplied by height

we can create a vector for each description, with as many cells as unique words in the set of BOTH descriptions (length, multiplied, by, width, height). The value in each cell is the number of occurrences of that word in the description.

So we would get something like this:
area = [1, 1 , 1 , 1, 0]
volume = [1, 2, 2, 1, 1]

From here, the calculation of the cosine is straightforward (http://upload.wikimedia.org/math/a/f/3/af3da3b8fcced86a3e7c8cf4075ed94c.png).

For the current example, the result would be 0.9 (i.e. the descriptions are extremely similar).
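Worked through in code, the area/volume example gives:

```python
# Reproducing the area/volume example numerically.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# vector cells: (length, multiplied, by, width, height)
area = [1, 1, 1, 1, 0]
volume = [1, 2, 2, 1, 1]
score = cosine_similarity(area, volume)   # approximately 0.905
```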

Advantages of using Cosine Distance
  • Widely used in linguistics
  • Easy to implement
  • Flexible: Vectors can be defined in multiple ways, e.g. by deleting stop words, adding synonyms, weighting rare words, or keeping only lexical chains.
  • Similarity is based on content, not on order (If it becomes clear that word order is important in the similarity of descriptions, we can weight the similarity value with an order-dependent coefficient).
Limitations
  • Does not take word sense into account. But this is true of most, if not all, single-equation measures. If it becomes an issue, I think that word sense can be dealt with at the time of vector formation.
UPDATE:
1. Wrote a function to calculate cosine similarity given two token lists.
2. Eclipse is working a whole lot better than IDLE. No lagging at all yet.

Thursday, July 2, 2009

Notes from Teleconference

The Question
How do we quantify similarity between descriptions? A lot of the usual measures rely heavily on sequence order, but not on the meaning of sequence elements- which makes sense with DNA sequences and character strings, but not so much with lists of words.

Some Different Approaches From Various Fields

Translation (Between natural languages)
Uses documents with the same content in different languages in training sets. Since different languages have different word orders, there might be some similarity measures defined in these papers that are primarily content-based.
Look into ACL conference proceedings (ACLweb.org)

Document Classification
Uses word counts, semantic overlap to quantify similarity. This is done by stemming the words in each document, and throwing out the stop words (i.e. common words with little or no semantic content). The results are usually compared by cosine similarity.

caveat: each document should be large (whereas term descriptions are typically only a sentence or two.)

Description size could be increased by pulling in the synonyms of every word. But we'd have to be careful of what sense the original word is used in, and whether the synonyms are used in the same sense.

Latent Semantic Indexing
Looks for the main dimensions of variation, similar to PCA, eigenvalue methods. Uses a name-by-description matrix.
Look into how this works and if it's possible to implement in a short-term project.

More on Stop Words
Several lists of common stop words already exist. Typically, people subtract/add words as needed for the particular application (e.g. remove 'across' because it contains information about spatial relationships, add 'data' because all descriptions are about data).

WordNet
WordNet is a large thesaurus compiled by hand (NOT automated). It's structured in a hierarchy tree such that the synonyms, hypernyms, hyponyms, meronyms, etc. of a word can be located via tree traversals.
Can be used to build something similar to a latent semantic index. This would probably be error-prone though- e.g. the most common sense of a word isn't necessarily the one we want.
Is there a way to look at ontological distances between any two words in the hierarchy tree?
Look into ontology alignment.

Other Comments
  • Good parsers/taggers/chunkers are slow, especially when used on long sentences. Is that a problem here?
  • It's conceivable that term descriptions may not contain complete sentences- which will likely cause problems for most parsers. Perhaps there are resources that help determine whether a sentence is well-formed.
  • Given the length of this project, it would be easier to stick with unsupervised rather than semi-supervised learning algorithms.


Short Term Goals
Work on functions to handle stemming and word counts.
Look into ACL documents, latent semantic indexing, ontology alignment and the capabilities of WordNet.


Tuesday, June 30, 2009

Still having trouble with IDLE. It shouldn't constantly stop responding and take up 70% of the CPU when I just click on the window and start typing.

Looked into using Eclipse with PyDev; either it or I can't properly find the Python interpreter.

Looked into using Eric, but couldn't install one of the prerequisites because it appears that there's a piece of it missing.

It's possible to use the Unix terminal, but it doesn't like having multi-line scripts pasted into it.

Monday, June 29, 2009

A couple of things

This morning, I started thinking about how to generate a list of stop words. Looks like it can't be just by frequency though. Here are the 100 most common tokens appearing in the FGDC term descriptions:

['the', '.', 'of', ',', 'data', 'and', 'a', 'set', 'or', 'to', 'for', 'in', '(', 'as', 'is', '"', 'used', 'which', 'information', 'needed', 'Repeat', 'name', 'that', 'projection', '-', 'an', 'by', 'be', ')', 'on', ':', 'from', 'description', 'values', 'system', 'are', 'parameters', 'about', 'point', 'reference', 'time', 'metadata', 'contains', 'expressed', 'The', 'identification', 'coordinate', 'attribute', 'value', 'with', 'Rank', 'Taxon', 'date', 'measure', 'Information', 's', 'objects', 'map', 'any', 'Earth', 'surface', 'include', 'number', 'distance', 'accuracy', 'provided', 'file', 'field', 'longitude', 'can', 'latitude', 'event', 'means', 'geologic', 'restrictions', 'organization', 'other', 'Taxonomic', 'two', 'this', 'This', 'individual', 'Name', 'Value', 'For', 'between', 'Classification', 'use', 'example', 'not', ').', 'assigned', 'source', 'spatial', 'standard', 'measured', 'computer', 'period', 'using', 'line']

Some of these obviously are more useful to keep than others, and there's no clear break between frequent, useless words and rarer useful ones (e.g. 'time' appears early in the list but 'using' is in position 99).

Is it worth trying to automatically generate a list of stop words? Do such lists already exist? Or would it make sense to start with an ad hoc list and refine it as time goes on?
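A frequency list like the one above can be produced with the standard library alone; this sketch assumes the descriptions have already been tokenized:

```python
# Sketch of building a most-common-tokens list from tokenized descriptions.
from collections import Counter

def most_common_tokens(descriptions, n=100):
    """descriptions: iterable of token lists. Returns the n most frequent."""
    counts = Counter()
    for tokens in descriptions:
        counts.update(tokens)
    return [token for token, _ in counts.most_common(n)]
```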

-----

Recently, IDLE (the development environment packaged with python) has been maxing out my CPU, even right after a restart. I upgraded to the newest version of Python, discovered that NLTK isn't supported on it, and installed the last version that works with NLTK.

Unfortunately, re-installing NLTK puts it where only the older version (the one I was using prior to this mess) can access it. Not sure how to fix this problem- I can't find any uninstall option for any of these.

I'm also looking for a different development environment that won't tax the processor so badly, but haven't gotten anything up and running yet.

UPDATE: Got IDLE so it sees NLTK again. I can get back to work again! Except that IDLE still maxes out the CPU.

Friday, June 26, 2009

In terms of the overall project, my plan is to analyze the term descriptions using natural language tools, with the output of this process being two lists of one or more tokens. A comparison function would quantify the similarity between these lists.

There seem to be two main approaches: measuring the longest subsequence, and quantifying the overlap based on shared elements.

Since the token lists can be thought of as sequences made up of tokens, analogous to the way that strings are sequences of characters, I've been looking at a few string similarity measures. The most famous of those appear to be:

1. The Levenshtein (or Edit) Distance
Characterized by the minimum number of operations needed to transform one sequence into the other. Operations include insertion, deletion, and substitution of an element (a modified version also allows transposition).

2. Jaro-Winkler Distance
Characterized by the number of elements from one sequence that match elements in another sequence- in this case, the concept of matching includes not only the content of an element, but its position within the sequence. Sequences that match from the beginning are considered more similar than those that do not.
Works best on short sequences.

3. Dice Coefficient
Compares two sets. When strings are compared, the set elements are typically bigrams, rather than characters.

4. Jaccard Index:
Compares two sets by dividing their intersection by their union.

None of these jumps out as an ideal measure:
The Levenshtein Distance and Jaro-Winkler Distance emphasize the position of elements in the sequence- whereas what we're interested in is more the semantic relatedness of elements within the sequences. In addition, calculation of the Levenshtein distance appears to be somewhat time-consuming.

The Dice Coefficient and Jaccard Index both rely on sets, which by nature ignore duplicate elements- but the duplication of the same word in two descriptions may be a good indicator of similarity.

Perhaps it would make sense to modify one or more of these to create an applicable similarity measure or measures?
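One such modification is easy to sketch: Jaccard and Dice as defined above, plus a multiset (bag) variant of Jaccard that keeps duplicates, illustrating the set-based limitation:

```python
# Jaccard and Dice on sets, plus a multiset variant that counts duplicates.
from collections import Counter

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def dice(a, b):
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def jaccard_bag(a, b):
    """Multiset Jaccard: intersection/union of token *counts*."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / sum((ca | cb).values())
```

For example, jaccard(["data", "data", "set"], ["data", "set"]) is 1.0 because the repeated "data" is invisible to the set version, while jaccard_bag gives 2/3 for the same pair.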

I also had a look at the Lucene similarity class- it looks good, but it'd take a lot of reworking, because it's intended to match pre-indexed documents to a set of keywords, rather than to compare two lists of tokens.

Any ideas? What am I overlooking?

Wednesday, June 24, 2009

New function added:

readTerms()
reads terms from a .txt version of the XML format for terms, and returns a Python dictionary. The keys are always the term names, but the user may specify whether the values are the terms' description, hierarchy path, type, domain, or short name. It is also possible to specify that all term information be included in the value field.

The format of the XML file read by readTerms() is pasted from an e-mail from Dave.
NOTE: I've added extra tags for the FGDC-specific fields of type, domain and short name.




<terms>
  <term>
    <path>/simple/xpath/to/term</path>
    <description>Description of the term.</description>
    <type>type of term</type>
    <domain>domain over which the term is relevant</domain>
    <shortName>short name of term</shortName>
  </term>
  ...
</terms>

Tuesday, June 23, 2009

Finished the extract() function.
Its output can be found in my SVN trunk, in fgdc_files/fgdc_terms.txt

Steps to get this output:
1. save FGDC standard as a .txt document.
2. Run the paths() function to create a dictionary of each term and its hierarchy path.
3. Pass the dictionary to dictWriter(), to save it.
Or put it directly into extract(), as a parameter.
4. Run extract(). It will ask for the following file paths:
  • FGDC Standard
  • FGDC hierarchy
  • Where to put the output
  • URL of FGDC Standard

NOTE: Not all of the terms in fgdc_terms.txt have hierarchy paths, due to inconsistencies between the FGDC standard and the hierarchy XML file.

Saturday, June 20, 2009

Lately, when I've started IDLE (the python editor), it can't find the NLTK files to import. When I install the NLTK toolkit, IDLE can find it the first time or two, and then loses it.

I've installed NLTK twice already and am about to do it a third time. Any ideas?

Friday, June 19, 2009

New Additions

Wrote a couple of new functions for dealing with the FGDC hierarchy:

paths()
reads a .txt version of the XML hierarchy for the FGDC bio standard. Returns a dictionary containing the short name of each term as a keyword, and its path as a value.

The algorithm takes advantage of the fact that parent elements are defined before their children, and that each parent term contains references to each child. The file is read line by line; the root element is written into the dictionary with identical key and value. The paths of its children are created by appending each child's name to the parent's path. As each subsequent element is encountered, its path is looked up in the dictionary, and its children's hierarchy paths are added to the dictionary.
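The core idea can be sketched without the XML reading: because parents are seen before their children, each child's path can be formed by appending its name to a path that is already in the dictionary. The (parent, child) pairs and the "/" separator here are illustrative:

```python
# Sketch of the path-building idea behind paths(); edge format is assumed.
def build_paths(root, edges):
    """edges: list of (parent, child) pairs, parents defined before children."""
    paths = {root: root}
    for parent, child in edges:
        paths[child] = paths[parent] + "/" + child
    return paths
```

Note that keying the dictionary on short names is exactly what later caused duplicate paths to be overwritten (see the August 4 post), which is why getPaths() returns a list instead.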

dictWriter()
writes an existing dictionary (e.g. the pathway dictionary) to a text file, separating keys and elements with a user-defined delimiter (default is tab).

UPDATE: Added a directory to the SVN repository (in trunk) to hold the FGDC bio standard and related text files (including the output of paths() and dictWriter()).

Tuesday, June 16, 2009

Notes From Meeting with Bruce

  1. Looks like Namrata and I have slightly different parser output formats. Also, terms and descriptions are associated by context only: they're on adjacent lines. A three-column format might be better; XML would also be an improvement, since XML has existing parsers and the FGDC bio hierarchy is already in XML.
  2. Add more explicit documentation to the code, as well as commit comments.
  3. Similarity scores.
Similarity scores are numerical values that will be used to describe the similarity between either two descriptions or two term/description pairs. They'll be based on the number of tokens that appear in both descriptions.
There are a couple of issues involved in computing them:
  1. Longer descriptions are more likely to share more tokens.
  2. Descriptions may not be the same length.
  3. Stemming can affect the number of matches.
  4. So can stop words and stock words.
As far as description length is concerned, the score can be normalized by the arithmetic or geometric mean of the two description lengths (but which is better suited to this purpose?). A first-pass approach might be to sum the percentage of tokens from description A appearing in description B and the percentage of tokens from description B appearing in description A.
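That first-pass idea can be written down directly; the function name and sample tokens here are illustrative, not the project's actual code:

```python
def overlap_score(tokens_a, tokens_b):
    """Sum of the fraction of A's tokens found in B and the fraction
    of B's tokens found in A. Each half is normalized by that
    description's own length, so the total falls in [0, 2]."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    frac_a = sum(1 for t in tokens_a if t in set_b) / len(tokens_a)
    frac_b = sum(1 for t in tokens_b if t in set_a) / len(tokens_b)
    return frac_a + frac_b

a = "maximum elevation in meters".split()
b = "maximum altitude in meters".split()
print(overlap_score(a, b))   # 1.5
```

Stemming and stop-word removal (issues 3 and 4) would happen before this score is computed, so they change the token lists rather than the formula.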

TO DO:
  1. Read about XML.
  2. Read about SVN keywords.
  3. Similarity scores.

Monday, June 15, 2009

Progress...

Uploaded two new Python functions:

1. termReader(term_file, terms = {}) reads a text file with term/description pairs (format described below) and writes them into a dictionary. The terms parameter allows for the addition of terms to an existing dictionary.

2. tokens(terms) takes a dictionary and tokenizes the values using NLTK's WordPunctTokenizer(). It currently writes the tokenized lists back into the same dictionary; I'm planning to change that. I'll also try to think of a more apt name for this function.
(Question: Is context information lost when the strings are tokenized?)
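Here's a rough equivalent of the tokenizing step; the regex approximates NLTK's WordPunctTokenizer (runs of word characters or runs of punctuation), whereas the real code calls NLTK directly:

```python
import re

def tokens(terms):
    """Tokenize each description in place, splitting it into runs of
    word characters or runs of punctuation. Mutates the input dict,
    as the current code does."""
    for term, description in terms.items():
        terms[term] = re.findall(r"\w+|[^\w\s]+", description)
    return terms

d = {"Abstract": "A brief summary, one paragraph long."}
print(tokens(d)["Abstract"])
# ['A', 'brief', 'summary', ',', 'one', 'paragraph', 'long', '.']
```

As for the question above: token order survives, but the original spacing doesn't, so the exact source string can't be reconstructed from the token list alone.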

All in all, this is moving along a little slowly so far, because I haven't used Python in a long time: there's a lot of stopping to look things up. As I get a better feel for the language, chances are I'll come back to this early code and take advantage of more of its functionality, instead of looping through everything.

Friday, June 12, 2009

This week, I've been putting together some Python code to extract the fields and descriptions from the FGDC bio standard, which can be found here:

http://www.nbii.gov/images/uploaded/151871_1162508205217_BDP-workbook.doc

It takes terms in this format:

1.2.3 Term -- description blah blah blah
Type: type
Short Name: shortname

and writes a text file in this format:

term = "Term"
description = " description blah blah blah
Type: type
Short Name: shortname
"

This is the same output format Namrata uses, with the term name and description enclosed in quotes.

The bio elements in FGDC begin with BDP1.2.3 instead of 1.2.3, so I'll take that into account in the code before committing it to SVN.
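A hypothetical re-creation of that parsing step, with the BDP prefix handled by an optional group (the real code, and its treatment of multi-line descriptions, may differ):

```python
import re

# Matches e.g. '1.2.3 Term -- description' or 'BDP1.2.3 Term -- description'
HEADER = re.compile(r"^(?:BDP)?\d+(?:\.\d+)* (\w+) -- (.*)$")

def parse_entry(lines):
    """Return (term, description) for one entry, folding the
    Type / Short Name lines into the description."""
    match = HEADER.match(lines[0])
    term, first = match.group(1), match.group(2)
    return term, "\n".join([first] + lines[1:])

entry = ["BDP1.2.3 Term -- description blah blah blah",
         "Type: type",
         "Short Name: shortname"]
print(parse_entry(entry)[0])   # Term
```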


--------- HOWEVER --------
So far, I haven't written a way to preserve the hierarchy. The numbering 1.2.3 alludes to it, but the actual names of the levels are given only in diagrams of the FGDC standard. I'm not sure if there's a(n easy) way to pull that information out automatically. Ideas are welcome.

Otherwise, I'll probably store that information in the code itself and access it as needed, for the sake of having it working soon with the current standard at least.


UPDATE: Committed to SVN.

Monday, May 25, 2009

First post of the coding period....

Got a blog set up. I think it'll be interesting to keep the code log online, rather than a text file on my computer.

Here's what I've done so far:
1. Wrote an abstract for the VDC wiki.
2. Tried to get started with Subversion, with very little luck. Looks like the computer can't find the installed software.
3. Made a blog!
4. Learned how to spell Albuquerque.
5. Installed the Natural Language Toolkit

TO DO:
1. Keep trying to figure out Subversion
2. Remember how Python works (after 2 years of using R exclusively... hmm...)
3. Play with the Natural Language toolkit.
4. Get some scheduling done.
5. Work on presentation for Albuquerque.