VDC 09 Development Blog: August 2009

Wednesday, August 19, 2009

some final updates

A couple of new functions added over the weekend:

rankDist() calculates the similarity distance between the first and second hits of one or more rankings. NOTE: This measure is only useful when the first hit is known to be accurate.
compareTerms() applies a filter algorithm to two terms, and calculates their similarity based on the result. Analogous to termRanker(), which compares a reference term against an entire term dictionary.
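
To make the idea behind rankDist() a little more concrete, here is a minimal sketch rather than the actual code. It assumes a ranking is a list of (term, score) pairs sorted from best to worst; that layout and the names rank_dist and rank_dists are made up for this example.

```python
def rank_dist(ranking):
    """Return the score gap between the first and second hits.

    `ranking` is assumed to be a list of (term, score) pairs sorted
    from best to worst score (a hypothetical layout for this sketch).
    """
    if len(ranking) < 2:
        raise ValueError("need at least two hits to measure a distance")
    (_, first_score), (_, second_score) = ranking[0], ranking[1]
    return first_score - second_score


def rank_dists(rankings):
    """Apply rank_dist() to a dictionary of {reference term: ranking}."""
    return {ref: rank_dist(hits) for ref, hits in rankings.items()}
```

A large gap only says the first hit stands out from the rest, which is why the measure is only meaningful when the first hit is already known to be correct.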

Other than that, the documentation is done and up; it's the README document under the 'documentation repository' link in the sidebar. It's in .doc format so that it can be modified.

And that's about my stopping point; it's mid-August and time to go back to my thesis. What a summer!

Friday, August 14, 2009

A RESULT!

Ran a comparison of two Darwin Core terms, MaximumElevationInMeters and MinimumElevationInMeters, against the FGDC standard. I chose these terms because they appear to be the only ones with obvious exact matches in FGDC: Altitude Maximum and Altitude Minimum.

The filter algorithm I used was:

Augment each description with the WordNet similar-tos of the synsets of every word in that description.
Add every word's synonyms to the description.
Throw away all words with length < 6.
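
The three filter steps above can be sketched roughly as follows. This is an illustration only: the two small lookup tables stand in for real WordNet queries, their contents are invented, and the function names (add_related, drop_short, apply_filters) are hypothetical rather than the names used in the project.

```python
# Hypothetical stand-ins for WordNet lookups; the entries are invented.
SIMILAR_TOS = {"maximum": ["highest"]}
SYNONYMS = {"elevation": ["altitude", "height"]}


def add_related(words, table):
    """Append the related words (from `table`) of every word in the list."""
    out = list(words)
    for word in words:
        out.extend(table.get(word, []))
    return out


def drop_short(words, min_len=6):
    """Throw away all words shorter than min_len characters."""
    return [w for w in words if len(w) >= min_len]


def apply_filters(description):
    """Run the three steps in order on a plain-text description."""
    words = description.lower().split()
    words = add_related(words, SIMILAR_TOS)  # step 1: similar-tos
    words = add_related(words, SYNONYMS)     # step 2: synonyms
    return drop_short(words)                 # step 3: length < 6 removed
```

Because the augmentation happens before the length cut, short words can still contribute longer related words to the final description.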

This is the output that comes from a call to correspRank() (with the two rankings as input):
Top Rank: 1
Bottom Rank: 1
Range: 0
Median Rank: 1.0

Top Score: 0.369239797651
Bottom Score: 0.360379846875
Range: 0.00885995077593
Median Score: 0.364809822263

Note that mean and variance are not reported; I haven't written them in yet because my sample sizes so far are too small.
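
For reference, the four numbers in each group can be reproduced with a few lines of Python. This is a minimal sketch and not the project's own statistics code; the function name summarize and its higher_is_better flag are assumptions made for the example (a low rank is "top", while a high score is "top").

```python
from statistics import median


def summarize(values, higher_is_better):
    """Return top, bottom, range and median for a list of ranks or scores."""
    ordered = sorted(values, reverse=higher_is_better)
    top, bottom = ordered[0], ordered[-1]
    return {
        "top": top,
        "bottom": bottom,
        "range": abs(top - bottom),
        "median": float(median(values)),
    }


# The two ranks and two scores reported in this post.
rank_stats = summarize([1, 1], higher_is_better=False)
score_stats = summarize([0.369239797651, 0.360379846875], higher_is_better=True)
```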

Let's look at the top 5 hits in both rankings:
MaximumElevationInMeters

Altitude Maximum
Altitude System Definition
Altitude System Definition
False Easting
False Easting

MinimumElevationInMeters

Altitude Minimum
Altitude System Definition
Altitude System Definition
Altitude Maximum
False Easting

Interestingly, Altitude Maximum makes it into the top five for MinimumElevationInMeters, but the reverse is not true.

One thing that's not so good about this result is that the filter/score/rank/assess process took about 141 seconds of processor time. Ouch!

what's new

Another list of updates!

Changed termRanker() and keyConvert() to recognize the fact that term-to-term similarity scores are not necessarily unique. Removed the option for a score-keyed dictionary as the output of termRanker(). However, keyConvert() can be used to change any ranking dictionary's keys to scores. In that case, the dictionary values are lists of information about one to several terms. These dictionaries can be converted back to rank- and path-keyed dictionaries using keyConvert().
Separated the calculation and reporting of statistics from the other functions in rankComp.py. These are now handled by getStats().
Modified cossim() to return 0 in the case that the filter functions remove all of the words of the description.
Added future directions and recommendations to the documentation file.
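
To show why the empty-description case needs special handling in a cosine similarity, here is a minimal sketch. It assumes descriptions are plain lists of words compared by raw word counts, which may not match the weighting the real cossim() uses; the name cossim_sketch is hypothetical.

```python
import math
from collections import Counter


def cossim_sketch(words_a, words_b):
    """Cosine similarity between two bags of words; 0.0 if either is empty."""
    # Without this guard, an empty description gives a zero-length vector
    # and the division below would raise ZeroDivisionError.
    if not words_a or not words_b:
        return 0.0
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)
```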

Changed callFilters() to call the functions directly out of a dictionary, rather than using an if...elif...elif... setup. This shortens the code considerably, and reduces the number of functions that have to be modified when a new filter is added. Wrote a function, buildFnDict, that builds the dictionary. Its output is also used by filterHelp(), which is also much shorter now.
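
The dictionary-dispatch pattern looks roughly like this. It is a sketch under assumptions: the two example filters, the keys "short" and "lower", and the snake_case function names are all invented for illustration and are not the real filter set.

```python
def drop_short(words):
    """Remove words shorter than six characters."""
    return [w for w in words if len(w) >= 6]


def lowercase(words):
    """Convert every word to lower case."""
    return [w.lower() for w in words]


def build_fn_dict():
    """Map filter names to filter functions; new filters are added here only."""
    return {"short": drop_short, "lower": lowercase}


def call_filters(words, names, fn_dict=None):
    """Apply the named filters in order by looking each one up in the dictionary."""
    fn_dict = fn_dict or build_fn_dict()
    for name in names:
        words = fn_dict[name](words)
    return words


def filter_help(fn_dict=None):
    """Build the help text from the same dictionary, using each docstring."""
    fn_dict = fn_dict or build_fn_dict()
    return "\n".join(f"{name}: {fn.__doc__}" for name, fn in sorted(fn_dict.items()))
```

With this layout, adding a filter means writing the function and adding one dictionary entry; neither the caller nor the help function needs to change.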

Tuesday, August 11, 2009

a couple of updates

Practiced the final presentation yesterday, and I've made a few changes to the slides and documentation for tomorrow. They'll be available in the new 'Documentation Repository' link in the Related Links section soon.

Other than that, I'm in the middle of debugging some older stuff and some newer stuff.

fixed termReader() so that it reads the last term in the XML file
improved the efficiency of xmlEsc()
IN PROGRESS: working on some problems in the ranking comparison functions.

Friday, August 7, 2009

Uh oh, the official Python documentation seems to be down!

What I did Today:

Practiced next week's presentation using Adobe Connect. Had some interesting adventures with the sound, and got some good feedback.
Removed special characters (I think they're regex special??) from the FGDC output, for Namrata's program, which uses jdom.
Started working on functions to help users compare the quality of the outputs of different algorithms, based on known sub-crosswalks. There will be one that looks at the clustering of related terms, one that looks at the positions of known matches, and a supporting function that corrects dictionary formatting when necessary.

Wednesday, August 5, 2009

some other changes

I've made a few other changes in the last couple of days:

Modifications to the FGDC Bio Standard (fgdcbio.txt): A number of the terms had short names that didn't match up between the hierarchy definition and the standard definition. Because the names appeared multiple times in the hierarchy and once in the standard, I changed them in the standard. I've tried to make minimal changes to the standard (e.g. by coding around typos) because my copy probably isn't going to end up in wide distribution. But for matching terms between the two files, it is necessary that each term has the same short name in both. Here is a list of changes:

placek to placekt
taxonpr to taxonpro
orien to orienta

In readTerms(), removed the option to allow term names as keys in the dictionary, because this field is unique in neither FGDC nor EML. Currently, the only allowed keys are xpaths.

Deprecated paths(), dictReader() and dictWriter(). paths() constructs hierarchy paths from fgdc_hierarchy.xml, but writes them into a dictionary. If a term appears in multiple places in the hierarchy, the duplicates are overwritten, which is a major flaw in the function. dictReader() and dictWriter() only exist as helpers to paths().

Added getPaths(), a replacement for paths(), which constructs paths from fgdc_hierarchy and writes them into a list. Have not added support functions analogous to dictReader() and dictWriter() yet.
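
The difference between the two approaches can be shown with a small example. This is only a sketch: the sample XML and its element names are invented and do not reflect the real layout of fgdc_hierarchy.xml, and get_paths is a hypothetical stand-in for getPaths().

```python
import xml.etree.ElementTree as ET

# Invented hierarchy in which "contact" and "name" appear in two places.
SAMPLE = """
<standard>
  <sectionA>
    <contact><name/></contact>
  </sectionA>
  <sectionB>
    <contact><name/></contact>
  </sectionB>
</standard>
"""


def get_paths(element, prefix=""):
    """Return a list with the path of this element and of every descendant."""
    path = f"{prefix}/{element.tag}"
    paths = [path]
    for child in element:
        paths.extend(get_paths(child, path))
    return paths


root = ET.fromstring(SAMPLE)
all_paths = get_paths(root)

# Keying by short name keeps only the last path seen for each name,
# which is the overwriting problem described above.
by_short_name = {p.rsplit("/", 1)[-1]: p for p in all_paths}
```

Here the list holds seven paths, while the dictionary keyed by short name keeps only five entries, silently dropping the sectionA copies of contact and name.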

Added the following functions to produce output compatible with Namrata's program:

pathDesc(): Given a path and a dictionary (where path is a key in the dictionary), adds the values of every term in the path into the value of the path's final term.

expandDesc(): Applies pathDesc() to an entire dictionary. This is preferable to applying pathDesc() multiple times because the results are order-dependent.

dictToXML(): Writes the contents of a dictionary (with paths as keys) into an XML file in the same format as extract(). Requires an XML file in the same format as input, to supply the information missing from the dictionary (which can only hold two types of information at once).
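
The order-dependence issue is easier to see in code. The sketch below is an assumption-laden illustration, not the real pathDesc() and expandDesc(): it treats paths as '/'-separated strings, joins descriptions with a space, and returns new values instead of modifying the dictionary in place. The names ancestors, path_desc and expand_desc are hypothetical.

```python
def ancestors(path):
    """Yield every prefix of path, from the root down to path itself."""
    parts = path.strip("/").split("/")
    for i in range(1, len(parts) + 1):
        yield "/" + "/".join(parts[:i])


def path_desc(path, descs):
    """Return the descriptions of every term along path, joined in order."""
    return " ".join(descs[p] for p in ancestors(path) if p in descs)


def expand_desc(descs):
    """Expand every description using only the original, unexpanded values."""
    # Building a new dictionary means no entry ever sees an already-expanded
    # parent, so the result does not depend on the order of the keys.
    return {path: path_desc(path, descs) for path in descs}
```

If the dictionary were instead updated in place one path at a time, expanding a parent before its child would copy the parent's already-expanded text into the child, duplicating the grandparent's description.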

Moved filePrompt() to prep_fns.py.

Finally finished with the numerous small things that were plaguing the FGDC extractor. The code to write a new XML file with descriptions containing everything from the xpath is working.

I've added the output to SVN, will commit after organizing the code...

Tuesday, August 4, 2009

Sidetracks and Bug Fixes

Since Friday, I've been working on a new option for formatting the XML term file. The file I have right now associates each term with its description. But for purposes of compatibility with Namrata's output, I'm writing code to produce an alternate output, where each term contains its own description, as well as the description of every term above it in its hierarchy path.

Here are the (somewhat convoluted) steps to getting that output:
1. Extract the hierarchy paths from the hierarchy XML file (more on this in a minute).
2. Extract the FGDC terms as before.
3. Read them into a dictionary with key = 'PATH' and value = 'DESC'
4. Call expandDesc() to expand the descriptions to include everything from the path
5. Write to a new XML file with dictToXML()
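
As a rough picture of the last step, here is a sketch of turning a path-keyed dictionary into XML with the standard library. The element names (terms, term, path, desc) are invented and are not the format extract() actually produces; the sketch also returns a string instead of writing a file, and it does not merge in the extra fields that dictToXML() takes from an input XML file.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(descs):
    """Serialize {path: description} as XML (hypothetical element names)."""
    root = ET.Element("terms")
    for path, desc in sorted(descs.items()):
        term = ET.SubElement(root, "term")
        ET.SubElement(term, "path").text = path
        # ElementTree escapes characters such as < and & on output.
        ET.SubElement(term, "desc").text = desc
    return ET.tostring(root, encoding="unicode")
```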

This is how it would work, anyway, if there weren't a giant problem with the first step.

As it turns out, this unearthed a bug in the original code. Originally, the hierarchy paths were read into a dictionary, whose keys were the short names of the terms. But of course, this caused multiple paths to be overwritten. So I've deprecated the paths() function and written getPaths(), which returns a list of ALL paths.

Now I just have to figure out how to associate the right path with the right term when writing to the XML file. And then it'll be back to doing ranking comparisons.

UPDATE: A term that shows up in multiple hierarchy paths appears to be multiple instances of the same term (i.e. each term is defined only once in the hierarchy file), so the extract() function will make as many entries for that term as it has paths. This preserves the hierarchy structure, with xpath as a unique identifier.

UPDATE: The FGDC Bio Standard Definition and hierarchy files that I have don't match up. It looks like there are a number of valid paths in the hierarchy file that are not addressed in the standard definition.
