Tuesday, July 28, 2009

Comparison of Rankings (Preliminary)

Just to expand on the ad hoc method I described yesterday, here's a list of some term clusters I'll be checking out:

G-Ring Information
  • Data Set G-Polygon
  • Data Set G-Polygon Outer G-Ring
  • Data Set G-Polygon Exclusion G-Ring
  • G-Ring Point
  • G-Ring Latitude
  • G-Ring Longitude
  • G-Ring

Bounding Coordinates
  • Bounding Coordinates
  • West Bounding Coordinate
  • East Bounding Coordinate
  • North Bounding Coordinate
  • South Bounding Coordinate
  • Bounding Altitudes
  • Altitude Minimum
  • Altitude Maximum
  • Altitude Distance Units

Taxonomic Information
  • General Taxonomic Coverage
  • Taxonomic Classification
  • Taxon Rank Name
  • Taxon Rank Value
  • Applicable Common Name

All of these terms/concepts exist in both the FGDC Bio and EML standards.

Monday, July 27, 2009

Debugging Note
On Friday, I discovered Python's weirdness with mutable default arguments. Default arguments are given space in memory when their function is defined, not when it is called, so Python sets them once and never checks up on them again. Usually, that's fine... but when the default is mutable and its value is changed, it retains the changed value until such time as it's updated again. It acts like a secret global variable, and it's a very, very special thing to debug.
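A minimal illustration of the trap (the function names are mine, not from the project code):

```python
def append_item(item, bucket=[]):   # the default list is allocated once, at definition time
    bucket.append(item)
    return bucket

append_item(1)   # [1]
append_item(2)   # [1, 2] -- the "empty" default still remembers the first call

# The usual fix: use None as a sentinel and build a fresh list per call
def append_item_fixed(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

append_item_fixed(1)  # [1]
append_item_fixed(2)  # [2]
```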

Comparison of Filter Functions
I'm working on a preliminary/exploratory assessment of the rankings that are produced by filter functions. Since the metadata standards tend to be pretty long, I'm looking at clusters of related terms that have direct crosswalks between FGDC and EML. A better ranking will be characterized by the appearance of cluster members (a) close together and (b) near the top of the list.
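To make criteria (a) and (b) concrete, a quick-and-dirty score could combine the mean position of the cluster members with their spread. A sketch (cluster_quality and the sample data are hypothetical, not project code):

```python
def cluster_quality(ranking, cluster):
    """Lower is better on both counts: mean position (near the top) and spread (close together)."""
    positions = [ranking.index(term) for term in cluster if term in ranking]
    if not positions:
        return None
    mean_rank = sum(positions) / len(positions)   # (b) how near the top the cluster sits
    spread = max(positions) - min(positions)      # (a) how tightly the cluster is packed
    return mean_rank, spread

ranking = ["west bounding coordinate", "east bounding coordinate",
           "taxon rank name", "north bounding coordinate"]
cluster = {"west bounding coordinate", "east bounding coordinate",
           "north bounding coordinate"}
cluster_quality(ranking, cluster)  # mean position 4/3, spread 3
```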

Results to follow....

Thursday, July 23, 2009

A Couple of New and Important Things

The other day, I finally got around to trying out my termReader() function on Namrata's file of EML terms. That prompted a re-write of the entire function, which is now much improved.

New termReader() capabilities:
  1. Reads XML files, regardless of whether the tags open on a new line. The old version of this function assumed that the formatting was much more strict.
  2. Gives the user the option of what information to use as keys in the returned dictionary. Previously, only the term names were used, but EML appears not to have unique term names. So the new version allows the use of hierarchy paths as keys; this option is set as the default.
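The hierarchy-path idea can be sketched with ElementTree; this is an illustrative stand-in, not the actual termReader() (whose name, signature, and parsing surely differ):

```python
import xml.etree.ElementTree as ET

def read_terms(xml_text, use_paths=True):
    """Map each element's text to a key: its hierarchy path, or just its tag name."""
    root = ET.fromstring(xml_text)
    terms = {}
    def walk(elem, path):
        here = path + "/" + elem.tag
        key = here if use_paths else elem.tag
        if elem.text and elem.text.strip():
            terms[key] = elem.text.strip()
        for child in elem:
            walk(child, here)
    walk(root, "")
    return terms

xml_text = ("<eml><dataset><title>desc A</title></dataset>"
            "<citation><title>desc B</title></citation></eml>")
read_terms(xml_text)                   # path keys keep both 'title' entries distinct
read_terms(xml_text, use_paths=False)  # name keys collide; only one 'title' survives
```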
In addition, because the original aim of this project was to take a combinatoric approach to developing a good ranking algorithm, I've added callFilters(), a new function called by termRanker(). It takes the list of filter functions passed into termRanker() as a parameter and applies them to the descriptions in the term dictionary. I've also created a help function to go along with it, which lists the available filter functions and how to pass them into termRanker().
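A rough sketch of how chaining a list of filter functions might work (all names and signatures here are illustrative, not the actual filters.py or callFilters() code):

```python
STOPWORDS = {"the", "of", "a"}

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stopword_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def call_filters(term_dict, filters):
    """Apply each filter, in order, to every tokenized description."""
    filtered = {}
    for name, tokens in term_dict.items():
        for f in filters:
            tokens = f(tokens)
        filtered[name] = tokens
    return filtered

terms = {"access constraints": ["Restrictions", "on", "the", "data"]}
call_filters(terms, [lowercase_filter, stopword_filter])
# {'access constraints': ['restrictions', 'on', 'data']}
```

Because Python functions are first-class values, the filters can simply be collected in a list and looped over; nothing needs to be hard-coded into the ranking function.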

Tuesday, July 21, 2009

Modified tokens() in prep_fns.py so that it will correctly remove punctuation and XML escapes from the tokenized term descriptions.
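For the record, that cleanup amounts to something like the following; this is a stand-in, not the actual tokens() from prep_fns.py:

```python
import re
from html import unescape  # handles XML escapes like &amp; and &lt;

def tokens(text):
    """Undo XML escapes, strip punctuation, lowercase, and split on whitespace."""
    text = unescape(text)                  # '&amp;' -> '&', '&lt;' -> '<'
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation (and unescaped symbols) -> spaces
    return text.lower().split()

tokens("Constraints &amp; restrictions, e.g. access!")
# ['constraints', 'restrictions', 'e', 'g', 'access']
```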

Friday, July 17, 2009

I think the output of termRanker() is pretty flexible (see previous post). But it depends on what a user would do with the rankings. Here are a few things that someone might do with a term ranking:
  1. View it, or parts of it.
  2. Find a particular term to see its rank.
  3. Look at relative scores between terms.
  4. Refine it? This gets us into new and interesting territory.
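Items 1-3 are straightforward given termRanker()'s output format of {rank: (name, description, score)} (the sample data and scores here are made up):

```python
# A hypothetical ranking in termRanker()'s output format
ranking = {
    1: ("access constraints", ["restrictions", "on", "access"], 0.92),
    2: ("use constraints", ["restrictions", "on", "use"], 0.75),
    3: ("security information", ["handling", "restrictions"], 0.41),
}

# 1. View the top of the list
top_two = [ranking[r][0] for r in sorted(ranking)[:2]]

# 2. Find a particular term's rank by inverting the dictionary
rank_of = {name: rank for rank, (name, _desc, _score) in ranking.items()}
rank_of["security information"]  # 3

# 3. Compare relative scores between neighboring terms
gap = ranking[1][2] - ranking[2][2]
```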

Exciting News

I've written a function called termRanker() that takes a tokenized description and a dictionary of term names and their corresponding (tokenized) descriptions. It returns a dictionary whose keys are ranks and whose values are tuples consisting of (term name, term description, score). This tuple can be unpacked easily.

An Overview of termRanker's Inner Workings:
IN: a description, plus a dictionary of term names and their descriptions
OUT: a dictionary of rank: (name, description, score)

  1. Filter descriptions
     a. Decide on filter(s), and how to compound them if necessary
     b. Create a new dictionary with key = term name, value = filtered description
  2. Compute similarity scores
     a. Call the similarity() function
     b. Create a new dictionary with key = score, value = term name
  3. Produce a ranking
     a. Pull up terms in order of decreasing score
     b. Assign ranks
     c. Store the results in a dictionary with key = rank, value = (name, description, score)
  4. Return the results
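The steps above, condensed into a runnable sketch (the helper names and the Jaccard-overlap stand-in for the similarity function are mine, not the project's):

```python
def null_filter(desc):
    return desc

def overlap(a, b):
    """Stand-in similarity: Jaccard overlap of token sets."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def term_ranker(description, term_dict, filt, sim):
    """Sketch of the four steps; the real termRanker() may differ."""
    filtered = {name: filt(desc) for name, desc in term_dict.items()}  # 1. filter descriptions
    scored = sorted(((sim(description, d), name)                       # 2. compute scores
                     for name, d in filtered.items()), reverse=True)   # 3. order by decreasing score
    return {rank: (name, term_dict[name], score)                       # 4. return {rank: (name, desc, score)}
            for rank, (score, name) in enumerate(scored, start=1)}

terms = {"access constraints": ["restrictions", "on", "access"],
         "ordering information": ["how", "to", "order"]}
ranks = term_ranker(["restrictions", "on", "access"], terms, null_filter, overlap)
```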

At the moment, termRanker() employs nullFilter() from filters.py, and cossim() from similarity.py. The null filter returns the descriptions without modification, so all rankings are based solely on word frequencies.

termRanker() works fairly well on descriptions that contain similar but distinctive words. For example, ranking the FGDC term 'access constraints' returns:
  1. access constraints
  2. metadata access constraints
  3. use constraints
  4. metadata use constraints
  5. security information
  6. taxonomic completeness
  7. ordering information
  8. etc....
The first five results were the same when I passed a paraphrase of the access constraints description into termRanker(). It should be noted, however, that the descriptions of the first four hits are very similar in structure. Using other terms as the reference, for example 'metadata citation', returns nonsense.

TO DO:
  • Add a function to prep_fns.py to undo escapes for XML
  • Test termRanker() with different filter functions. This requires figuring out how to call functions through some sort of variable; since termRanker() will call different functions, they can't be stuck straight into the code.
  • Maybe look into using chunkers as filter functions.
  • Is there any good way to compare the quality of the rankings produced by different filters and filter combinations?

Wednesday, July 15, 2009

XML Update

Got the XML issue cleared up. When I was writing the function, I forgot to replace things like '<' and '>' with escape characters. I added a function to fgdc_extractor.py that takes a string and returns the same string, with all necessary characters escaped for XML. This is called by the extract() function before it writes anything to file.
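The escaping step presumably amounts to something like the following; the function name is illustrative, and Python's standard library already provides the replacement logic:

```python
from xml.sax.saxutils import escape

def xml_escape(text):
    """Escape &, <, > (and quotes) so the string is safe to write into XML."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})

xml_escape('constraints on <access> & "use"')
# 'constraints on &lt;access&gt; &amp; &quot;use&quot;'
```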

I sent the output to Namrata, and she has already told me that she can open it. (The new XML file is in the SVN trunk as well, by the way.)

More Filter Functions

I added six functions to filters.py. They retrieve the following related words from lemmas and/or synsets of the words in a term description:
  • hypernym distances
  • hypernyms
  • instance hypernyms
  • member holonyms
  • part holonyms
  • substance holonyms
As with yesterday's functions, these can be applied to either lemmas (i.e. words) or synsets. The lemma option performs very poorly most of the time, which at first seems odd because synsets contain lemmas. The likely explanation is that WordNet stores semantic relations like hypernymy and holonymy between synsets, not between individual lemmas, so the lemma-level calls usually come back empty. Rather than returning a subset of the synset option's output, the lemma option typically adds nothing at all to the description.

It's interesting that the WordNet online browser returns direct hypernyms and hyponyms, full hypernyms and hyponyms, inherited hypernyms, sister terms, etc., but the NLTK functions don't make those distinctions. Is there another Python library with a better representation of WordNet than NLTK?