Tuesday, July 28, 2009

Comparison of Rankings (Preliminary)

Just to expand on the ad hoc method I described yesterday, here's a list of some term clusters I'll be checking out:

G-Ring Information
  • Data Set G-Polygon
  • Data Set G-Polygon Outer G-Ring
  • Data Set G-Polygon Exclusion G-Ring
  • G-Ring Point
  • G-Ring Latitude
  • G-Ring Longitude
  • G-Ring
Bounding Coordinates
  • Bounding Coordinates
  • West Bounding Coordinate
  • East Bounding Coordinate
  • North Bounding Coordinate
  • South Bounding Coordinate
  • Bounding Altitudes
  • Altitude Minimum
  • Altitude Maximum
  • Altitude Distance Units

Taxonomic Information
  • General Taxonomic Coverage
  • Taxonomic Classification
  • Taxon Rank Name
  • Taxon Rank Value
  • Applicable Common Name

All of these terms/concepts exist in both the FGDC Bio and EML standards.

Monday, July 27, 2009

Debugging Note
On Friday, I discovered Python's weirdness with mutable default arguments. Default arguments are given space in memory when their function is defined, not when it is called, so Python sets them once and never checks up on them again. Usually that's fine... but when the default is mutable and its value is changed, it retains the changed value across calls until it's changed again. It acts like a secret global variable, and it's a very, very special thing to debug.
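A minimal sketch of the pitfall, plus the usual fix (defaulting to None and building the list inside the function):

def append_item(item, bucket=[]):
    # bucket is created ONCE, when the def statement runs
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] <- the "default" remembers the previous call

def append_item_fixed(item, bucket=None):
    # the fix: default to None and create a fresh list on every call
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

print(append_item_fixed(1))  # [1]
print(append_item_fixed(2))  # [2]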

Comparison of Filter Functions
I'm working on a preliminary/exploratory assessment of the rankings that are produced by filter functions. Since the metadata standards tend to be pretty long, I'm looking at clusters of related terms that have direct crosswalks between FGDC and EML. A better ranking will be characterized by the appearance of cluster members (a) close together and (b) near the top of the list.
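As a rough way to quantify (a) and (b), something like the sketch below could work. The function name and return format are just illustrative, not anything committed to the trunk:

def cluster_placement(ranked_names, cluster):
    # positions (1 = top of the list) at which cluster members appear
    positions = [rank for rank, name in enumerate(ranked_names, 1)
                 if name in cluster]
    if not positions:
        return None
    mean_rank = float(sum(positions)) / len(positions)
    spread = max(positions) - min(positions)
    # a better ranking gives a lower mean rank and a smaller spread
    return (mean_rank, spread)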

Results to follow....

Thursday, July 23, 2009

a couple of new and important things

The other day, I finally got around to trying out my termReader() function on Namrata's file of EML terms. That prompted a rewrite of the entire function, which is now much improved.

New termReader() capabilities:
  1. Reads XML files, regardless of whether the tags open on a new line. The old version of this function assumed that the formatting was much more strict.
  2. Gives the user the option of what information to use as keys in the returned dictionary. Previously, only the term names were used, but EML appears not to have unique term names. So the new version allows the use of hierarchy paths as keys; this option is set as the default.
In addition, because the original aim of this project was to take a combinatoric approach to developing a good ranking algorithm, I've added callFilters(), a new function called by termRanker(). It takes a list of filter functions passed into termRanker() as a parameter and applies them to the descriptions in the term dictionary. I've also created a help function to go along with it, which lists the available filter functions and how to pass them into termRanker().
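Since Python functions are first-class objects, the mechanics are simple. A minimal sketch of the idea behind callFilters() follows; the actual signature in the trunk may differ, and the filter names in the comment are made up:

def callFilters(filters, term_dict):
    # apply each filter, in order, to every description in the dictionary
    filtered = {}
    for key, description in term_dict.items():
        for filt in filters:
            description = filt(description)
        filtered[key] = description
    return filtered

# e.g. passing [stopFilter, stemFilter] into termRanker() would hand the
# list straight to callFilters()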

Tuesday, July 21, 2009

Modified tokens() in prep_fns.py so that it will correctly remove punctuation and XML escapes from the tokenized term descriptions.
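For reference, a sketch of the kind of cleanup involved; this is not the actual tokens() code, but xml.sax.saxutils.unescape handles the standard XML escapes nicely:

import re
from xml.sax.saxutils import unescape

def clean(tokens):
    cleaned = []
    for tok in tokens:
        tok = unescape(tok)                # '&lt;' -> '<', '&amp;' -> '&'
        tok = re.sub(r'[^\w\s]', '', tok)  # strip punctuation
        if tok:
            cleaned.append(tok)
    return cleaned

print(clean(['&lt;metadata&gt;', 'constraints,', 'access']))
# ['metadata', 'constraints', 'access']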

Friday, July 17, 2009

I think the output of termRanker() is pretty flexible (see previous post). But it depends on what a user would do with the rankings. Here are a few things that someone might do with a term ranking:
  1. View it, or parts of it.
  2. Find a particular term to see its rank.
  3. Look at relative scores between terms.
  4. Refine it? This gets us into new and interesting territory.

Exciting News

I've written a function called termRanker() that takes a tokenized description and a dictionary of term names and their corresponding (tokenized) descriptions. It returns a dictionary whose keys are ranks and whose values are tuples consisting of (term name, term description, score). This tuple can be unpacked easily.

An Overview of termRanker's Inner Workings:
IN: description, dictionary of terms: descriptions
OUT: a dictionary of rank: (name, description, score)

1. Filter descriptions
   a. Decide on filter(s) and how to compound them if necessary
   b. Create a new dictionary with key = term name, value = filtered description
2. Compute similarity scores
   a. Call the similarity() function
   b. Create a new dictionary with key = score, value = term name
3. Produce a ranking
   a. Pull up terms in order of decreasing score
   b. Assign ranks
   c. Store them in a dictionary with key = rank, value = (name, description, score)
4. Return results

At the moment, termRanker() employs nullFilter() from filters.py and cossim() from similarity.py. The null filter returns the descriptions without modification, so all rankings are based solely on word frequencies.
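A minimal sketch of that flow, with the filter and similarity functions passed in as parameters (their exact signatures are assumptions on my part):

def termRanker(description, term_dict, filt, sim):
    # 1. filter the descriptions
    filtered = dict((name, filt(desc)) for name, desc in term_dict.items())
    # 2. score each filtered description against the reference description
    scored = [(sim(description, desc), name) for name, desc in filtered.items()]
    # 3. produce a ranking in order of decreasing score
    scored.sort(reverse=True)
    ranking = {}
    for rank, (score, name) in enumerate(scored, 1):
        ranking[rank] = (name, term_dict[name], score)
    # 4. return results
    return ranking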

termRanker() works fairly well on descriptions that contain similar but distinctive words. For example, ranking the FGDC term 'access constraints' returns:
  1. access constraints
  2. metadata access constraints
  3. use constraints
  4. metadata use constraints
  5. security information
  6. taxonomic completeness
  7. ordering information
  8. etc....
The first five results were the same when I passed a paraphrase of the description of access constraints into termRanker(). It should be noted, however, that the descriptions of the first four hits are very similar in structure. Using other terms as the reference term, for example 'metadata citation', returns nonsense.

TO DO:
  • Add a function to prep_fns.py to undo escapes for XML
  • Test termRanker() with different filter functions. This requires figuring out how to call functions through a variable; since termRanker() will call different filters, they can't be hard-coded.
  • Maybe look into using chunkers as filter functions.
  • Is there any good way to compare the quality of the rankings produced by different filters and filter combinations?

Wednesday, July 15, 2009

XML Update

Got the XML issue cleared up. When I was writing the function, I forgot to replace things like '<' and '>' with escape characters. I added a function to fgdc_extractor.py that takes a string and returns the same string, with all necessary characters escaped for XML. This is called by the extract() function before it writes anything to file.
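The standard library does most of the work here. A sketch along those lines (the real helper in fgdc_extractor.py may differ):

from xml.sax.saxutils import escape

def escape_for_xml(text):
    # escapes '&', '<' and '>' by default; quotes go in the entities argument
    return escape(text, {'"': '&quot;', "'": '&apos;'})

print(escape_for_xml('<attr domain="text">'))
# &lt;attr domain=&quot;text&quot;&gt;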

I sent the output to Namrata, and she has already confirmed that she can open it. (The new XML file is in the SVN trunk as well, by the way.)

More Filter Functions

I added six functions to filters.py. They retrieve the following related words from lemmas and/or synsets of the words in a term description:
  • hypernym distances
  • hypernyms
  • instance hypernyms
  • member holonyms
  • part holonyms
  • substance holonyms
As with yesterday's functions, these can be applied to either lemmas (i.e. words) or synsets. The lemma option performs very poorly most of the time, which is odd because synsets contain lemmas (I think). But rather than returning a subset of the synset option's output, the lemma option typically adds nothing at all to the description.
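One possible explanation: as far as I can tell, hypernymy and holonymy are stored in WordNet as semantic relations between synsets, so lemma-level pointers for them come back empty almost everywhere. A quick probe with NLTK (current API; method names may differ across versions):

from nltk.corpus import wordnet as wn

for synset in wn.synsets('coordinate'):
    print(synset.name(), '->', [h.name() for h in synset.hypernyms()])
    for lemma in synset.lemmas():
        # lexical (lemma-level) hypernym pointers: usually just []
        print('   ', lemma.name(), '->', lemma.hypernyms())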

It's interesting that the WordNet online browser returns direct hypernyms and hyponyms, full hypernyms and hyponyms, inherited hypernyms and sister terms, etc. but the NLTK functions don't make those distinctions. Is there another Python library that has a better representation of WordNet than NLTK?

Tuesday, July 14, 2009

Today's commits

Here are the changes I committed to the SVN trunk today:

filters.py and similarity.py
  • Fixed the keywords.
fgdc_extractor.py
Minor bug fixes:
  • Bug in dictReader() that kept it from opening the file when the path was passed in as a parameter.
  • Bug in extract() that asked for the path to the hierarchy file even when a hierarchy dictionary was provided as a parameter.
filters.py
Added several WordNet-based functions to retrieve the following types of related words for every word in a term description:
  • hyponyms
  • instance hyponyms
  • member meronyms
  • part meronyms
  • substance meronyms
  • entailments
  • attributes
  • causes
  • also-sees
  • verb groups
  • similar-tos
  • pertainyms
  • derivationally related forms
In each case, the results are appended to the description and the full list is returned. There is also an option to include the hyponyms/meronyms/etc. for each word AND its synonyms. I've included this because there are often no results returned when only the word is considered. I'm not sure why that is: either WordNet is less rich than expected, or there's a very subtle error in the code.
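A sketch of the synonym option, using hyponyms as the example relation; the function name and shape are illustrative, since the committed versions operate on whole descriptions:

from nltk.corpus import wordnet as wn

def hyponym_words(word, include_synonyms=False):
    # start from the word itself, optionally adding all of its synonyms
    words = set([word])
    if include_synonyms:
        for synset in wn.synsets(word):
            words.update(lemma.name() for lemma in synset.lemmas())
    related = []
    for w in words:
        for synset in wn.synsets(w):
            for hypo in synset.hyponyms():
                related.extend(lemma.name() for lemma in hypo.lemmas())
    return related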

TO DO
  • Have a look at the WordNet hierarchy.
  • Figure out why the XML output of fgdc_extractor.extract() is producing errors. Namrata has suggested that this may be due to the presence of '<' in the text.
  • Write filter functions to get hypernyms and holonyms.
  • Write a function to filter term descriptions and rank them relative to a given description.

Wednesday, July 8, 2009

Added to the SVN trunk:
  1. nltk_stopwords.txt: the list of stop words taken from the NLTK web site. A lot of them aren't applicable to our purposes (e.g. yourselves, alone, greetings), but some of them we may want to treat as useful words (e.g. contains, value).
  2. vdc_stopwords.txt: a more basic list, containing mostly obvious stop words (e.g. the, as, or) and FGDC-specific words (e.g. repeat, which only appears in "repeat as needed").
Links to the stop word lists can be found here:
https://code.ecoinformatics.org/code/vdc/projects/machlearn/trunk/

Modifications to prep_fns.py:
  1. Added readList(), a function that reads the lines of a file into a list, primarily for use in reading files of stop words (see the sketch below).
  2. Modified readTerms() to lowercase all input. This facilitates comparisons between tokens.
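A quick sketch of the new pieces in action (assuming the obvious signatures):

def readList(path):
    # read the lines of a file into a list, stripping trailing whitespace
    with open(path) as f:
        return [line.strip() for line in f]

stop = set(readList('vdc_stopwords.txt'))
tokens = ['The', 'East', 'Bounding', 'Coordinate']
tokens = [t.lower() for t in tokens if t.lower() not in stop]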
TO DO:
  1. Encapsulate some WordNet functions to use in processing term descriptions.
  2. Write a function that filters and compares descriptions, and returns a similarity ranking for the entire list (with respect to a single term).

Tuesday, July 7, 2009

some new code

I committed a couple of things to the svn repository:

1. similarity.py
Contains a function to calculate the cosine similarity between two token lists. These lists can either be tokenized descriptions or processed versions thereof.

2. filters.py
Contains a number of functions to process and modify tokenized descriptions, including two stemming functions, a function to throw out stop words, etc.

similarity scoring

After doing a lot of reading, I've decided to use cosine similarity to quantify the similarity between lists of tokens.

Cosine similarity is the cosine of the angle between two vectors that represent the token lists.

e.g. If area and volume are defined as:
area: length multiplied by width
volume: length multiplied by width multiplied by height

we can create a vector for each description, with as many cells as unique words in the set of BOTH descriptions (length, multiplied, by, width, height). The value in each cell is the number of occurrences of that word in the description.

So we would get something like this:
area = [1, 1, 1, 1, 0]
volume = [1, 2, 2, 1, 1]

From here, the calculation of the cosine is straightforward: cos(θ) = (A · B) / (|A| |B|), i.e. the dot product of the two vectors divided by the product of their magnitudes.

For the current example, the result would be about 0.90 (i.e. the descriptions are extremely similar).
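A minimal implementation of the idea, along the lines of cossim() in similarity.py (whose actual code may differ):

import math

def cosine_similarity(tokens_a, tokens_b):
    vocab = sorted(set(tokens_a) | set(tokens_b))
    vec_a = [tokens_a.count(w) for w in vocab]
    vec_b = [tokens_b.count(w) for w in vocab]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

area = 'length multiplied by width'.split()
volume = 'length multiplied by width multiplied by height'.split()
print(round(cosine_similarity(area, volume), 2))  # 0.9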

Advantages of Using Cosine Similarity
  • Widely used in linguistics
  • Easy to implement
  • Flexible: Vectors can be defined in multiple ways, e.g. by deleting stop words, adding synonyms, weighting rare words, or keeping only lexical chains.
  • Similarity is based on content, not on order (If it becomes clear that word order is important in the similarity of descriptions, we can weight the similarity value with an order-dependent coefficient).
Limitations
  • Does not take word sense into account. But this is true of most, if not all, single-equation measures. If it becomes an issue, I think that word sense can be dealt with at the time of vector formation.
UPDATE:
1. Wrote a function to calculate cosine similarity given two token lists.
2. Eclipse is working a whole lot better than IDLE. No lagging at all yet.

Thursday, July 2, 2009

Notes from Teleconference

The Question
How do we quantify similarity between descriptions? A lot of the usual measures rely heavily on sequence order, but not on the meaning of sequence elements, which makes sense with DNA sequences and character strings, but not so much with lists of words.

Some Different Approaches From Various Fields

Translation (Between natural languages)
Uses documents with the same content in different languages in training sets. Since different languages have different word orders, there might be some similarity measures defined in these papers that are primarily content-based.
Look into ACL conference proceedings (ACLweb.org)

Document Classification
Uses word counts and semantic overlap to quantify similarity. This is done by stemming the words in each document and throwing out the stop words (i.e. common words with little or no semantic content). The results are usually compared by cosine similarity.

Caveat: each document should be large (whereas term descriptions are typically only a sentence or two).

Description size could be increased by pulling in the synonyms of every word. But we'd have to be careful of what sense the original word is used in, and whether the synonyms are used in the same sense.
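A quick sketch of that preprocessing step, using NLTK's Porter stemmer (the project's own stemming and stop-word functions live in filters.py):

from nltk.stem.porter import PorterStemmer

def stem_and_filter(tokens, stopwords):
    # drop stop words, then reduce each surviving token to its stem
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stopwords]

print(stem_and_filter(['the', 'bounding', 'coordinates'], set(['the'])))
# ['bound', 'coordin']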

Latent Semantic Indexing
Looks for the main dimensions of variation, similar to PCA and other eigenvalue methods. Uses a name-by-description matrix.
Look into how this works and if it's possible to implement in a short-term project.

More on Stop Words
Several lists of common stop words already exist. Typically, people subtract/add words as needed for the particular application (e.g. remove 'across' because it contains information about spatial relationships, add 'data' because all descriptions are about data).

WordNet
WordNet is a large thesaurus compiled by hand (NOT automated). It's structured as a hierarchy tree such that the synonyms, hypernyms, hyponyms, meronyms, etc. of a word can be located via tree traversals.
Can be used to build something similar to a latent semantic index. This would probably be error-prone, though; e.g. the most common sense of a word isn't necessarily the one we want.
Is there a way to look at ontological distances between any two words in the hierarchy tree?
Look into ontology alignment.

Other Comments
  • Good parsers/taggers/chunkers are slow, especially when used on long sentences. Is that a problem here?
  • It's conceivable that term descriptions may not contain complete sentences, which will likely cause problems for most parsers. Perhaps there are resources that help determine whether a sentence is well-formed.
  • Given the length of this project, it would be easier to stick with unsupervised rather than semi-supervised learning algorithms.


Short Term Goals
Work on functions to handle stemming and word counts.
Look into ACL documents, latent semantic indexing, ontology alignment and the capabilities of WordNet.