I've written a function called termRanker() that takes a tokenized description and a dictionary mapping term names to their (tokenized) descriptions. It returns a dictionary whose keys are ranks and whose values are tuples of the form (term name, term description, score), which are easy to unpack.
An Overview of termRanker's Inner Workings:
IN: description, dictionary of terms: descriptions
OUT: A dictionary of rank: (name, description, score)
1. filter descriptions
   a. decide on filter(s) and how to compound them if necessary
   b. create new dictionary with
      key = term name
      value = filtered description
2. compute similarity scores
   a. call similarity() function
   b. create new dictionary with
      key = score
      value = term name
3. produce a ranking
   a. pull up terms in the order of decreasing score
   b. assign rank
   c. store them in a dictionary with
      key = rank
      value = name, description, score
4. return results
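The outline above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual termRanker() code: the names term_ranker_sketch, null_filter and overlap_score are hypothetical, the similarity is a simple set-overlap placeholder rather than cossim(), and the reference description is assumed to pass through the same filter as the term descriptions. It also keeps (score, name) pairs in a list instead of a score-keyed dictionary, so that two terms with the same score do not overwrite each other.

```python
def null_filter(tokens):
    """Hypothetical stand-in for nullFilter(): return the tokens unchanged."""
    return list(tokens)


def overlap_score(tokens_a, tokens_b):
    """Placeholder similarity (assumption): Jaccard overlap of the token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def term_ranker_sketch(description, terms,
                       filter_fn=null_filter, similarity_fn=overlap_score):
    """Return {rank: (name, description, score)}, with rank 1 the best match."""
    # 1. filter descriptions (the reference description is filtered too)
    query = filter_fn(description)
    filtered = {name: filter_fn(tokens) for name, tokens in terms.items()}

    # 2. compute similarity scores, kept as (score, name) pairs so ties survive
    scored = [(similarity_fn(query, tokens), name)
              for name, tokens in filtered.items()]

    # 3. produce a ranking: decreasing score, ties broken alphabetically by name
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    ranking = {}
    for rank, (score, name) in enumerate(scored, start=1):
        ranking[rank] = (name, terms[name], score)

    # 4. return results
    return ranking


# Toy data, invented for illustration only.
terms = {
    "access constraints": ["restrictions", "on", "accessing", "the", "data"],
    "use constraints": ["restrictions", "on", "using", "the", "data"],
    "ordering information": ["how", "to", "order", "the", "data"],
}
ranking = term_ranker_sketch(
    ["restrictions", "on", "accessing", "the", "data"], terms)
name, description, score = ranking[1]  # unpack the top-ranked tuple
```

Passing the filter and the similarity function in as arguments keeps the ranking logic independent of whichever filter is being tried.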
At the moment, termRanker() uses nullFilter() from filters.py and cossim() from similarity.py. The null filter returns the descriptions without modification, so all rankings are based solely on word frequencies.
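For reference, a cosine similarity over raw word frequencies can be written in a few lines. The function below is a sketch of the general technique, assuming both inputs are lists of tokens; it is not the contents of similarity.py, and the name cosine_similarity is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two word-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)

    # Dot product over shared words; a Counter returns 0 for missing keys.
    dot = sum(count * freq_b[token] for token, count in freq_a.items())

    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_a * norm_b)
```

With no filtering, common words such as "the" and "of" contribute to the dot product like any other word, which is one reason descriptions with similar sentence structure can score highly against each other.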
termRanker() works fairly well on descriptions that contain similar but distinctive words. For example, ranking the FGDC term 'access constraints' returns:
- access constraints
- metadata access constraints
- use constraints
- metadata use constraints
- security information
- taxonomic completeness
- ordering information
- etc.
The first five results were the same when I passed a paraphrase of the description of access constraints into termRanker(). It should be noted, however, that the descriptions of the first four hits are very similar in structure. Using other terms as the reference term, for example 'metadata citation', returns nonsense.
TO DO:
- Add a function to prep_fns.py to undo escapes for XML
- Test termRanker() with different filter functions. Since termRanker() will need to call different functions, they can't be hard-coded; this requires figuring out how to call a function through some sort of variable.
- Maybe look into using chunkers as filter functions.
- Is there any good way to compare the quality of the rankings produced by different filters and filter combinations?
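On the question of calling functions through a variable: in Python, functions are ordinary objects, so they can be passed as arguments, stored in a dictionary, and combined. The sketch below illustrates the idea; every name in it (lowercase_filter, short_word_filter, compose, FILTERS, apply_filter) is hypothetical and none of them is taken from filters.py.

```python
def null_filter(tokens):
    """Hypothetical stand-in for nullFilter(): return the tokens unchanged."""
    return list(tokens)


def lowercase_filter(tokens):
    """Example filter: lowercase every token."""
    return [token.lower() for token in tokens]


def short_word_filter(tokens, min_length=3):
    """Example filter: drop tokens shorter than min_length characters."""
    return [token for token in tokens if len(token) >= min_length]


def compose(*filters):
    """Compound several filters into one, applied left to right."""
    def combined(tokens):
        for filter_fn in filters:
            tokens = filter_fn(tokens)
        return tokens
    return combined


# A registry makes it possible to pick a filter by name at run time.
FILTERS = {
    "null": null_filter,
    "lower": lowercase_filter,
    "lower+short": compose(lowercase_filter, short_word_filter),
}


def apply_filter(tokens, filter_fn=null_filter):
    """The caller decides which function runs by passing it in."""
    return filter_fn(tokens)


filtered = apply_filter(["The", "Data", "of", "X"], FILTERS["lower+short"])
```

A ranking function could take filter_fn as a parameter in the same way, so that trying a new filter or filter combination means changing one argument rather than editing the function body.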