An Overview of termRanker's Inner Workings:
IN: a description, and a dictionary of term name: description
OUT: A dictionary of rank: (name, description, score)
1. filter descriptions
   a. decide on filter(s) and how to compound them if necessary
   b. create new dictionary with
      key = term name
      value = filtered description
2. compute similarity scores
   a. call similarity() function
   b. create new dictionary with
      key = score
      value = term name
3. produce a ranking
   a. pull up terms in the order of decreasing score
   b. assign rank
   c. store them in a dictionary with
      key = rank
      value = name, description, score
4. return results
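The four steps above can be sketched roughly as follows. This is only an illustration, not the actual termRanker() code: rank_terms(), null_filter() and overlap_similarity() are hypothetical names, and the score is a simple word-overlap placeholder standing in for the real similarity function.

```python
def null_filter(text):
    """Stand-in for a filter that returns the description unchanged."""
    return text


def overlap_similarity(a, b):
    """Placeholder score (an assumption): Jaccard overlap of lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_terms(description, terms):
    """Return {rank: (name, description, score)}, with rank 1 the best match."""
    # Step 1: filter every description, keyed by term name.
    filtered = {name: null_filter(text) for name, text in terms.items()}
    query = null_filter(description)

    # Step 2: score each term against the query description.
    scores = {name: overlap_similarity(query, text)
              for name, text in filtered.items()}

    # Step 3: order names by decreasing score and assign ranks from 1.
    ordered = sorted(scores, key=scores.get, reverse=True)

    # Step 4: return the ranking dictionary.
    return {rank: (name, terms[name], scores[name])
            for rank, name in enumerate(ordered, start=1)}


# Made-up example data, not real FGDC descriptions.
terms = {
    "access constraints": "restrictions on access to the data",
    "use constraints": "restrictions on use of the data",
    "ordering information": "how to order the data set",
}
ranking = rank_terms("restrictions on access to the data", terms)
# ranking[1][0] is "access constraints"
```

One deliberate difference from the outline: the sketch keys the score dictionary by term name rather than by score, so that two terms with the same score do not overwrite each other.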
At the moment, termRanker() uses nullFilter() from filters.py and cossim() from similarity.py. The null filter returns the descriptions without modification, so all rankings are based solely on the word frequencies of the unmodified descriptions.
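The code for cossim() isn't shown in this post. Assuming it is the usual cosine similarity over word-count vectors, a minimal sketch might look like this (cosine_similarity() is a hypothetical name, and splitting on whitespace is an assumption about tokenization):

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    # Count lowercase, whitespace-separated words in each text.
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction, so treat it as dissimilar to everything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Texts with no words in common score 0, and texts with the same word proportions score 1 (up to floating-point rounding), regardless of length.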
termRanker() works fairly well on descriptions that contain similar but distinctive words. For example, ranking the FGDC term 'access constraints' returns:
- access constraints
- metadata access constraints
- use constraints
- metadata use constraints
- security information
- taxonomic completeness
- ordering information
- etc.
TO DO:
- Add a function to prep_fns.py to undo escapes for XML
- Test termRanker() with different filter functions. This requires figuring out how to select the function to call through some sort of variable: since termRanker() will call different functions, they can't be hard-coded.
- Maybe look into using chunkers as filter functions.
- Is there any good way to compare the quality of the rankings produced by different filters and filter combinations?
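On calling functions through a variable: in Python a function is an ordinary value, so it can be passed as an argument, stored in a dictionary, or combined with other functions. The sketch below shows one possible approach. All the names in it (apply_filter(), compose(), FILTERS and the individual filters) are hypothetical, and unescape_xml() is only one guess at what the XML helper could look like, built on the standard library's xml.sax.saxutils.unescape.

```python
from functools import reduce
from xml.sax.saxutils import unescape


def null_filter(text):
    """Return the description unchanged."""
    return text


def lowercase_filter(text):
    """Hypothetical filter: fold the description to lowercase."""
    return text.lower()


def unescape_xml(text):
    """Hypothetical helper: turn &amp;, &lt; and &gt; back into &, < and >."""
    return unescape(text)


def compose(*filters):
    """Build one filter that applies the given filters from left to right."""
    return lambda text: reduce(lambda acc, f: f(acc), filters, text)


def apply_filter(descriptions, filter_fn=null_filter):
    """Apply whichever filter function was passed in to every description."""
    return {name: filter_fn(text) for name, text in descriptions.items()}


# Filters can also be looked up by name, e.g. from a command-line option.
FILTERS = {
    "null": null_filter,
    "lower": lowercase_filter,
    "unescape": unescape_xml,
}

descriptions = {"example term": "Fish &amp; Wildlife"}
cleaned = apply_filter(descriptions, compose(unescape_xml, lowercase_filter))
# cleaned == {"example term": "fish & wildlife"}
```

With this shape, trying a new filter or filter combination means changing only the argument, not the body of the ranking function.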