Tuesday, August 4, 2009

Sidetracks and Bug Fixes

Since Friday, I've been working on a new option for formatting the XML term file. The file I have right now associates each term with its description. But for purposes of compatibility with Namrata's output, I'm writing code to produce an alternate output, where each term contains its own description, as well as the description of every term above it in its hierarchy path.

Here are the (somewhat convoluted) steps to getting that output:
1. Extract the hierarchy paths from the hierarchy XML file (more on this in a minute).
2. Extract the FGDC terms as before.
3. Read them into a dictionary with key = 'PATH' and value = 'DESC'
4. Call expandDesc() to expand the descriptions to include everything from the path
5. Write to a new XML file with dictToXML()

This is how it would work, anyway, if there wasn't a giant problem with the first step.

As it turns out, this unearthed a bug in the original code. Originally, the hierarchy paths were read into a dictionary, whose keys were the short names of the terms. But of course, this caused multiple paths to be overwritten. So I've deprecated the paths() function and written getPaths(), which returns a list of ALL paths.

Now I just have to figure out how to associate the right path to the right term when writing to the XML file. And then it'll be back to doing ranking comparisons.

UPDATE: Since the appearance of a term in multiple hierarchy paths appears to be multiple instances of the same term (i.e. each term is defined only once in the hierarchy file), the extract() function will make as many entries for that term as it has paths. This preserves the hierarchy structure, with xpath as a unique identifier.

UPDATE: The FGDC Bio Standard Definition and hierarcy files that I have don't match up. It looks like there are a number of valid paths in the hierarchy file that are not addressed in the standard definition.

No comments:

Post a Comment