Feb 8, 2006
Patrick has built an interface to the parsing process so you can see how things are progressing.
First, we have sectioned the records into sets that follow a few broad patterns. There are records that appear to list a name and then a variant, and records that consist of a name in brackets and little else. Together these total about 3,500 records.
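To make the sectioning step concrete, here is a minimal sketch in Python. It is not the code behind the application. The record layouts are not shown in this post, so both regular expressions, the "name = variant" layout, and the function name `section` are assumptions made up for illustration.

```python
import re

# Hypothetical pattern: a record that is a name in brackets and little else,
# e.g. "[Aus bus]". The real bracketed records may look different.
BRACKETED = re.compile(r"^\s*\[[^\]]+\]\s*$")

# Hypothetical pattern: a name followed by a variant, assumed here to be
# written as "name = variant". The actual separator is an assumption.
NAME_VARIANT = re.compile(
    r"^\s*(?P<name>[^=\[\]]+?)\s*=\s*(?P<variant>[^=\[\]]+?)\s*$"
)

def section(record: str) -> str:
    """Assign a raw record to one of the broad sets, or to 'rest'."""
    if BRACKETED.match(record):
        return "bracketed"
    if NAME_VARIANT.match(record):
        return "name_variant"
    return "rest"
```

Anything that falls into "rest" would then be handed on to the later parsing passes.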
Then there is all the rest. Our process has been to run various passes against the remaining records. Each pass uses a different set of expressions, and the passes are grouped into Rounds 1-11 according to similar parsing challenges. Each round is partially parsed already, but you will notice that some rounds are cleaner than others. The rounds labelled "first round parsing done" are pretty clean, whereas the rest are still being tackled. Round 11, for example, has nomenclatural annotations scattered here and there, and some of the author names break the bounds between author and citation.
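The pass-by-pass approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the two rounds, their expressions, the field names (`name`, `author`, `citation`), and the function `run_passes` are hypothetical and stand in for the real Rounds 1-11, whose expressions are not given here.

```python
import re

# Hypothetical rounds, each holding the expressions for one pass.
# The real rounds and their expressions are not shown in this post.
ROUNDS = {
    1: [re.compile(
        r"^(?P<name>[A-Z][a-z]+ [a-z]+) (?P<author>[^,]+), (?P<citation>.+)$"
    )],
    2: [re.compile(r"^(?P<name>[A-Z][a-z]+ [a-z]+) (?P<author>.+)$")],
}

def run_passes(records):
    """Run each round in order; records no round can parse are left over."""
    parsed = {}
    remaining = list(records)
    for round_no in sorted(ROUNDS):
        still_unparsed = []
        for rec in remaining:
            for pattern in ROUNDS[round_no]:
                match = pattern.match(rec)
                if match:
                    # Record which round parsed it, plus the captured fields.
                    parsed[rec] = {"round": round_no, **match.groupdict()}
                    break
            else:
                # No expression in this round matched; try the next round.
                still_unparsed.append(rec)
        remaining = still_unparsed
    return parsed, remaining
```

Keeping the raw record as the key means each parsed result can still be traced back to its original text, and whatever remains after the last round is the set still being tackled.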
Anyhow, the working application links you to a current summary and you can view large blocks or individual records. A details page links the parsed record to the original raw record. The URL is formatted as: