Feb 1, 2006


I've found a small problem with the CSV output format that may cause trouble for anyone who tries to import the files into a database. The problem comes from using a Comma-Separated-Values format for data that is already full of commas. Normally, if your data contains commas, you wrap each field in quotes so that a parser can tell which commas separate fields and which commas are part of the data.

The first problem with the import is that loading the CSV file into a database management system gives you a table with 16 columns. Columns 1-3 are the added uid, volume and page metadata. The rest are the data itself, so it's no trouble to concatenate columns 4-16 back into the raw data record. The bigger problem is that the raw data also contains quotes. When the parser reads a quote, it assumes everything up to the next quote is a single field, and that next quote may be hundreds of records away. You can end up with single "records" that actually contain hundreds of real records.
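To make the runaway-quote behaviour concrete, here is a small sketch using Python's csv module. The two sample records are invented for illustration (the real export has more columns), and other database importers may treat stray quotes differently, so take it as an approximation of what happens rather than a reproduction of any particular import tool.

```python
import csv
import io

# Two invented records. An opening quote sits at the start of a field in the
# first record, and the next quote does not appear until the second record.
raw = (
    'u1,1,10,Smith,"Jack, farmer\n'
    'u2,1,10,Jones,Mary" widow\n'
)

# Parse with the default dialect (comma delimiter, double-quote quoting).
rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(raw.splitlines()))  # number of physical lines in the input
print(len(rows))              # number of records the parser reports
```

With the default dialect, the parser should report fewer records than there are lines, because everything between the two quotes is swallowed into one field.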

The fix is simple. I'd recommend making the downloadable file a tab-separated file. You can use a regular expression to convert the file. I did this:

^([^,]+),([^,]+),([^,]+),
to
\1\t\2\t\3\t

This expression changes the first 3 commas on each line to tabs and leaves the commas in the data alone.
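For anyone who would rather script the conversion than run it in an editor, the same substitution can be sketched in Python. The function name and the sample line are made up for illustration. One small change from the expression above: the character class also excludes newlines, so that a line with fewer than three commas cannot cause a match that runs onto the next line.

```python
import re

# First three comma-terminated fields at the start of each line.
# [^,\n] instead of [^,] keeps a match from spanning more than one line.
LEADING_FIELDS = re.compile(r"^([^,\n]+),([^,\n]+),([^,\n]+),", re.MULTILINE)


def leading_commas_to_tabs(text: str) -> str:
    """Replace the first three commas on each line with tabs."""
    return LEADING_FIELDS.sub(r"\1\t\2\t\3\t", text)


# Invented sample line: uid, volume, page, then raw data containing commas.
sample = 'u1,12,345,Smith, John, "farmer", b. 1820\n'
converted = leading_commas_to_tabs(sample)
print(repr(converted))
```

Only the metadata separators are touched, so the raw data record comes through as one field with its own commas and quotes intact.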