Alex Tweedly wrote:

Many years ago (2004 ?) I posted code to do something like this using
split/combine to differentiate between 'inside' and 'outside' field
delimiters. It was very fast - but pretty hard to follow, and I don't
remember now which obscure cases it handled (we haven't even mentioned
doubled characters and backslash escaped field delimiters yet :-)

...and then there are in-data returns and other anomalies to account for, which were quite challenging when this was hashed out here back in '04.

A popular data set for testing the effectiveness of a CSV parser is this one, which we find on many pages discussing the evils of CSV:


FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123


Handy as it is, I like to spice it up by including an in-data return, which is an acceptable practice in CSV (note "At the Plaza"):


FirstName,LastName,Address,City,State,Zip
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At
 the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123


Back in that '04 discussion we kicked around a number of algos here and from that I came up with one that walked through each character, keeping track of when it was inside of quoted data and when it wasn't so it could assemble the resulting tab-delimited data sensibly. It was accurate, but slow.
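For anyone who wants to see the shape of that character-walking approach, here is a minimal sketch in Python. It is not the function from the '04 thread: the name csv_to_tab and the exact handling are my own assumptions for illustration. It tracks whether it is inside quoted data, turns doubled quotes into a single quote, and replaces in-data returns with ASCII 11.

```python
VT = "\x0b"  # ASCII 11, stands in for a return inside a field


def csv_to_tab(csv_text: str) -> str:
    """Walk each character, tracking whether we are inside quoted data.

    Hypothetical helper for illustration only. Assumes "\n" line endings
    and that the data itself contains no tab characters.
    """
    out = []
    in_quotes = False
    i = 0
    n = len(csv_text)
    while i < n:
        ch = csv_text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and csv_text[i + 1] == '"':
                    out.append('"')   # doubled quote is a literal quote
                    i += 1            # skip the second quote of the pair
                else:
                    in_quotes = False  # closing quote
            elif ch == "\n":
                out.append(VT)        # in-data return
            else:
                out.append(ch)        # includes in-data commas
        else:
            if ch == '"':
                in_quotes = True      # opening quote
            elif ch == ",":
                out.append("\t")      # field delimiter
            else:
                out.append(ch)        # includes record-ending returns
        i += 1
    return "".join(out)
```

Fed the "At the Plaza" record above, this keeps the address as one field with an ASCII 11 where the return was, and leaves the return at the end of the record alone.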

Famous for being able to speed up darn near any parsing task with a split command, you modified the algo to use an array, and even with the overhead of split it was about five times faster.

I had integrated that idea into the original accurate-but-slow function, and with your mods it was then accurate-and-fast.
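To give a rough feel for why splitting helps, here is one way the inside/outside idea can be sketched in Python. This is an assumption-laden illustration, not the code in the linked post: the name csv_to_tab_split is hypothetical, and the original used Rev's split into an array rather than a Python list. Splitting on the quote character leaves the pieces alternating between outside and inside quoted data, so each piece can be handled with one bulk replace instead of a per-character loop.

```python
VT = "\x0b"  # ASCII 11, stands in for a return inside a field


def csv_to_tab_split(csv_text: str) -> str:
    """Split on the quote character and treat the pieces by position.

    Hypothetical helper for illustration only. Assumes well-formed CSV,
    "\n" line endings, and no tab characters in the data.
    """
    parts = csv_text.split('"')
    last = len(parts) - 1
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd pieces sit inside quotes: keep commas, swap returns.
            out.append(part.replace("\n", VT))
        elif part == "" and 0 < i < last:
            # An empty outside piece between two inside pieces means the
            # source had a doubled quote, so emit one literal quote.
            out.append('"')
        else:
            # Even pieces sit outside quotes: commas are field delimiters.
            out.append(part.replace(",", "\t"))
    return "".join(out)
```

The doubled-quote case is the part that is easy to get wrong with this approach, which may be one reason the real thing was described as hard to follow.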

But three months later you revisited the thread to note an anomaly with Rev's handling of array keys which required one more modification in order to remain robust across larger data sets.

The final result, with your note on the array key mod, is here, and in my tests it not only outperforms most other alternatives but also more accurately preserves in-data quotes and in-data returns*:

<http://lists.runrev.com/pipermail/use-livecode/2004-October/045496.html>

At the end of that post you noted:

  Obviously it will be slower - but "slow and correct" beats "fast
  and wrong" :-)

FWIW, more recent testing shows the difference in applying your last mod to be about a microsecond using the test data above, not the sort of speed impairment worth worrying about.



Good moral upbringing and a sense of responsibility to humanity compel me to note Postel's Law in any discussion of the ridiculously insane inefficiency inherent in parsing CSV:

  "Be liberal in what you accept, and
   conservative in what you send"
<http://www.ietf.org/rfc/rfc1122.txt>

While it may be necessary from time to time to be able to import CSV, the format is long overdue for extinction and should, for the benefit of a saner and more productive world, never be exported.

A longer rant on this is here for those amused by such things:
<http://lists.runrev.com/pipermail/use-livecode/2010-March/136194.html>

:)


* The resulting tab-delimited format follows the convention used by FileMaker Pro and others to use ASCII 11 as the replacement for return within data.

--
 Richard Gaskin
 Fourth World
 LiveCode training and consulting: http://www.fourthworld.com
 Webzine for LiveCode developers: http://www.LiveCodeJournal.com
 LiveCode Journal blog: http://LiveCodejournal.com/blog.irv
