On 18/10/2015 03:17, Peter M. Brigham wrote:
At this point, finding a function that does the task at all -- reliably and 
taking into account most of the csv malformations we can anticipate -- would be 
a start. So far nothing has been unbreakable. Once we find an algorithm that 
does the job, we can focus on speeding it up.

That is indeed the issue.

There are two distinct problems, and the "best" solutions for each may be different.

1. Optimistic parser.

Properly parse any well-formed CSV data, in any idiosyncratic dialect we may be interested in.

Or to put it another way: in general we are going to be parsing data produced by some program. It may take some oddball approach to CSV formatting, but it will be "correct" in the program's own terms. We are not (in this problem) trying to handle, e.g., hand-generated files that may contain errors, or files with deliberate errors embedded. So we do not expect things like mismatched quotes, and it is adequate to do "something reasonable" when given bad input data.
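
For concreteness, here is roughly the shape of thing I mean by 1: a quick sketch in Python rather than LiveCode (the function name, the delimiter parameter and the details are mine, not anything already posted on the list), just to pin down the behaviour. Note that it never raises; it simply does something plausible with odd input:

def csv_to_rows(text, delimiter=","):
    # Optimistic: assume the data is well-formed in *some* consistent
    # dialect. Never raises; odd input just yields "something reasonable".
    rows, row, cell = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if in_quotes:
            if c == '"' and i + 1 < n and text[i + 1] == '"':
                cell.append('"')          # doubled quote -> literal quote
                i += 1
            elif c == '"':
                in_quotes = False         # closing quote of the cell
            else:
                cell.append(c)            # includes delimiters and newlines
        elif c == '"' and not cell:
            in_quotes = True              # opening quote of a quoted cell
        elif c == delimiter:
            row.append("".join(cell))
            cell = []
        elif c in "\r\n":
            if c == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1                    # treat CRLF as one line break
            row.append("".join(cell))
            rows.append(row)
            row, cell = [], []
        else:
            cell.append(c)
        i += 1
    if cell or row:                       # last row without trailing newline
        row.append("".join(cell))
        rows.append(row)
    return rows

Because the enclosing quotes and doubled quotes are consumed while parsing, the cells come out already clean, which ties in with the output requirements further down; turning the rows into TSV is then just a matter of joining on tab, modulo escaping any literal tabs or returns inside cells.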

2. Pessimistic parser.

Just the opposite: detect any arbitrary malformation and report it with a sensible error message, while still properly parsing any well-formed CSV data in any dialect we might encounter.
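
Again purely as a sketch (Python, same caveats as above, and the particular checks are my guesses at what counts as a malformation): the same walk through the text, but instead of shrugging it reports the first problem it finds and where.

def first_csv_error(text, delimiter=","):
    # Pessimistic: walk the text and report the first malformation found,
    # with its physical line number; return None if the data looks well
    # formed.
    in_quotes = False
    at_cell_start = True
    line = 1
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c == "\n":
            line += 1
        if in_quotes:
            if c == '"':
                if i + 1 < n and text[i + 1] == '"':
                    i += 1                      # doubled quote: fine
                else:
                    in_quotes = False           # closing quote
                    nxt = text[i + 1] if i + 1 < n else ""
                    if nxt not in (delimiter, "\r", "\n", ""):
                        return f"line {line}: data after closing quote"
        else:
            if c == '"' and not at_cell_start:
                return f"line {line}: bare quote inside an unquoted cell"
            if c == '"':
                in_quotes = True
            at_cell_start = c == delimiter or c in "\r\n"
        i += 1
    if in_quotes:
        return "unterminated quoted cell at end of data"
    return None

A full solution to 2 would of course also build the rows as it goes; I have split the checking out here only to keep the example short.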

And common to both:
- adequate (optional) control over delimiters, escaped characters in the output, etc.
- efficiency (speed) matters

IMHO, we should also specify that the output should:
- remove the enclosing quotes from quoted cells
- reduce doubled quotes within a quoted cell to the appropriate single quote, so that the TSV (or array, or whatever output format is chosen) needs no further processing to remove them; i.e. the output data is clean of any CSV formatting artifacts (see the small sketch after this list).
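
To make that concrete (same caveats as the earlier Python fragments, and the function name is made up): a raw cell that still carries its CSV quoting, say "say ""hi"", please", should come out as: say "hi", please.

def clean_cell(raw):
    # Given one cell still carrying its CSV quoting, strip the enclosing
    # quotes and collapse doubled quotes, so no CSV artifacts survive.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1].replace('""', '"')
    return raw

# clean_cell('"say ""hi"", please"')  ->  'say "hi", please'

In the optimistic sketch above this cleanup happens during parsing itself; a separate pass like this one only makes sense if you first split the text on delimiters by some other means.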

Personally, I am a pragmatist, and I have always needed solution 1 above - whenever I've had to parse CSV data, it's because I had a real-world need to do so, and the data was coming from some well-behaved (even if very weird) application - so it was consistent and followed some kind of rules, however wacky those rules might be. Other people may have different needs.

So I believe that any proposed algorithm should be clear about which of these two distinct problems it is trying to solve, and should be judged accordingly. Then each of us can look for the most efficient solution to whichever one we care about most.

I do believe that any solution to problem 2 is also a solution to problem 1 - but I don't know if it can be as efficient while tackling that harder problem.

-- Alex.


