Hi Richard,

Benchmarking is a can of worms; many factors have to be considered. But the 
first requirement is obviously to be completely open about what you are doing 
and what you are comparing.

NeoCSV contains a simple benchmark suite called NeoCSVBenchmark, which was used 
during development. Note that it is a bit tricky to use: you need to run a 
write benchmark with a specific configuration before you can try read 
benchmarks.

The core data is a 100,000-line file (2.5 MB) like this:

1,-1,99999
2,-2,99998
3,-3,99997
4,-4,99996
5,-5,99995
6,-6,99994
7,-7,99993
8,-8,99992
9,-9,99991
10,-10,99990
...

That parses in ~250ms on my machine.
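
For reference, that number comes from something along these lines (a sketch, not 
the actual NeoCSVBenchmark code; here the data is built in memory instead of read 
from a file):

  | data duration records |
  data := String streamContents: [ :out |
      1 to: 100000 do: [ :i |
          out
              print: i; nextPut: $,;
              print: i negated; nextPut: $,;
              print: 100000 - i; nextPut: Character lf ] ].
  duration := [ records := (NeoCSVReader on: data readStream) upToEnd ] timeToRun.
  records size.   "100,000 records, each an Array of three Strings in the default configuration"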

NeoCSV has quite a few features and handles various edge cases. Obviously, a 
minimal, custom implementation could be faster.

NeoCSV is called efficient not just because it is reasonably fast, but because 
it can be configured to generate domain objects without intermediate structures 
and because it can convert individual fields (parse numbers, dates, times, ...) 
while parsing.
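
For the sample data above, for example, the reader could be configured along 
these lines (DataPoint and its setter selectors are made up for the example):

  | input reader |
  input := String lf join: #('1,-1,99999' '2,-2,99998' '3,-3,99997').
  reader := NeoCSVReader on: input readStream.
  reader
      recordClass: DataPoint;          "hypothetical domain class"
      addIntegerField: #id:;           "each field converted while parsing"
      addIntegerField: #negation:;
      addIntegerField: #countdown:.
  reader upToEnd                       "Array of DataPoint instances, no intermediate rows"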

Like you said, some generated CSV output out in the wild is very irregular. I 
try to stick with standard CSV as much as possible.

Sven

> On 6 Jan 2021, at 05:10, Richard O'Keefe <rao...@gmail.com> wrote:
> 
> NeoCSVReader is described as efficient.  What is that
> in comparison to?  What benchmark data are used?
> Here are benchmark results measured today.
> (5,000 data line file, 9,145,009 characters).
>  method                time(ms)
>  Just read characters   410
>  CSVDecoder>>next      3415   astc's CSV reader (defaults). 1.26 x CSVParser
>  NeoCSVReader>>next    4798   NeoCSVReader (default state). 1.78 x CSVParser
>  CSVParser>>next       2701   pared-to-the-bone CSV reader. 1.00 reference.
> 
> (10,000 data line file, 1,544,836 characters).
>  method                time(ms)
>  Just read characters    93
>  CSVDecoder>>next       530   astc's CSV reader (defaults). 1.26 x CSVParser 
>  NeoCSVReader>>next     737   NeoCSVReader (default state). 1.75 x CSVParser 
>  CSVParser>>next        421   pared-to-the-bone CSV reader. 1.00 reference.
> 
> CSVParser is just 78 lines and is not customisable.  It really is
> stripped to pretty much an absolute minimum.  All of the parsers
> were configured (if that made sense) to return an Array of Strings.
> Many of the CSV files I've worked with use short records instead
> of ending a line with a lot of commas.  Some of them also have the occasional 
> stray comment off to the right, not mentioned in the header.
> I've also found it necessary to skip multiple lines at the beginning
> and/or end.  (Really, some government agencies seem to have NO idea
> that anyone might want to do more with a CSV file than eyeball it in
> Excel.)
> 
> If there is a benchmark suite I can use to improve CSVDecoder,
> I would like to try it out.
> 
> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de 
> <jtuc...@objektfabrik.de> wrote:
> Happy new year to all of you! May 2021 be an increasingly less crazy 
> year than 2020...
> 
> 
> I have a question that sounds a bit strange, but we have two effects 
> with NeoCSVReader related to wrong definitions of the reader.
> 
> One effect is that reading a Stream with #upToEnd leads to an endless loop; 
> the other is that the Reader produces twice as many objects as there are 
> lines in the file that is being read.
> 
> In both scenarios, the reason is that the CSV Reader has a wrong number 
> of column definitions.
> 
> Of course that is my fault: why do I feed a "malformed" CSV file to poor 
> NeoCSVReader?
> 
> Let me explain: we have a few import interfaces which end users can 
> define using a more or less nice assistant in our Application. The CSV 
> files they upload to our App come from third parties like payment 
> providers, banks and other sources. These change their file structures 
> whenever they feel like it and never tell anybody. So a CSV import that 
> may have been working for years may one day tear a whole web server 
> image down because of a wrong number of fieldAccessors. This is bad on 
> many levels.
> 
> You can easily try the doubling effect at home: define a working CSV 
> Reader and comment out one of the addField: commands before you use the 
> NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 
> columns each. If you remove one of the fieldAccessors, an #upToEnd will 
> yield an Array of 6 objects rather than 3.
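> 
> In code, roughly (using positional Array fields here rather than our 
> accessor-based setup, just to keep it short):
> 
>   | reader |
>   reader := NeoCSVReader on: (String lf join: #('a,b,c,d' 'e,f,g,h' 'i,j,k,l')) readStream.
>   reader addField; addField; addField.   "only 3 field definitions for 4 columns"
>   reader upToEnd size.                   "answers 6 rather than the expected 3"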
> 
> I haven't found the reason for the cases where this leads to an endless 
> loop, but at least this one is clear...
> 
> I *guess* this is due to the way #readEndOfLine is implemented. It seems 
> to not peek forward to the end of the line. I have the gut feeling 
> #peekChar should peek instead of reading the #next character from the 
> input Stream, but #peekChar has too many senders to just go ahead and 
> mess with it ;-)
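> 
> (For comparison, on a plain ReadStream:
> 
>   | s |
>   s := 'abc' readStream.
>   s peek. s peek.   "both answer $a, the position does not move"
>   s next. s next.   "answers $a, then $b, advancing the stream"
> )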
> 
> So I wonder if there are any tried approaches to this problem.
> 
> One thing I might do is not use #upToEnd, but read each line using 
> PositionableStream>>#nextLine and first check whether the number of 
> separators in each line matches the number of fieldAccessors minus 1 (and go through 
> the hoops of handling separators in quoted fields and such...). Only if 
> that test succeeds, I would then hand a Stream with the whole line to 
> the reader and do a #next.
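> 
> Roughly (a sketch that ignores separators inside quoted fields; csvStream, 
> fieldAccessors and #configuredReaderOn: stand in for our import machinery):
> 
>   | results line |
>   results := OrderedCollection new.
>   [ csvStream atEnd ] whileFalse: [
>       line := csvStream nextLine.
>       (line occurrencesOf: $,) = (fieldAccessors size - 1)
>           ifFalse: [ self error: 'line does not match the import definition' ].
>       results add: (self configuredReaderOn: line readStream) next ].
>   results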
> 
> This will, however, mean a lot of extra cycles for large files. Of 
> course I could do this only for some lines, maybe just the first one. 
> Whatever.
> 
> 
> But somehow I have the feeling I should get an exception telling me the 
> line is not compatible with the Reader's definition or such. Or 
> #readAtEndOrEndOfLine should just walk the line to the end and ignore 
> the rest of the line, returning an incomplete object...
> 
> 
> Maybe I am just missing the right setting or switch? What best practices 
> did you guys come up with for such problems?
> 
> 
> Thanks in advance,
> 
> 
> Joachim
> 