Another point: in open source and in this community, either the code people mention is open-source and accessible, or it does not exist. If it does not exist, then this is easy :)
S.

> On 6 Jan 2021, at 05:10, Richard O'Keefe <rao...@gmail.com> wrote:
>
> NeoCSVReader is described as efficient. What is that
> in comparison to? What benchmark data are used?
> Here are benchmark results measured today.
>
> (5,000 data line file, 9,145,009 characters).
> method                  time(ms)
> Just read characters         410
> CSVDecoder>>next            3415   astc's CSV reader (defaults).   1.26 x CSVParser
> NeoCSVReader>>next          4798   NeoCSVReader (default state).   1.78 x CSVParser
> CSVParser>>next             2701   pared-to-the-bone CSV reader.   1.00 reference.
>
> (10,000 data line file, 1,544,836 characters).
> method                  time(ms)
> Just read characters          93
> CSVDecoder>>next             530   astc's CSV reader (defaults).   1.26 x CSVParser
> NeoCSVReader>>next           737   NeoCSVReader (default state).   1.75 x CSVParser
> CSVParser>>next              421   pared-to-the-bone CSV reader.   1.00 reference.
>
> CSVParser is just 78 lines and is not customisable. It really is
> stripped to pretty much an absolute minimum. All of the parsers
> were configured (if that made sense) to return an Array of Strings.
> Many of the CSV files I've worked with use short records instead
> of ending a line with a lot of commas. Some of them also have the
> occasional stray comment off to the right, not mentioned in the header.
> I've also found it necessary to skip multiple lines at the beginning
> and/or end. (Really, some government agencies seem to have NO idea
> that anyone might want to do more with a CSV file than eyeball it in
> Excel.)
>
> If there is a benchmark suite I can use to improve CSVDecoder,
> I would like to try it out.
>
> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:
>
> Happy new year to all of you! May 2021 be an increasingly less crazy
> year than 2020...
>
> I have a question that sounds a bit strange, but we have two effects
> with NeoCSVReader related to wrong definitions of the reader.
>
> One effect is that reading a Stream with #upToEnd leads to an endless
> loop; the other is that the Reader produces twice as many objects as
> there are lines in the file that is being read.
>
> In both scenarios, the reason is that the CSV Reader has a wrong number
> of column definitions.
>
> Of course that is my fault: why do I feed a "malformed" CSV file to poor
> NeoCSVReader?
>
> Let me explain: we have a few import interfaces which end users can
> define using a more or less nice assistant in our Application. The CSV
> files they upload to our App come from third parties like payment
> providers, banks and other sources. These change their file structures
> whenever they feel like it and never tell anybody. So a CSV import that
> may have been working for years may one day tear a whole web server
> image down because of a wrong number of fieldAccessors. This is bad on
> many levels.
>
> You can easily try the doubling effect at home: define a working CSV
> Reader and comment out one of the addField: commands before you use the
> NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
> columns each. If you remove one of the fieldAccessors, an #upToEnd will
> yield an Array of 6 objects rather than 3.
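For anyone who wants to try that repro at home, here is a minimal workspace sketch of the misconfiguration Joachim describes. It uses NeoCSV's stock positional API (NeoCSVReader class>>#on:, #addField, #upToEnd); the sample data is made up, and the 6-record outcome is the behaviour reported above, which may vary with the NeoCSV version in your image:

    | input reader records |
    "Hypothetical sample data: 3 records of 4 columns each."
    input := String cr join: #('a1,b1,c1,d1' 'a2,b2,c2,d2' 'a3,b3,c3,d3').
    reader := NeoCSVReader on: input readStream.
    reader
        addField;   "column 1, read as a plain String"
        addField;   "column 2"
        addField.   "column 3; the fourth accessor is deliberately missing"
    records := reader upToEnd.
    records size
    "3 with a matching field count; 6 in the failure mode described above,
    because the reader resynchronises mid-record instead of signalling."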
> I haven't found the reason for the cases where this leads to an endless
> loop, but at least this one is clear...
>
> I *guess* this is due to the way #readEndOfLine is implemented. It seems
> to not peek forward to the end of the line. I have the gut feeling
> #peekChar should peek instead of reading the #next character from the
> input Stream, but #peekChar has too many senders to just go ahead and
> mess with it ;-)
>
> So I wonder if there are any tried approaches to this problem.
>
> One thing I might do is not use #upToEnd, but read each line using
> PositionableStream>>#nextLine and first check whether the number of
> separators in each line matches the number of fieldAccessors minus 1
> (and go through the hoops of handling separators in quoted fields and
> such...). Only if that test succeeds, I would then hand a Stream with
> the whole line to the reader and do a #next.
>
> This will, however, mean a lot of extra cycles for large files. Of
> course I could do this only for some lines, maybe just the first one.
> Whatever.
>
> But somehow I have the feeling I should get an exception telling me the
> line is not compatible with the Reader's definition or such. Or
> #readAtEndOrEndOfLine should just walk the line to the end and ignore
> the rest of the line, returning an incomplete object...
>
> Maybe I am just missing the right setting or switch? What best practices
> did you guys come up with for such problems?
>
> Thanks in advance,
>
> Joachim

--------------------------------------------
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France