You aren't sure what point I was making? How about the one I actually wrote down: what test data was NeoCSV benchmarked with, and can I get my hands on it? THAT is the point. The data points I showed (and many others I have not) are not satisfactory to me. I have been searching for CSV test collections. One site offered 6 files, of which only one downloaded. I found a "benchmark suite" for CSV containing no actual CSV files. So where *else* should I look for benchmark data than alongside a parser that people in this community are generally happy with and that is described as "efficient"?
Is it so unreasonable to suspect that my results might be a fluke? Is it bad manners to assume that something described as efficient has tests showing that?

On Wed, 6 Jan 2021 at 22:23, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:

> Richard,
>
> I am not sure what point you are trying to make here.
> You have something cooler and faster? Great, how about sharing?
> You could make a faster one if it doesn't convert numbers and stuff? Great. I guess in 95% of the use cases the time will be spent after parsing anyway. It depends. And that is exactly what you are saying: the word "efficient" means nothing without context. How is that related to this thread?
>
> I think this thread mostly shows the strength of a community, especially when there are members who are active, friendly and highly motivated. My problem got solved at blazing speed without me paying anything for it, just because Sven thought my problem could be other people's problem as well.
>
> I am happy with NeoCSV's speed, even if there may be more lightweight and faster solutions. Tbh, my main concern with NeoCSV is not speed, but how well I can understand problems and fix them. I care about data types on parsing. A non-configurable CSV parser gives me a bunch of dictionaries and Strings. That could be a waste of cycles and memory once you need the data as objects.
> My use case is not importing trillions of records all day, and for a few hundred or maybe sometimes thousands it is good/fast enough.
>
>
> Joachim
>
>
> On 06.01.21 at 05:10, Richard O'Keefe wrote:
>
> NeoCSVReader is described as efficient. What is that in comparison to? What benchmark data are used? Here are benchmark results measured today.
>
> (5,000 data line file, 9,145,009 characters)
> method                  time (ms)
> Just read characters          410
> CSVDecoder>>next             3415   astc's CSV reader (defaults)    1.26 x CSVParser
> NeoCSVReader>>next           4798   NeoCSVReader (default state)    1.78 x CSVParser
> CSVParser>>next              2701   pared-to-the-bone CSV reader    1.00   reference
>
> (10,000 data line file, 1,544,836 characters)
> method                  time (ms)
> Just read characters           93
> CSVDecoder>>next              530   astc's CSV reader (defaults)    1.26 x CSVParser
> NeoCSVReader>>next            737   NeoCSVReader (default state)    1.75 x CSVParser
> CSVParser>>next               421   pared-to-the-bone CSV reader    1.00   reference
>
> CSVParser is just 78 lines and is not customisable. It really is stripped to pretty much an absolute minimum. All of the parsers were configured (where that made sense) to return an Array of Strings.
> Many of the CSV files I've worked with use short records instead of ending a line with a lot of commas. Some of them also have the occasional stray comment off to the right, not mentioned in the header. I've also found it necessary to skip multiple lines at the beginning and/or end. (Really, some government agencies seem to have NO idea that anyone might want to do more with a CSV file than eyeball it in Excel.)
>
> If there is a benchmark suite I can use to improve CSVDecoder, I would like to try it out.
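[For reference, one way to reproduce this kind of timing in a Pharo image is sketched below. It is only an illustration, not the harness behind the numbers above; it assumes NeoCSV is loaded and that a file named 'sample.csv' sits next to the image. A NeoCSVReader in its default state answers each record as an Array of Strings.]

    | file ms records |
    file := 'sample.csv' asFileReference.
    ms := Time millisecondsToRun: [
        file readStreamDo: [ :stream |
            records := (NeoCSVReader on: stream) upToEnd ] ].
    Transcript
        show: records size printString , ' records parsed in ' , ms printString , ' ms';
        cr.

[A fair comparison would repeat each run several times over the same file and take the best or median time, since the first run also pays for warming up the image and the file cache.]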
> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:
>
>> Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...
>>
>> I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.
>>
>> One effect is that reading a Stream with #upToEnd leads to an endless loop; the other is that the Reader produces twice as many objects as there are lines in the file being read.
>>
>> In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.
>>
>> Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?
>>
>> Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our application. The CSV files they upload to our app come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.
>>
>> You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.
>>
>> I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...
>>
>> I *guess* this is due to the way #readEndOfLine is implemented. It seems not to peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)
>>
>> So I wonder if there are any tried approaches to this problem.
>>
>> One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check whether the number of separators in the line matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds would I then hand a Stream with the whole line to the reader and do a #next.
>>
>> This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.
>>
>> But somehow I have the feeling I should get an exception telling me the line is not compatible with the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of it, returning an incomplete object....
>>
>> Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?
>>
>> Thanks in advance,
>>
>> Joachim
>>
>
> --
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telephone: +49 7141 56 10 86 0       Fax: +49 7141 56 10 86 1
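[Below is a minimal sketch of the doubling effect Joachim describes above: a 4-column CSV read with only three field definitions. It assumes NeoCSV is loaded and that the unary #addField variant (which builds indexable Array records) is available; the resulting count is only what the thread reports, and behaviour may differ in other NeoCSV versions.]

    | input reader records |
    input := 'h1,h2,h3,h4' , String cr ,
             '1,2,3,4' , String cr ,
             '5,6,7,8' , String cr ,
             '9,10,11,12'.
    reader := NeoCSVReader on: input readStream.
    reader
        skipHeader;
        addField;
        addField;
        addField.    "one field definition too few for a 4-column file"
    records := reader upToEnd.
    records size     "3 data lines; the thread reports 6 records here"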
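[And here is a rough sketch of the per-line pre-check Joachim mentions: read each line with #nextLine, compare its separator count against the number of configured fieldAccessors minus one, and only then hand the line to a reader. As he notes himself, this naive check does not handle separators inside quoted fields; the expected column count is written out by hand here rather than taken from the reader.]

    | raw expectedColumns stream good skipped records |
    expectedColumns := 4.    "the number of fieldAccessors the reader is configured with"
    raw := '1,2,3,4' , String cr ,
           '5,6,7' , String cr ,        "this short line will fail the check"
           '9,10,11,12'.
    good := OrderedCollection new.
    skipped := OrderedCollection new.
    stream := raw readStream.
    [ stream atEnd ] whileFalse: [ | line |
        line := stream nextLine.
        "naive separator count; quoted fields are not considered"
        (line occurrencesOf: $,) = (expectedColumns - 1)
            ifTrue: [ good add: line ]
            ifFalse: [ skipped add: line ] ].
    records := good collect: [ :line |
        (NeoCSVReader on: line readStream) next ].
    { records . skipped }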