> On 7 Jan 2021, at 07:15, Richard O'Keefe <rao...@gmail.com> wrote:
>
> You aren't sure what point I was making?
> How about the one I actually wrote down:
> What test data was NeoCSV benchmarked with
> and can I get my hands on it?
> THAT is the point. The data points I showed (and
> many others I have not shown) do not satisfy me.
> I have been searching for CSV test collections.
> One site offered 6 files of which only one downloaded.
> I found a "benchmark suite" for CSV containing no
> actual CSV files.
> So where *else* should I look for benchmark data than
> the data associated with a parser that people in this
> community are generally happy with and that is described
> as "efficient"?
Did you actually read my email and look at the code?
NeoCSVBenchmark generates its own test data.
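
The data itself is easy to reproduce if you want comparable input without
hunting for files. Something along these lines (just a sketch, not the
actual NeoCSVBenchmark code; the field layout is made up) builds an
N-line CSV string in memory:

   | lineCount out |
   lineCount := 10000.
   out := WriteStream on: (String new: lineCount * 40).
   1 to: lineCount do: [ :i |
       out
           print: i; nextPut: $,;
           nextPutAll: 'label-'; print: i; nextPut: $,;
           print: i * 100; nextPut: $,;
           nextPutAll: '"quoted, value"';
           nextPutAll: String crlf ].
   out contents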
> Is it so unreasonable to suspect that my results might
> be a fluke? Is it bad manners to assume that something
> described as efficient has tests showing that?
>
>
>
> On Wed, 6 Jan 2021 at 22:23, jtuc...@objektfabrik.de
> <jtuc...@objektfabrik.de> wrote:
> Richard,
>
> I am not sure what point you are trying to make here.
> You have something cooler and faster? Great, how about sharing?
> You could make a faster one if it doesn't convert numbers and such? Great.
> I guess that in 95% of the use cases, most of the time is spent after parsing
> anyway. It depends. And that is exactly what you are saying: the word
> "efficient" means nothing without context. How is that related to this thread?
>
> I think this thread mostly shows the strength of a community, especially when
> there are members who are active, friendly and highly motivated. My problem
> got solved at blazing speed without me paying anything for it, just because
> Sven thought my problem could be other people's problem as well.
>
> I am happy with NeoCSV's speed, even if there may be more lightweight and
> faster solutions. To be honest, my main concern with NeoCSV is not speed, but
> how well I can understand problems and fix them. I care about data types on
> parsing. A non-configurable CSV parser gives me a bunch of dictionaries and
> Strings, which can be a waste of cycles and memory once you need the data as
> objects. My use case is not importing trillions of records all day, and for a
> few hundred or sometimes a few thousand records, it is good and fast enough.
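>
> For illustration, something along these lines (just a sketch; Invoice and
> its setters are placeholders for our real domain classes) is what I mean
> by caring about data types on parsing:
>
>    (NeoCSVReader on: 'R-1,4200,true' readStream)
>        recordClass: Invoice;
>        addField: #number:;
>        addIntegerField: #amountInCents:;
>        addField: #paid: converter: [ :s | s = 'true' ];
>        upToEnd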
>
>
> Joachim
>
> On 06.01.21 at 05:10, Richard O'Keefe wrote:
>> NeoCSVReader is described as efficient. What is that
>> in comparison to? What benchmark data are used?
>> Here are benchmark results measured today.
>> (5,000 data line file, 9,145,009 characters).
>> method                 time (ms)   notes
>> Just read characters         410
>> CSVDecoder>>next            3415   astc's CSV reader (defaults), 1.26 x CSVParser
>> NeoCSVReader>>next          4798   NeoCSVReader (default state), 1.78 x CSVParser
>> CSVParser>>next             2701   pared-to-the-bone CSV reader, 1.00 x (reference)
>>
>> (10,000 data line file, 1,544,836 characters).
>> method                 time (ms)   notes
>> Just read characters          93
>> CSVDecoder>>next             530   astc's CSV reader (defaults), 1.26 x CSVParser
>> NeoCSVReader>>next           737   NeoCSVReader (default state), 1.75 x CSVParser
>> CSVParser>>next              421   pared-to-the-bone CSV reader, 1.00 x (reference)
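>>
>> For anyone who wants to reproduce this kind of measurement, a trivial
>> harness along these lines will do (just a sketch; 'input.csv' stands for
>> whichever test file is used):
>>
>>    | data charMs csvMs |
>>    data := 'input.csv' asFileReference contents.
>>    charMs := Time millisecondsToRun: [
>>        | in |
>>        in := data readStream.
>>        [ in atEnd ] whileFalse: [ in next ] ].
>>    csvMs := Time millisecondsToRun: [
>>        (NeoCSVReader on: data readStream) upToEnd ].
>>    Transcript
>>        show: 'chars: ', charMs printString, ' ms, NeoCSVReader: ',
>>            csvMs printString, ' ms';
>>        cr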
>>
>> CSVParser is just 78 lines and is not customisable. It really is
>> stripped to pretty much an absolute minimum. All of the parsers
>> were configured (if that made sense) to return an Array of Strings.
>> Many of the CSV files I've worked with use short records instead
>> of ending a line with a lot of commas. Some of them also have the
>> occasional stray comment off to the right, not mentioned in the header.
>> I've also found it necessary to skip multiple lines at the beginning
>> and/or end. (Really, some government agencies seem to have NO idea
>> that anyone might want to do more with a CSV file than eyeball it in
>> Excel.)
>>
>> If there is a benchmark suite I can use to improve CSVDecoder,
>> I would like to try it out.
>>
>> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de
>> <jtuc...@objektfabrik.de> wrote:
>> Happy new year to all of you! May 2021 be an increasingly less crazy
>> year than 2020...
>>
>>
>> I have a question that sounds a bit strange, but we have two effects
>> with NeoCSVReader related to wrong definitions of the reader.
>>
>> One effect is that reading a Stream with #upToEnd leads to an endless loop;
>> the other is that the Reader produces twice as many objects as there are
>> lines in the file being read.
>>
>> In both scenarios, the reason is that the CSV Reader has the wrong number
>> of column definitions.
>>
>> Of course that is my fault: why do I feed a "malformed" CSV file to poor
>> NeoCSVReader?
>>
>> Let me explain: we have a few import interfaces which end users can
>> define using a more or less nice assistant in our Application. The CSV
>> files they upload to our App come from third parties like payment
>> providers, banks and other sources. These change their file structures
>> whenever they feel like it and never tell anybody. So a CSV import that
>> may have been working for years may one day tear a whole web server
>> image down because of a wrong number of fieldAccessors. This is bad on
>> many levels.
>>
>> You can easily try the doubling effect at home: define a working CSV
>> Reader and comment out one of the addField: sends before you use the
>> NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
>> columns each. If you remove one of the fieldAccessors, #upToEnd will
>> yield an Array of 6 objects rather than 3.
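>>
>> A concrete sketch (Contact and its setters #name: #email: #phone: are
>> made up for illustration):
>>
>>    | input reader |
>>    input := String crlf join: #('n1,e1,p1,x1' 'n2,e2,p2,x2' 'n3,e3,p3,x3').
>>    reader := (NeoCSVReader on: input readStream)
>>        recordClass: Contact;
>>        addField: #name:;
>>        addField: #email:;
>>        addField: #phone:;
>>        yourself.
>>    "4 columns but only 3 addField: definitions: as described above,
>>     #upToEnd answers 6 objects instead of 3"
>>    reader upToEnd size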
>>
>> I haven't found the reason for the cases where this leads to an endless
>> loop, but at least this one is clear...
>>
>> I *guess* this is due to the way #readEndOfLine is implemented. It seems
>> not to peek forward to the end of the line. I have a gut feeling that
>> #peekChar should peek instead of reading the #next character from the
>> input Stream, but #peekChar has too many senders to just go ahead and
>> mess with it ;-)
>>
>> So I wonder if there are any tried approaches to this problem.
>>
>> One thing I might do is not use #upToEnd, but read each line using
>> PositionableStream>>#nextLine and first check, for each line, whether the
>> number of separators matches the number of fieldAccessors minus 1 (and go
>> through the hoops of handling separators in quoted fields and such...).
>> Only if that test succeeds would I hand a Stream with the whole line to
>> the reader and do a #next.
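>>
>> Roughly along these lines (only a sketch: it ignores separators inside
>> quoted fields, and csvStream / #configuredReaderOn: are placeholders for
>> the upload stream and for however the NeoCSVReader gets built from the
>> user's import definition):
>>
>>    | expectedColumns records |
>>    expectedColumns := 4.
>>    records := csvStream upToEnd lines collect: [ :line |
>>        (line occurrencesOf: $,) = (expectedColumns - 1)
>>            ifTrue: [ (self configuredReaderOn: line readStream) next ]
>>            ifFalse: [ self error: 'unexpected column count in: ', line ] ].
>>    records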
>>
>> This will, however, mean a lot of extra cycles for large files. Of
>> course I could do this only for some lines, maybe just the first one.
>> Whatever.
>>
>>
>> But somehow I have the feeling I should get an exception telling me the
>> line is not compatible with the Reader's definition, or something like that.
>> Or #readAtEndOrEndOfLine should just walk to the end of the line, ignore
>> the rest of it, and return an incomplete object...
>>
>>
>> Maybe I am just missing the right setting or switch? What best practices
>> did you guys come up with for such problems?
>>
>>
>> Thanks in advance,
>>
>>
>> Joachim
>>
>
>
> --
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1