Hi Joachim,

Have a look at the following commit:
https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically at the added unit tests. These should help clarify the new behaviour. If anything is not clear, please ask.

HTH,

Sven

> On 5 Jan 2021, at 08:49, jtuc...@objektfabrik.de wrote:
>
> Sven,
>
> first of all, thanks a lot for taking your time with this!
>
> Your test case is so beautifully small I can't believe it ;-)
>
> While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote that you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I mostly agree. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or to some object that will be the owner of the Collection you are just importing, so a lot of validation must be done after parsing anyway.
>
> So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.
>
> A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.
>
> I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit the file structure, it should tell the user/developer about it, or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.
>
> I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside application can completely block an image that may be serving a few other users, just by using a CSV parser that doesn't fit the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but I cannot prove it yet. I do have a file and parser that reliably go into an endless loop; I just need to isolate the bare CSV parsing from the whole machinery we've built around the NeoCSV reader for these user-defined mappings... I wouldn't be surprised if the problem is buried somewhere in our preparations for building a parser from user-defined data... I will report my progress here, I promise!
>
> One question I keep thinking about in NeoCSV: you implemented a method called #peekChar, but it doesn't #peek. It buffers a character and reads the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that, at least in #readEndOfLine, the next character should be peeked instead of consumed, in order to find out whether it is payload data or part of the CR/LF line ending... Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? OTOH, at some point the reader has to skip to the next line, so I am not sure peek has any place here... I need to debug a little more to understand...
>
> Joachim
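(As background for the #peek vs. #next question: a minimal Playground sketch, independent of NeoCSV's internals. #peek answers the next character without advancing the stream position, while #next consumes it; a loop that only ever peeks therefore never makes progress, which matches the 100% CPU observation above. This assumes nothing beyond plain Pharo streams:)

| stream |
stream := ReadStream on: 'ab'.
stream peek.	"$a - the position does not move"
stream peek.	"$a - still the same character"
stream next.	"$a - now the character is consumed"
stream peek.	"$b"

"Testing with #peek but never calling #next spins forever:"
"[ stream peek = $b ] whileTrue: [ ]."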
> On 04.01.21 at 20:57, Sven Van Caekenberghe wrote:
>> Hi Joachim,
>>
>> Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit.
>>
>> This is what I came up with in a Workspace/Playground:
>>
>> input := 'foo,1
>> bar,2
>> foobar,3'.
>>
>> (NeoCSVReader on: input readStream) upToEnd.
>> (NeoCSVReader on: input readStream) addField; upToEnd.
>> (NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.
>>
>> (NeoCSVReader on: input readStream)
>> 	recordClass: Dictionary;
>> 	addField: [ :obj :str | obj at: #one put: str ];
>> 	upToEnd.
>> (NeoCSVReader on: input readStream)
>> 	recordClass: Dictionary;
>> 	addField: [ :obj :str | obj at: #one put: str ];
>> 	addField: [ :obj :str | obj at: #two put: str ];
>> 	addField: [ :obj :str | obj at: #three put: str ];
>> 	upToEnd.
>> (NeoCSVReader on: input readStream)
>> 	recordClass: Dictionary;
>> 	emptyFieldValue: #passNil;
>> 	addField: [ :obj :str | obj at: #one put: str ];
>> 	addField: [ :obj :str | obj at: #two put: str ];
>> 	addField: [ :obj :str | obj at: #three put: str ];
>> 	upToEnd.
>>
>> In my opinion there are two distinct issues:
>>
>> 1. What to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).
>>
>> It is clear that the underflow case is wrong and a bug that has to be fixed. The overflow case seems OK (resulting in nil fields).
>>
>> 2. Validating the input (a functionality not yet present).
>>
>> This would basically mean signalling an error in the underflow or overflow case. But wrong type conversions should be errors too.
>>
>> I understand that you want to validate foreign input.
>>
>> It is a pity that you cannot produce an infinite loop example; that would also be useful.
>>
>> That's it for now, I will come back to you.
>>
>> Regards,
>>
>> Sven
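(Until such validation exists, a field-count mismatch can at least be detected from the outside. A hedged sketch, assuming the reader's stream-like protocol (#next/#atEnd); the expected count of 2 and the error message are illustrative, not NeoCSV API:)

| input reader records |
input := 'foo,1
bar,2
foobar,3'.
reader := NeoCSVReader on: input readStream.
records := OrderedCollection new.
[ reader atEnd ] whileFalse: [
	| record |
	record := reader next. "with no field definitions, an Array of strings"
	record size = 2
		ifFalse: [ Error signal: 'Unexpected field count: ' , record size printString ].
	records add: record ].
records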
>>> On 4 Jan 2021, at 14:46, jtuc...@objektfabrik.de wrote:
>>>
>>> Please find attached a small test case to demonstrate what I mean. There is just some nonsense business object class and a simple test case in this fileout.
>>>
>>> On 04.01.21 at 14:36, jtuc...@objektfabrik.de wrote:
>>>> Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...
>>>>
>>>> I have a question that sounds a bit strange, but we see two effects with NeoCSVReader related to wrong definitions of the reader.
>>>>
>>>> One effect is that reading a Stream with #upToEnd leads to an endless loop; the other is that the reader produces twice as many objects as there are lines in the file being read.
>>>>
>>>> In both scenarios, the reason is that the CSV reader has the wrong number of column definitions.
>>>>
>>>> Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?
>>>>
>>>> Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our application. The CSV files they upload to our app come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years can one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.
>>>>
>>>> You can easily try the doubling effect at home: define a working CSV reader and comment out one of the addField: sends before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, #upToEnd will yield an Array of 6 objects rather than 3.
>>>>
>>>> I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...
>>>>
>>>> I *guess* this is due to the way #readEndOfLine is implemented. It seems not to peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input stream, but #peekChar has too many senders to just go ahead and mess with it ;-)
>>>>
>>>> So I wonder whether there are any tried approaches to this problem.
>>>>
>>>> One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check whether the number of separators in the line matches the number of fieldAccessors minus 1 (going through the hoops of handling separators in quoted fields and such...). Only if that test succeeds would I hand a stream on the whole line to the reader and do a #next.
>>>>
>>>> This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.
>>>>
>>>> But somehow I have the feeling I should get an exception telling me the line is not compatible with the reader's definition, or some such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of it, returning an incomplete object...
>>>>
>>>> Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Joachim
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
>>> Fliederweg 1                         http://www.objektfabrik.de
>>> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
>>> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
>>>
>>> <NeoCSVEndlessLoopTest.st>
>
> --
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
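(A rough sketch of the line pre-check described above: count the separators that occur outside quoted fields before handing the line to the reader. This is purely illustrative; it assumes comma as the separator, handles quoting naively, and is not part of NeoCSV:)

| countSeparators line |
countSeparators := [ :aLine |
	| inQuotes count |
	inQuotes := false.
	count := 0.
	aLine do: [ :each |
		each = $" ifTrue: [ inQuotes := inQuotes not ].
		(each = $, and: [ inQuotes not ]) ifTrue: [ count := count + 1 ] ].
	count ].
line := 'foo,"a,b",3'.
countSeparators value: line. "2 - the comma inside the quoted field is not counted"

A line obtained with #nextLine could then be rejected up front whenever the count does not equal the number of fieldAccessors minus 1, trading one extra scan per line for a clean early failure instead of an endless loop or doubled records.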