[Pharo-users] Re: NeoCSVReader and wrong number of fieldAccessors

Sven Van Caekenberghe Wed, 06 Jan 2021 02:44:06 -0800

Joachim,

> On 6 Jan 2021, at 11:21, jtuc...@objektfabrik.de wrote:
> 
> Hi Sven,
> 
> 
> I must say I am really happy with your change. We get a nice exception 
> whenever the number of fieldAccessor doesn't match with the number of defined 
> fieldAccessors. So far it also seems the endless loops are gone as well. What 
> a leap forward!


Thank you for your kind words.

But thank you as well: it really helps to get constructive feedback from actual 
users, to improve the code for everyone.

> I'm adding an issue on github about the conversion errors, I hope that is a 
> convenient place for such comments/ideas?

Did you see NeoCSVReaderTests>>#testConversionErrors ?

It is not perfect, but you do get an error when a number conversion fails, you 
could make your own conversions fail similarly.

 (NeoCSVReader on: 'a' readStream) addIntegerField; upToEnd.

Like you said: some validation can be done at the CSV level, but certainly not 
everything.

Sven

> Joachim
> 
> 
> 
> 
> 
> 
> Am 05.01.21 um 21:06 schrieb jtuc...@objektfabrik.de:
>> Sven,
>> 
>> 
>> I tested your change with the file and filter (our own way of defining csv 
>> mappings by the end users) which used to send our application into an 
>> endless loop.
>> 
>> And voila: we get an exception instead of a frozen image! I will give the 
>> conversion errors a test drive tomorrow.
>> 
>> I am absolutely happy with your change. Thank you very much.
>> 
>> 
>> Joachim
>> 
>> 
>> P.S: I even learned a little bit about Iceberg. I am not really sure each of 
>> my mouse clicks made sense, but I had your commit in the image and could 
>> test it and port the deltas over to my Smalltalk dialect...
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Am 05.01.21 um 19:52 schrieb jtuc...@objektfabrik.de:
>>> Hi Sven,
>>> 
>>> 
>>> all I can say is: wow. I have no words.
>>> 
>>> I will have to learn a bit about Pharo and github real quick now in order 
>>> to try your changes....
>>> 
>>> Thank you very much. I'll give you feedback as fast as I can.
>>> 
>>> (And forget my questions about #readAtEndOrEndOfLine. I somhow didn't 
>>> understand it is expected to return a Boolean. Not sure why. I thought of 
>>> 'read' as a command, not a question in simple past..., so I thought its job 
>>> should be to read the rest of the line if we're not there yet)
>>> 
>>> 
>>> Joachim
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe:
>>>> Hi Joachim,
>>>> 
>>>> Have a look at the following commit:
>>>> 
>>>> https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39
>>>>  
>>>> 
>>>> and specifically the added unit tests. These should help clarify the new 
>>>> behaviour.
>>>> 
>>>> If anything is not clear, please ask.
>>>> 
>>>> HTH,
>>>> 
>>>> Sven
>>>> 
>>>>> On 5 Jan 2021, at 08:49, jtuc...@objektfabrik.de wrote:
>>>>> 
>>>>> Sven,
>>>>> 
>>>>> first of all thanks a lot for taking your time with this!
>>>>> 
>>>>> Your test case is so beautifully small I can't believe it ;-)
>>>>> 
>>>>> While I think some kind of validation could help with parsing CSV, I 
>>>>> remember reading your comment on this in some other discussion long ago. 
>>>>> You wrote you don't see it as a responsibility of a parser and that you 
>>>>> wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. 
>>>>> Whatever you do at parsing can only cover part of the problems related to 
>>>>> validation. There will be checks that require access to other fields from 
>>>>> the same line, or some object that will be the owner of the Collection 
>>>>> that you are just importing, so a lot of validation must be done after 
>>>>> parsing anyways.
>>>>> 
>>>>> So I think we can mostly ignore the validation part. Whatever a reader 
>>>>> will do, it will not be good enough.
>>>>> 
>>>>> A nice way of exposing conversion errors for fields created with 
>>>>> #addField:converter: would help a lot, however.
>>>>> 
>>>>> I am glad you agree on the underflow bug. This is more a question of 
>>>>> well-formedness than of validation. If a reader finds out it doesn't fit 
>>>>> for a file structure, it should tell the user/developer about it or at 
>>>>> least gracefully return some more or less incomplete object resembling 
>>>>> what it could parse. But it shouldn't cross line borders and return a 
>>>>> wrong number of objects.
>>>>> 
>>>>> 
>>>>> I will definitely continue my hunt for the endless loop. It is not an 
>>>>> ideal situation if one user of our Seaside Application completely blocks 
>>>>> an image that may be serving a few other users by just using a CVS parser 
>>>>> that doesn't fit with the file. I suspect this has to do with 
>>>>> #readEndOfLine in some special case of the underflow bug, but cannot 
>>>>> prove it yet. But I have a file and parser that reliably goes into an 
>>>>> endless loop. I just need to isolate the bare CSV parsing from the whole 
>>>>> machinery we've build around NeoCSV reader for these user-defined 
>>>>> mappings... I wouldn't be surprised if it is a problem buried somewhere 
>>>>> in our preparations in building a parser from user-defined data... I will 
>>>>> report my progress here, I promise!
>>>>> 
>>>>> 
>>>>> One question I keep thinking about in NeoCSV: You implemented a method 
>>>>> called #peekChar, but it doesn't #peek. It buffers a character and does 
>>>>> read the #next character. I tried replacing the #next with #peek, but 
>>>>> that is definitely a shortcut to 100% CPU, because #peekChar is used a 
>>>>> lot, not only for consuming an "unmapped remainder" of a line... I 
>>>>> somehow have the feeling that at least in #readEndOfLine the next char 
>>>>> should bee peeked instead of consumed in order to find out if it's 
>>>>> workload or part of the crlf/lf...
>>>>> Shouldn't a reader step forward by using #peek to see whether there is 
>>>>> more data after all fieldAccessors have been applied to the line (see 
>>>>> #readNextRecordAsObject)? Otoh, at one point the reader has to skip to 
>>>>> the next line, so I am not sure if peek has any place here... I need to 
>>>>> debug a little more to understand...
>>>>> 
>>>>> 
>>>>> 
>>>>> Joachim
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:
>>>>>> Hi Joachim,
>>>>>> 
>>>>>> Thanks for the detailed feedback. This is most helpful. I need to think 
>>>>>> more about this and experiment a bit. This is what I came up with in a 
>>>>>> Workspace/Playground:
>>>>>> 
>>>>>> input := 'foo,1
>>>>>> bar,2
>>>>>> foobar,3'.
>>>>>> 
>>>>>> (NeoCSVReader on: input readStream) upToEnd.
>>>>>> (NeoCSVReader on: input readStream) addField; upToEnd.
>>>>>> (NeoCSVReader on: input readStream) addField; addField; addField; 
>>>>>> upToEnd.
>>>>>> 
>>>>>> (NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ 
>>>>>> :obj :str | obj at: #one put: str]; upToEnd.
>>>>>> (NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ 
>>>>>> :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two 
>>>>>> put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd.
>>>>>> (NeoCSVReader on: input readStream) recordClass: Dictionary; 
>>>>>> emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: 
>>>>>> str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj 
>>>>>> :str | obj at: #three put: str]; upToEnd.
>>>>>> 
>>>>>> In my opinion there are two distinct issues:
>>>>>> 
>>>>>> 1. what to do when you define a specific number of fields to be read and 
>>>>>> there are not enough of them in the input (underflow), or there are too 
>>>>>> many of them in the input (overflow).
>>>>>> 
>>>>>> it is clear that the underflow case is wrong and a bug that has to be 
>>>>>> fixed.
>>>>>> the overflow case seems OK (resulting in nil fields)
>>>>>> 
>>>>>> 2. to validate the input (a functionality not yet present)
>>>>>> 
>>>>>> this would basically mean to signal an error in the under or overflow 
>>>>>> case.
>>>>>> but wrong type conversions should be errors too.
>>>>>> 
>>>>>> I understand that you want to validate foreign input.
>>>>>> 
>>>>>> It is a pity that you cannot produce an infinite loop example, that 
>>>>>> would also be useful.
>>>>>> 
>>>>>> That's it for now, I will come back to you.
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Sven
>>>>>> 
>>>>>>> On 4 Jan 2021, at 14:46, jtuc...@objektfabrik.de wrote:
>>>>>>> 
>>>>>>> Please find attached a small test case to demonstrate what I mean. 
>>>>>>> There is just some nonsense Business Object class and a simple test 
>>>>>>> case in this fileout.
>>>>>>> 
>>>>>>> 
>>>>>>> Am 04.01.21 um 14:36 schrieb jtuc...@objektfabrik.de:
>>>>>>>> Happy new year to all of you! May 2021 be an increasingly less crazy 
>>>>>>>> year than 2020...
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I have a question that sounds a bit strange, but we have two effects 
>>>>>>>> with NeoCSVReader related to wrong definitions of the reader.
>>>>>>>> 
>>>>>>>> One effect is that reading a Stream #upToEnd leads to an endless loop, 
>>>>>>>> the other is that the Reader produces twice as many objects as there 
>>>>>>>> are lines in the file that is being read.
>>>>>>>> 
>>>>>>>> In both scenarios, the reason is that the CSV Reader has a wrong 
>>>>>>>> number of column definitions.
>>>>>>>> 
>>>>>>>> Of course that is my fault: why do I feed a "malformed" CSV file to 
>>>>>>>> poor NeoCSVReader?
>>>>>>>> 
>>>>>>>> Let me explain: we have a few import interfaces which end users can 
>>>>>>>> define using a more or less nice assistant in our Application. The CSV 
>>>>>>>> files they upload to our App come from third parties like payment 
>>>>>>>> providers, banks and other sources. These change their file structures 
>>>>>>>> whenever they feel like it and never tell anybody. So a CSV import 
>>>>>>>> that may have been working for years may one day tear a whole web 
>>>>>>>> server image down because of a wrong number of fieldAccessors. This is 
>>>>>>>> bad on many levels.
>>>>>>>> 
>>>>>>>> You can easily try the doubling effect at home: define a working CSV 
>>>>>>>> Reader and comment out one of the addField: commands before you use 
>>>>>>>> the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines 
>>>>>>>> with 4 columns each. If you remove one of the fieldAccessors, an 
>>>>>>>> #upToEnd will yoield an Array of 6 objects rather than 3.
>>>>>>>> 
>>>>>>>> I haven't found the reason for the cases where this leads to an 
>>>>>>>> endless loop, but at least this one is clear...
>>>>>>>> 
>>>>>>>> I *guess* this is due to the way #readEndOfLine is implemented. It 
>>>>>>>> seems to not peek forward to the end of the line. I have the gut 
>>>>>>>> feeling #peekChar should peek instead of reading the #next character 
>>>>>>>> form the input Stream, but #peekChar has too many senders to just go 
>>>>>>>> ahead and mess with it ;-)
>>>>>>>> 
>>>>>>>> So I wonder if there are any tried approaches to this problem.
>>>>>>>> 
>>>>>>>> One thing I might do is not use #upToEnd, but read each line using 
>>>>>>>> PositionableStream>>#nextLine and first check each line if the number 
>>>>>>>> of separators matches the number of fieldAccessors minus 1 (and go 
>>>>>>>> through the hoops of handling separators in quoted fields and 
>>>>>>>> such...). Only if that test succeeds, I would then hand a Stream with 
>>>>>>>> the whole line to the reader and do a #next.
>>>>>>>> 
>>>>>>>> This will, however, mean a lot of extra cycles for large files. Of 
>>>>>>>> course I could do this only for some lines, maybe just the first one. 
>>>>>>>> Whatever.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> But somehow I have the feeling I should get an exception telling me 
>>>>>>>> the line is not compatible to the Reader's definition or such. Or 
>>>>>>>> #readAtEndOrEndOfLine should just walk the line to the end and ignore 
>>>>>>>> the rest of the line, returnong an incomplete object....
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Maybe I am just missing the right setting or switch? What best 
>>>>>>>> practices did you guys come up with for such problems?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks in advance,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Joachim
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> -- 
>>>>>>> ----------------------------------------------------------------------- 
>>>>>>> Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de
>>>>>>> Fliederweg 1 http://www.objektfabrik.de
>>>>>>> D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
>>>>>>> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
>>>>>>> 
>>>>>>> 
>>>>>>> <NeoCSVEndlessLoopTest.st>
>>>>> 
>>>>> -- 
>>>>> ----------------------------------------------------------------------- 
>>>>> Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de
>>>>> Fliederweg 1 http://www.objektfabrik.de
>>>>> D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
>>>>> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
>>>>> 
>>> 
>> 
> 
> -- 
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
>

[Pharo-users] Re: NeoCSVReader and wrong number of fieldAccessors

Reply via email to