Joachim, > On 6 Jan 2021, at 11:21, jtuc...@objektfabrik.de wrote: > > Hi Sven, > > > I must say I am really happy with your change. We get a nice exception > whenever the number of fieldAccessor doesn't match with the number of defined > fieldAccessors. So far it also seems the endless loops are gone as well. What > a leap forward!
Thank you for your kind words. But thank you as well: it really helps to get constructive feedback from actual users, to improve the code for everyone. > I'm adding an issue on github about the conversion errors, I hope that is a > convenient place for such comments/ideas? Did you see NeoCSVReaderTests>>#testConversionErrors ? It is not perfect, but you do get an error when a number conversion fails, you could make your own conversions fail similarly. (NeoCSVReader on: 'a' readStream) addIntegerField; upToEnd. Like you said: some validation can be done at the CSV level, but certainly not everything. Sven > Joachim > > > > > > > Am 05.01.21 um 21:06 schrieb jtuc...@objektfabrik.de: >> Sven, >> >> >> I tested your change with the file and filter (our own way of defining csv >> mappings by the end users) which used to send our application into an >> endless loop. >> >> And voila: we get an exception instead of a frozen image! I will give the >> conversion errors a test drive tomorrow. >> >> I am absolutely happy with your change. Thank you very much. >> >> >> Joachim >> >> >> P.S: I even learned a little bit about Iceberg. I am not really sure each of >> my mouse clicks made sense, but I had your commit in the image and could >> test it and port the deltas over to my Smalltalk dialect... >> >> >> >> >> >> >> >> Am 05.01.21 um 19:52 schrieb jtuc...@objektfabrik.de: >>> Hi Sven, >>> >>> >>> all I can say is: wow. I have no words. >>> >>> I will have to learn a bit about Pharo and github real quick now in order >>> to try your changes.... >>> >>> Thank you very much. I'll give you feedback as fast as I can. >>> >>> (And forget my questions about #readAtEndOrEndOfLine. I somhow didn't >>> understand it is expected to return a Boolean. Not sure why. I thought of >>> 'read' as a command, not a question in simple past..., so I thought its job >>> should be to read the rest of the line if we're not there yet) >>> >>> >>> Joachim >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe: >>>> Hi Joachim, >>>> >>>> Have a look at the following commit: >>>> >>>> https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39 >>>> >>>> >>>> and specifically the added unit tests. These should help clarify the new >>>> behaviour. >>>> >>>> If anything is not clear, please ask. >>>> >>>> HTH, >>>> >>>> Sven >>>> >>>>> On 5 Jan 2021, at 08:49, jtuc...@objektfabrik.de wrote: >>>>> >>>>> Sven, >>>>> >>>>> first of all thanks a lot for taking your time with this! >>>>> >>>>> Your test case is so beautifully small I can't believe it ;-) >>>>> >>>>> While I think some kind of validation could help with parsing CSV, I >>>>> remember reading your comment on this in some other discussion long ago. >>>>> You wrote you don't see it as a responsibility of a parser and that you >>>>> wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. >>>>> Whatever you do at parsing can only cover part of the problems related to >>>>> validation. There will be checks that require access to other fields from >>>>> the same line, or some object that will be the owner of the Collection >>>>> that you are just importing, so a lot of validation must be done after >>>>> parsing anyways. >>>>> >>>>> So I think we can mostly ignore the validation part. Whatever a reader >>>>> will do, it will not be good enough. >>>>> >>>>> A nice way of exposing conversion errors for fields created with >>>>> #addField:converter: would help a lot, however. >>>>> >>>>> I am glad you agree on the underflow bug. This is more a question of >>>>> well-formedness than of validation. If a reader finds out it doesn't fit >>>>> for a file structure, it should tell the user/developer about it or at >>>>> least gracefully return some more or less incomplete object resembling >>>>> what it could parse. But it shouldn't cross line borders and return a >>>>> wrong number of objects. >>>>> >>>>> >>>>> I will definitely continue my hunt for the endless loop. It is not an >>>>> ideal situation if one user of our Seaside Application completely blocks >>>>> an image that may be serving a few other users by just using a CVS parser >>>>> that doesn't fit with the file. I suspect this has to do with >>>>> #readEndOfLine in some special case of the underflow bug, but cannot >>>>> prove it yet. But I have a file and parser that reliably goes into an >>>>> endless loop. I just need to isolate the bare CSV parsing from the whole >>>>> machinery we've build around NeoCSV reader for these user-defined >>>>> mappings... I wouldn't be surprised if it is a problem buried somewhere >>>>> in our preparations in building a parser from user-defined data... I will >>>>> report my progress here, I promise! >>>>> >>>>> >>>>> One question I keep thinking about in NeoCSV: You implemented a method >>>>> called #peekChar, but it doesn't #peek. It buffers a character and does >>>>> read the #next character. I tried replacing the #next with #peek, but >>>>> that is definitely a shortcut to 100% CPU, because #peekChar is used a >>>>> lot, not only for consuming an "unmapped remainder" of a line... I >>>>> somehow have the feeling that at least in #readEndOfLine the next char >>>>> should bee peeked instead of consumed in order to find out if it's >>>>> workload or part of the crlf/lf... >>>>> Shouldn't a reader step forward by using #peek to see whether there is >>>>> more data after all fieldAccessors have been applied to the line (see >>>>> #readNextRecordAsObject)? Otoh, at one point the reader has to skip to >>>>> the next line, so I am not sure if peek has any place here... I need to >>>>> debug a little more to understand... >>>>> >>>>> >>>>> >>>>> Joachim >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe: >>>>>> Hi Joachim, >>>>>> >>>>>> Thanks for the detailed feedback. This is most helpful. I need to think >>>>>> more about this and experiment a bit. This is what I came up with in a >>>>>> Workspace/Playground: >>>>>> >>>>>> input := 'foo,1 >>>>>> bar,2 >>>>>> foobar,3'. >>>>>> >>>>>> (NeoCSVReader on: input readStream) upToEnd. >>>>>> (NeoCSVReader on: input readStream) addField; upToEnd. >>>>>> (NeoCSVReader on: input readStream) addField; addField; addField; >>>>>> upToEnd. >>>>>> >>>>>> (NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ >>>>>> :obj :str | obj at: #one put: str]; upToEnd. >>>>>> (NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ >>>>>> :obj :str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two >>>>>> put: str]; addField: [ :obj :str | obj at: #three put: str]; upToEnd. >>>>>> (NeoCSVReader on: input readStream) recordClass: Dictionary; >>>>>> emptyFieldValue: #passNil; addField: [ :obj :str | obj at: #one put: >>>>>> str]; addField: [ :obj :str | obj at: #two put: str]; addField: [ :obj >>>>>> :str | obj at: #three put: str]; upToEnd. >>>>>> >>>>>> In my opinion there are two distinct issues: >>>>>> >>>>>> 1. what to do when you define a specific number of fields to be read and >>>>>> there are not enough of them in the input (underflow), or there are too >>>>>> many of them in the input (overflow). >>>>>> >>>>>> it is clear that the underflow case is wrong and a bug that has to be >>>>>> fixed. >>>>>> the overflow case seems OK (resulting in nil fields) >>>>>> >>>>>> 2. to validate the input (a functionality not yet present) >>>>>> >>>>>> this would basically mean to signal an error in the under or overflow >>>>>> case. >>>>>> but wrong type conversions should be errors too. >>>>>> >>>>>> I understand that you want to validate foreign input. >>>>>> >>>>>> It is a pity that you cannot produce an infinite loop example, that >>>>>> would also be useful. >>>>>> >>>>>> That's it for now, I will come back to you. >>>>>> >>>>>> Regards, >>>>>> >>>>>> Sven >>>>>> >>>>>>> On 4 Jan 2021, at 14:46, jtuc...@objektfabrik.de wrote: >>>>>>> >>>>>>> Please find attached a small test case to demonstrate what I mean. >>>>>>> There is just some nonsense Business Object class and a simple test >>>>>>> case in this fileout. >>>>>>> >>>>>>> >>>>>>> Am 04.01.21 um 14:36 schrieb jtuc...@objektfabrik.de: >>>>>>>> Happy new year to all of you! May 2021 be an increasingly less crazy >>>>>>>> year than 2020... >>>>>>>> >>>>>>>> >>>>>>>> I have a question that sounds a bit strange, but we have two effects >>>>>>>> with NeoCSVReader related to wrong definitions of the reader. >>>>>>>> >>>>>>>> One effect is that reading a Stream #upToEnd leads to an endless loop, >>>>>>>> the other is that the Reader produces twice as many objects as there >>>>>>>> are lines in the file that is being read. >>>>>>>> >>>>>>>> In both scenarios, the reason is that the CSV Reader has a wrong >>>>>>>> number of column definitions. >>>>>>>> >>>>>>>> Of course that is my fault: why do I feed a "malformed" CSV file to >>>>>>>> poor NeoCSVReader? >>>>>>>> >>>>>>>> Let me explain: we have a few import interfaces which end users can >>>>>>>> define using a more or less nice assistant in our Application. The CSV >>>>>>>> files they upload to our App come from third parties like payment >>>>>>>> providers, banks and other sources. These change their file structures >>>>>>>> whenever they feel like it and never tell anybody. So a CSV import >>>>>>>> that may have been working for years may one day tear a whole web >>>>>>>> server image down because of a wrong number of fieldAccessors. This is >>>>>>>> bad on many levels. >>>>>>>> >>>>>>>> You can easily try the doubling effect at home: define a working CSV >>>>>>>> Reader and comment out one of the addField: commands before you use >>>>>>>> the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines >>>>>>>> with 4 columns each. If you remove one of the fieldAccessors, an >>>>>>>> #upToEnd will yoield an Array of 6 objects rather than 3. >>>>>>>> >>>>>>>> I haven't found the reason for the cases where this leads to an >>>>>>>> endless loop, but at least this one is clear... >>>>>>>> >>>>>>>> I *guess* this is due to the way #readEndOfLine is implemented. It >>>>>>>> seems to not peek forward to the end of the line. I have the gut >>>>>>>> feeling #peekChar should peek instead of reading the #next character >>>>>>>> form the input Stream, but #peekChar has too many senders to just go >>>>>>>> ahead and mess with it ;-) >>>>>>>> >>>>>>>> So I wonder if there are any tried approaches to this problem. >>>>>>>> >>>>>>>> One thing I might do is not use #upToEnd, but read each line using >>>>>>>> PositionableStream>>#nextLine and first check each line if the number >>>>>>>> of separators matches the number of fieldAccessors minus 1 (and go >>>>>>>> through the hoops of handling separators in quoted fields and >>>>>>>> such...). Only if that test succeeds, I would then hand a Stream with >>>>>>>> the whole line to the reader and do a #next. >>>>>>>> >>>>>>>> This will, however, mean a lot of extra cycles for large files. Of >>>>>>>> course I could do this only for some lines, maybe just the first one. >>>>>>>> Whatever. >>>>>>>> >>>>>>>> >>>>>>>> But somehow I have the feeling I should get an exception telling me >>>>>>>> the line is not compatible to the Reader's definition or such. Or >>>>>>>> #readAtEndOrEndOfLine should just walk the line to the end and ignore >>>>>>>> the rest of the line, returnong an incomplete object.... >>>>>>>> >>>>>>>> >>>>>>>> Maybe I am just missing the right setting or switch? What best >>>>>>>> practices did you guys come up with for such problems? >>>>>>>> >>>>>>>> >>>>>>>> Thanks in advance, >>>>>>>> >>>>>>>> >>>>>>>> Joachim >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> -- >>>>>>> ----------------------------------------------------------------------- >>>>>>> Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de >>>>>>> Fliederweg 1 http://www.objektfabrik.de >>>>>>> D-71640 Ludwigsburg http://joachimtuchel.wordpress.com >>>>>>> Telefon: +49 7141 56 10 86 0 Fax: +49 7141 56 10 86 1 >>>>>>> >>>>>>> >>>>>>> <NeoCSVEndlessLoopTest.st> >>>>> >>>>> -- >>>>> ----------------------------------------------------------------------- >>>>> Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de >>>>> Fliederweg 1 http://www.objektfabrik.de >>>>> D-71640 Ludwigsburg http://joachimtuchel.wordpress.com >>>>> Telefon: +49 7141 56 10 86 0 Fax: +49 7141 56 10 86 1 >>>>> >>> >> > > -- > ----------------------------------------------------------------------- > Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de > Fliederweg 1 http://www.objektfabrik.de > D-71640 Ludwigsburg http://joachimtuchel.wordpress.com > Telefon: +49 7141 56 10 86 0 Fax: +49 7141 56 10 86 1 >