Hi Joachim,

Have a look at the following commit:
https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically at the added unit tests. These should help clarify the new behaviour. If anything is not clear, please ask.

HTH,

Sven

> On 5 Jan 2021, at 08:49, jtuc...@objektfabrik.de wrote:
>
> Sven,
>
> first of all, thanks a lot for taking your time with this!
>
> Your test case is so beautifully small I can't believe it ;-)
>
> While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote that you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I mostly agree. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or to some object that will be the owner of the Collection you are just importing, so a lot of validation must be done after parsing anyway.
>
> So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.
>
> A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.
>
> I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit the file structure, it should tell the user/developer about it, or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.
>
> I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside application can completely block an image that may be serving a few other users, just by using a CSV parser that doesn't fit the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but I cannot prove it yet. I do have a file and parser that reliably go into an endless loop; I just need to isolate the bare CSV parsing from the whole machinery we've built around the NeoCSV reader for these user-defined mappings... I wouldn't be surprised if the problem is buried somewhere in our preparations for building a parser from user-defined data... I will report my progress here, I promise!
>
> One question I keep thinking about in NeoCSV: you implemented a method called #peekChar, but it doesn't #peek. It buffers a character and reads the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that, at least in #readEndOfLine, the next character should be peeked instead of consumed, in order to find out whether it is payload data or part of the CR/LF line ending... Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? OTOH, at some point the reader has to skip to the next line, so I am not sure peek has any place here... I need to debug a little more to understand...
>
> Joachim
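(As background for the #peek vs. #next question: a minimal Playground sketch, independent of NeoCSV's internals. #peek answers the next character without advancing the stream position, while #next consumes it; a loop that only ever peeks therefore never makes progress, which matches the 100% CPU observation above. This assumes nothing beyond plain Pharo streams:)

| stream |
stream := ReadStream on: 'ab'.
stream peek.	"$a - the position does not move"
stream peek.	"$a - still the same character"
stream next.	"$a - now the character is consumed"
stream peek.	"$b"

"Testing with #peek but never calling #next spins forever:"
"[ stream peek = $b ] whileTrue: [ ]."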
> On 04.01.21 at 20:57, Sven Van Caekenberghe wrote:
>> Hi Joachim,
>>
>> Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit.
>>
>> This is what I came up with in a Workspace/Playground:
>>
>> input := 'foo,1
>> bar,2
>> foobar,3'.
>>
>> (NeoCSVReader on: input readStream) upToEnd.
>> (NeoCSVReader on: input readStream) addField; upToEnd.
>> (NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.
>>
>> (NeoCSVReader on: input readStream)
>> 	recordClass: Dictionary;
>> 	addField: [ :obj :str | obj at: #one put: str ];
>> 	upToEnd.
>> (NeoCSVReader on: input readStream)
>> 	recordClass: Dictionary;
>> 	addField: [ :obj :str | obj at: #one put: str ];
>> 	addField: [ :obj :str | obj at: #two put: str ];
>> 	addField: [ :obj :str | obj at: #three put: str ];
>> 	upToEnd.
>> (NeoCSVReader on: input readStream)
>> 	recordClass: Dictionary;
>> 	emptyFieldValue: #passNil;
>> 	addField: [ :obj :str | obj at: #one put: str ];
>> 	addField: [ :obj :str | obj at: #two put: str ];
>> 	addField: [ :obj :str | obj at: #three put: str ];
>> 	upToEnd.
>>
>> In my opinion there are two distinct issues:
>>
>> 1. What to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).
>>
>> It is clear that the underflow case is wrong and a bug that has to be fixed. The overflow case seems OK (resulting in nil fields).
>>
>> 2. Validating the input (a functionality not yet present).
>>
>> This would basically mean signalling an error in the underflow or overflow case. But wrong type conversions should be errors too.
>>
>> I understand that you want to validate foreign input.
>>
>> It is a pity that you cannot produce an infinite loop example; that would also be useful.
>>
>> That's it for now, I will come back to you.
>>
>> Regards,
>>
>> Sven
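(Until such validation exists, a field-count mismatch can at least be detected from the outside. A hedged sketch, assuming the reader's stream-like protocol (#next/#atEnd); the expected count of 2 and the error message are illustrative, not NeoCSV API:)

| input reader records |
input := 'foo,1
bar,2
foobar,3'.
reader := NeoCSVReader on: input readStream.
records := OrderedCollection new.
[ reader atEnd ] whileFalse: [
	| record |
	record := reader next. "with no field definitions, an Array of strings"
	record size = 2
		ifFalse: [ Error signal: 'Unexpected field count: ' , record size printString ].
	records add: record ].
records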
>>> On 4 Jan 2021, at 14:46, jtuc...@objektfabrik.de wrote:
>>>
>>> Please find attached a small test case to demonstrate what I mean. There is just some nonsense business object class and a simple test case in this fileout.
>>>
>>> On 04.01.21 at 14:36, jtuc...@objektfabrik.de wrote:
>>>> Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...
>>>>
>>>> I have a question that sounds a bit strange, but we see two effects with NeoCSVReader related to wrong definitions of the reader.
>>>>
>>>> One effect is that reading a Stream with #upToEnd leads to an endless loop; the other is that the reader produces twice as many objects as there are lines in the file being read.
>>>>
>>>> In both scenarios, the reason is that the CSV reader has the wrong number of column definitions.
>>>>
>>>> Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?
>>>>
>>>> Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our application. The CSV files they upload to our app come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years can one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.
>>>>
>>>> You can easily try the doubling effect at home: define a working CSV reader and comment out one of the addField: sends before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, #upToEnd will yield an Array of 6 objects rather than 3.
>>>>
>>>> I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...
>>>>
>>>> I *guess* this is due to the way #readEndOfLine is implemented. It seems not to peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input stream, but #peekChar has too many senders to just go ahead and mess with it ;-)
>>>>
>>>> So I wonder whether there are any tried approaches to this problem.
>>>>
>>>> One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check whether the number of separators in the line matches the number of fieldAccessors minus 1 (going through the hoops of handling separators in quoted fields and such...). Only if that test succeeds would I hand a stream on the whole line to the reader and do a #next.
>>>>
>>>> This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.
>>>>
>>>> But somehow I have the feeling I should get an exception telling me the line is not compatible with the reader's definition, or some such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of it, returning an incomplete object...
>>>>
>>>> Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Joachim
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
>>> Fliederweg 1                         http://www.objektfabrik.de
>>> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
>>> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
>>>
>>> <NeoCSVEndlessLoopTest.st>
>
> --
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1
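(A rough sketch of the line pre-check described above: count the separators that occur outside quoted fields before handing the line to the reader. This is purely illustrative; it assumes comma as the separator, handles quoting naively, and is not part of NeoCSV:)

| countSeparators line |
countSeparators := [ :aLine |
	| inQuotes count |
	inQuotes := false.
	count := 0.
	aLine do: [ :each |
		each = $" ifTrue: [ inQuotes := inQuotes not ].
		(each = $, and: [ inQuotes not ]) ifTrue: [ count := count + 1 ] ].
	count ].
line := 'foo,"a,b",3'.
countSeparators value: line. "2 - the comma inside the quoted field is not counted"

A line obtained with #nextLine could then be rejected up front whenever the count does not equal the number of fieldAccessors minus 1, trading one extra scan per line for a clean early failure instead of an endless loop or doubled records.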