[Pharo-users] Re: NeoCSVReader and wrong number of fieldAccessors

jtuc...@objektfabrik.de Wed, 06 Jan 2021 02:22:01 -0800

Hi Sven,

I must say I am really happy with your change. We get a nice exception whenever the number of fields in a line doesn't match the number of defined fieldAccessors. So far it also seems the endless loops are gone. What a leap forward!
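
For readers who want to see what this looks like in client code, here is a minimal sketch of guarding an import against a mismatched reader definition. It assumes the mismatch is signalled as some subclass of Error (the exact exception class is not named in this thread); the sample input and the empty-collection fallback are made up for illustration.

```smalltalk
| input records |
"Two columns in the data, but only one field defined on the reader."
input := 'foo,1
bar,2'.
records := [ (NeoCSVReader on: input readStream)
		addField;
		upToEnd ]
	on: Error
	do: [ :ex |
		"Assumption: the reader signals some Error subclass on a mismatch."
		Transcript show: 'CSV import rejected: ', ex messageText; cr.
		"Fall back to an empty result instead of letting the error propagate."
		ex return: #() ].
```

Catching plain Error is deliberately broad here; once the concrete exception class is known, a narrower handler would be the better choice.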

I'm adding an issue on GitHub about the conversion errors. I hope that is a convenient place for such comments/ideas?


Joachim






On 05.01.21 at 21:06, jtuc...@objektfabrik.de wrote:

Sven,
I tested your change with the file and filter (our own way of letting end users define CSV mappings) which used to send our application into an endless loop.
And voilà: we get an exception instead of a frozen image! I will give the conversion errors a test drive tomorrow.
I am absolutely happy with your change. Thank you very much.


Joachim
P.S.: I even learned a little bit about Iceberg. I am not really sure each of my mouse clicks made sense, but I had your commit in the image and could test it and port the deltas over to my Smalltalk dialect...

On 05.01.21 at 19:52, jtuc...@objektfabrik.de wrote:
Hi Sven,


all I can say is: wow. I have no words.
I will have to learn a bit about Pharo and GitHub real quick now in order to try your changes....
Thank you very much. I'll give you feedback as fast as I can.
(And forget my questions about #readAtEndOrEndOfLine. I somehow didn't understand that it is expected to return a Boolean. Not sure why. I thought of 'read' as a command, not a question in the past tense, so I thought its job should be to read the rest of the line if we're not there yet.)
Joachim









On 05.01.21 at 17:49, Sven Van Caekenberghe wrote:
Hi Joachim,

Have a look at the following commit:
https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39
and specifically the added unit tests. These should help clarify the new behaviour.
If anything is not clear, please ask.

HTH,

Sven
On 5 Jan 2021, at 08:49, jtuc...@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)
While I think some kind of validation could help with parsing CSV, I remember reading your comment on this in some other discussion long ago. You wrote that you don't see it as a responsibility of a parser and that you wouldn't want to add this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing can only cover part of the problems related to validation. There will be checks that require access to other fields from the same line, or some object that will be the owner of the Collection that you are just importing, so a lot of validation must be done after parsing anyway.
So I think we can mostly ignore the validation part. Whatever a reader will do, it will not be good enough.
A nice way of exposing conversion errors for fields created with #addField:converter: would help a lot, however.
I am glad you agree on the underflow bug. This is more a question of well-formedness than of validation. If a reader finds out it doesn't fit a file structure, it should tell the user/developer about it or at least gracefully return some more or less incomplete object resembling what it could parse. But it shouldn't cross line borders and return a wrong number of objects.
I will definitely continue my hunt for the endless loop. It is not an ideal situation if one user of our Seaside application completely blocks an image that may be serving a few other users by just using a CSV parser that doesn't fit the file. I suspect this has to do with #readEndOfLine in some special case of the underflow bug, but cannot prove it yet. But I have a file and parser that reliably go into an endless loop. I just need to isolate the bare CSV parsing from the whole machinery we've built around NeoCSV reader for these user-defined mappings... I wouldn't be surprised if it is a problem buried somewhere in our preparations in building a parser from user-defined data... I will report my progress here, I promise!
One question I keep thinking about in NeoCSV: you implemented a method called #peekChar, but it doesn't #peek. It buffers a character and does read the #next character. I tried replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because #peekChar is used a lot, not only for consuming an "unmapped remainder" of a line... I somehow have the feeling that at least in #readEndOfLine the next char should be peeked instead of consumed in order to find out if it's workload or part of the crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more data after all fieldAccessors have been applied to the line (see #readNextRecordAsObject)? OTOH, at one point the reader has to skip to the next line, so I am not sure if peek has any place here... I need to debug a little more to understand...
Joachim






On 04.01.21 at 20:57, Sven Van Caekenberghe wrote:
Hi Joachim,
Thanks for the detailed feedback. This is most helpful. I need to think more about this and experiment a bit. This is what I came up with in a Workspace/Playground:
input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.
(NeoCSVReader on: input readStream)
	recordClass: Dictionary;
	addField: [ :obj :str | obj at: #one put: str ];
	upToEnd.
(NeoCSVReader on: input readStream)
	recordClass: Dictionary;
	addField: [ :obj :str | obj at: #one put: str ];
	addField: [ :obj :str | obj at: #two put: str ];
	addField: [ :obj :str | obj at: #three put: str ];
	upToEnd.
(NeoCSVReader on: input readStream)
	recordClass: Dictionary;
	emptyFieldValue: #passNil;
	addField: [ :obj :str | obj at: #one put: str ];
	addField: [ :obj :str | obj at: #two put: str ];
	addField: [ :obj :str | obj at: #three put: str ];
	upToEnd.

In my opinion there are two distinct issues:
1. What to do when you define a specific number of fields to be read and there are not enough of them in the input (underflow), or there are too many of them in the input (overflow).
It is clear that the underflow case is wrong and a bug that has to be fixed.
The overflow case seems OK (resulting in nil fields).

2. Validating the input (a functionality not yet present).
This would basically mean signalling an error in the underflow or overflow case,
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.
It is a pity that you cannot produce an infinite loop example, that would also be useful.
That's it for now, I will come back to you.

Regards,

Sven
On 4 Jan 2021, at 14:46, jtuc...@objektfabrik.de wrote:
Please find attached a small test case to demonstrate what I mean. There is just some nonsense Business Object class and a simple test case in this fileout.

On 04.01.21 at 14:36, jtuc...@objektfabrik.de wrote:
Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...
I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.
One effect is that reading a Stream with #upToEnd leads to an endless loop, the other is that the reader produces twice as many objects as there are lines in the file that is being read.
In both scenarios, the reason is that the CSV reader has a wrong number of column definitions.
Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?
Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our application. The CSV files they upload to our app come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.
You can easily try the doubling effect at home: define a working CSV reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.
I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...
I *guess* this is due to the way #readEndOfLine is implemented. It seems to not peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)
So I wonder if there are any tried approaches to this problem.
One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check for each line whether the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds would I then hand a Stream with the whole line to the reader and do a #next.
This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.
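
The per-line pre-check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input string, the variable names and the expected field count are made up, and the separator count is deliberately naive, so it miscounts lines whose quoted fields contain commas or line breaks.

```smalltalk
| input expectedFields records rejected |
input := 'foo,1,x
bar,2
foobar,3,z'.
expectedFields := 3.
records := OrderedCollection new.
rejected := OrderedCollection new.
input lines do: [ :line |
	"Naive check: counts every comma, including commas inside quoted fields."
	(line occurrencesOf: $,) = (expectedFields - 1)
		ifTrue: [
			"Only lines that pass the check are handed to a reader, one record each."
			records add: ((NeoCSVReader on: line readStream)
				addField; addField; addField;
				next) ]
		ifFalse: [ rejected add: line ] ].
```

Creating a fresh reader for every line is what makes this expensive for large files; checking only the first line, as suggested above, would trade safety for speed.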
But somehow I have the feeling I should get an exception telling me the line is not compatible with the reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....
Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?
Thanks in advance,


Joachim
--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel mailto:jtuc...@objektfabrik.de
Fliederweg 1 http://www.objektfabrik.de
D-71640 Ludwigsburg http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1


<NeoCSVEndlessLoopTest.st>
