[Pharo-users] Re: NeoCSVReader and wrong number of fieldAccessors

jtuc...@objektfabrik.de Tue, 05 Jan 2021 10:53:17 -0800

Hi Sven,


all I can say is: wow. I have no words.

I will have to learn a bit about Pharo and github real quick now inorder to try your changes....


Thank you very much. I'll give you feedback as fast as I can.

(And forget my questions about #readAtEndOrEndOfLine. I somhow didn'tunderstand it is expected to return a Boolean. Not sure why. I thoughtof 'read' as a command, not a question in simple past..., so I thoughtits job should be to read the rest of the line if we're not there yet)



Joachim









Am 05.01.21 um 17:49 schrieb Sven Van Caekenberghe:

Hi Joachim,

Have a look at the following commit:

  
https://github.com/svenvc/NeoCSV/commit/a3d6258c28138fe3b15aa03ae71cf1e077096d39

and specifically the added unit tests. These should help clarify the new 
behaviour.

If anything is not clear, please ask.

HTH,

Sven

On 5 Jan 2021, at 08:49, jtuc...@objektfabrik.de wrote:

Sven,

first of all thanks a lot for taking your time with this!

Your test case is so beautifully small I can't believe it ;-)

While I think some kind of validation could help with parsing CSV, I remember 
reading your comment on this in some other discussion long ago. You wrote you 
don't see it as a responsibility of a parser and that you wouldn't want to add 
this to NeoCSV. I must say I tend to agree mostly. Whatever you do at parsing 
can only cover part of the problems related to validation. There will be checks 
that require access to other fields from the same line, or some object that 
will be the owner of the Collection that you are just importing, so a lot of 
validation must be done after parsing anyways.

So I think we can mostly ignore the validation part. Whatever a reader will do, 
it will not be good enough.

A nice way of exposing conversion errors for fields created with 
#addField:converter: would help a lot, however.

I am glad you agree on the underflow bug. This is more a question of 
well-formedness than of validation. If a reader finds out it doesn't fit for a 
file structure, it should tell the user/developer about it or at least 
gracefully return some more or less incomplete object resembling what it could 
parse. But it shouldn't cross line borders and return a wrong number of objects.


I will definitely continue my hunt for the endless loop. It is not an ideal 
situation if one user of our Seaside Application completely blocks an image 
that may be serving a few other users by just using a CVS parser that doesn't 
fit with the file. I suspect this has to do with #readEndOfLine in some special 
case of the underflow bug, but cannot prove it yet. But I have a file and 
parser that reliably goes into an endless loop. I just need to isolate the bare 
CSV parsing from the whole machinery we've build around NeoCSV reader for these 
user-defined mappings... I wouldn't be surprised if it is a problem buried 
somewhere in our preparations in building a parser from user-defined data... I 
will report my progress here, I promise!


One question I keep thinking about in NeoCSV: You implemented a method called #peekChar, 
but it doesn't #peek. It buffers a character and does read the #next character. I tried 
replacing the #next with #peek, but that is definitely a shortcut to 100% CPU, because 
#peekChar is used a lot, not only for consuming an "unmapped remainder" of a 
line... I somehow have the feeling that at least in #readEndOfLine the next char should 
bee peeked instead of consumed in order to find out if it's workload or part of the 
crlf/lf...
Shouldn't a reader step forward by using #peek to see whether there is more 
data after all fieldAccessors have been applied to the line (see 
#readNextRecordAsObject)? Otoh, at one point the reader has to skip to the next 
line, so I am not sure if peek has any place here... I need to debug a little 
more to understand...



Joachim






Am 04.01.21 um 20:57 schrieb Sven Van Caekenberghe:

Hi Joachim,

Thanks for the detailed feedback. This is most helpful. I need to think more 
about this and experiment a bit. This is what I came up with in a 
Workspace/Playground:

input := 'foo,1
bar,2
foobar,3'.

(NeoCSVReader on: input readStream) upToEnd.
(NeoCSVReader on: input readStream) addField; upToEnd.
(NeoCSVReader on: input readStream) addField; addField; addField; upToEnd.

(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj 
:str | obj at: #one put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; addField: [ :obj 
:str | obj at: #one put: str]; addField: [ :obj :str | obj at: #two put: str]; 
addField: [ :obj :str | obj at: #three put: str]; upToEnd.
(NeoCSVReader on: input readStream) recordClass: Dictionary; emptyFieldValue: 
#passNil; addField: [ :obj :str | obj at: #one put: str]; addField: [ :obj :str 
| obj at: #two put: str]; addField: [ :obj :str | obj at: #three put: str]; 
upToEnd.

In my opinion there are two distinct issues:

1. what to do when you define a specific number of fields to be read and there 
are not enough of them in the input (underflow), or there are too many of them 
in the input (overflow).

it is clear that the underflow case is wrong and a bug that has to be fixed.
the overflow case seems OK (resulting in nil fields)

2. to validate the input (a functionality not yet present)

this would basically mean to signal an error in the under or overflow case.
but wrong type conversions should be errors too.

I understand that you want to validate foreign input.

It is a pity that you cannot produce an infinite loop example, that would also 
be useful.

That's it for now, I will come back to you.

Regards,

Sven

On 4 Jan 2021, at 14:46, jtuc...@objektfabrik.de wrote:

Please find attached a small test case to demonstrate what I mean. There is 
just some nonsense Business Object class and a simple test case in this fileout.


Am 04.01.21 um 14:36 schrieb jtuc...@objektfabrik.de:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 
2020...


I have a question that sounds a bit strange, but we have two effects with 
NeoCSVReader related to wrong definitions of the reader.

One effect is that reading a Stream #upToEnd leads to an endless loop, the 
other is that the Reader produces twice as many objects as there are lines in 
the file that is being read.

In both scenarios, the reason is that the CSV Reader has a wrong number of 
column definitions.

Of course that is my fault: why do I feed a "malformed" CSV file to poor 
NeoCSVReader?

Let me explain: we have a few import interfaces which end users can define 
using a more or less nice assistant in our Application. The CSV files they 
upload to our App come from third parties like payment providers, banks and 
other sources. These change their file structures whenever they feel like it 
and never tell anybody. So a CSV import that may have been working for years 
may one day tear a whole web server image down because of a wrong number of 
fieldAccessors. This is bad on many levels.

You can easily try the doubling effect at home: define a working CSV Reader and 
comment out one of the addField: commands before you use the NeoCSVReader to 
parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you 
remove one of the fieldAccessors, an #upToEnd will yoield an Array of 6 objects 
rather than 3.

I haven't found the reason for the cases where this leads to an endless loop, 
but at least this one is clear...

I *guess* this is due to the way #readEndOfLine is implemented. It seems to not 
peek forward to the end of the line. I have the gut feeling #peekChar should 
peek instead of reading the #next character form the input Stream, but 
#peekChar has too many senders to just go ahead and mess with it ;-)

So I wonder if there are any tried approaches to this problem.

One thing I might do is not use #upToEnd, but read each line using 
PositionableStream>>#nextLine and first check each line if the number of 
separators matches the number of fieldAccessors minus 1 (and go through the hoops of 
handling separators in quoted fields and such...). Only if that test succeeds, I 
would then hand a Stream with the whole line to the reader and do a #next.

This will, however, mean a lot of extra cycles for large files. Of course I 
could do this only for some lines, maybe just the first one. Whatever.


But somehow I have the feeling I should get an exception telling me the line is 
not compatible to the Reader's definition or such. Or #readAtEndOrEndOfLine 
should just walk the line to the end and ignore the rest of the line, returnong 
an incomplete object....


Maybe I am just missing the right setting or switch? What best practices did 
you guys come up with for such problems?


Thanks in advance,


Joachim

--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
Fliederweg 1                         http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1


<NeoCSVEndlessLoopTest.st>


--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
Fliederweg 1                         http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1


--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
Fliederweg 1                         http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1

[Pharo-users] Re: NeoCSVReader and wrong number of fieldAccessors

Reply via email to