[Pharo-users] Re: NeoCSVReader and wrong number of fieldAccessors

jtuc...@objektfabrik.de Mon, 04 Jan 2021 05:46:25 -0800

Please find attached a small test case that demonstrates what I mean. The fileout contains just a nonsense business object class and a simple test case.


On 04.01.21 at 14:36, jtuc...@objektfabrik.de wrote:

Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...
I have a question that sounds a bit strange: we see two effects with NeoCSVReader that are caused by wrong definitions of the reader.
One effect is that reading a stream with #upToEnd leads to an endless loop; the other is that the reader produces twice as many objects as there are lines in the file being read.
In both scenarios, the reason is that the CSV reader has the wrong number of column definitions.
Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?
Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our application. The CSV files they upload to our app come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.
You can easily try the doubling effect at home: define a working CSV reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.
I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...
I *guess* this is due to the way #readEndOfLine is implemented. It seems not to peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)
So I wonder if there are any tried approaches to this problem.
One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check for each line whether the number of separators matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds would I hand a Stream with the whole line to the reader and do a #next.
This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.
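
A minimal sketch of the "check only the first line" idea, for illustration only. It assumes that a NeoCSVReader without any field definitions answers each record as an Array of Strings, so the size of the first record can be compared with the number of fieldAccessors. The variable names (expectedColumns, firstRecord, compatible) are made up for this example and are not part of NeoCSV.

```smalltalk
"Sketch only: probe the first record with a reader that has no field definitions.
 Assumption: such a reader answers each record as an Array of Strings."
| input expectedColumns firstRecord compatible |
input := '"Line1";"15.0s2";3000;"Smalltalk is cool"
"Line2";"25.3s3";1000;"JavaScript is said to be cool"'.

"Number of addField: calls on the real, configured reader (3 in the broken setup)."
expectedColumns := 3.

"Read just one record; quoted fields are handled by the reader itself."
firstRecord := (NeoCSVReader on: input readStream)
	separator: $;;
	next.

"Only hand the input to the configured reader when the column counts agree."
compatible := firstRecord size = expectedColumns.
```

With the sample input the probe sees four fields while the reader defines only three, so compatible is false and the import could be rejected with a meaningful message before #upToEnd is ever sent. This checks only the first record, so a file that changes its shape further down would still slip through.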
But somehow I have the feeling I should get an exception telling me the line is not compatible with the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of the line, returning an incomplete object....
Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?
Thanks in advance,


Joachim


--
-----------------------------------------------------------------------
Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
Fliederweg 1                         http://www.objektfabrik.de
D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
Phone: +49 7141 56 10 86 0           Fax: +49 7141 56 10 86 1

Object subclass: #AwesomeBusinessObject
        instanceVariableNames: 'shortText value count longText'
        classVariableNames: ''
        package: 'NeoCSVEndlessLoopTest'!

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
longText: anObject
        longText := anObject! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
shortText
        ^ shortText! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
shortText: anObject
        shortText := anObject! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
value: anObject
        value := anObject! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
count: anObject
        count := anObject! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:05'!
value
        ^ value! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
longText
        ^ longText! !

!AwesomeBusinessObject methodsFor: 'accessing' stamp: 'JoachimTuchel 1/4/2021 14:04'!
count
        ^ count! !


TestCase subclass: #NeoCSVEndlessLoopTestCase
        instanceVariableNames: 'reader'
        classVariableNames: ''
        package: 'NeoCSVEndlessLoopTest'!

!NeoCSVEndlessLoopTestCase methodsFor: 'running' stamp: 'JoachimTuchel 1/4/2021 14:42'!
setUp

        super setUp.
        
        reader := NeoCSVReader new.
        reader separator: $;.
        reader recordClass: AwesomeBusinessObject .
        
        reader addField: #shortText:.
"       reader addField: #value: converter: [:inp| ScaledDecimal fromString: inp].
"       reader addField: #count:.
        reader addField: #longText:.
        
        reader on: (ReadStream on: self input).! !

!NeoCSVEndlessLoopTestCase methodsFor: 'running' stamp: 'JoachimTuchel 1/4/2021 14:08'!
input

        ^'"Line1";"15.0s2";3000;"Smalltalk is cool"
"Line2";"25.3s3";1000;"JavaScript is said to be cool"
"Line 3";"1.0s2";8000;"Python seems to wipe the all from the table"'! !

!NeoCSVEndlessLoopTestCase methodsFor: 'running' stamp: 'JoachimTuchel 1/4/2021 14:07'!
testWrongNumberOfObjects


        | objects |

        objects := reader upToEnd.
        self assert: (objects size = 3).
        self assert: (objects allSatisfy: [:ea | ea longText isEmptyOrNil not])! !
