Another point: in open source and in this community, either the code people mention is open-source and accessible, or it does not exist. If it does not exist, then this is easy :)
S.

> On 6 Jan 2021, at 05:10, Richard O'Keefe <rao...@gmail.com> wrote:
>
> NeoCSVReader is described as efficient. What is that
> in comparison to? What benchmark data are used?
> Here are benchmark results measured today.
>
> (5,000 data line file, 9,145,009 characters).
> method                  time(ms)
> Just read characters         410
> CSVDecoder>>next            3415   astc's CSV reader (defaults).   1.26 x CSVParser
> NeoCSVReader>>next          4798   NeoCSVReader (default state).   1.78 x CSVParser
> CSVParser>>next             2701   pared-to-the-bone CSV reader.   1.00 reference.
>
> (10,000 data line file, 1,544,836 characters).
> method                  time(ms)
> Just read characters          93
> CSVDecoder>>next             530   astc's CSV reader (defaults).   1.26 x CSVParser
> NeoCSVReader>>next           737   NeoCSVReader (default state).   1.75 x CSVParser
> CSVParser>>next              421   pared-to-the-bone CSV reader.   1.00 reference.
>
> CSVParser is just 78 lines and is not customisable. It really is
> stripped to pretty much an absolute minimum. All of the parsers
> were configured (if that made sense) to return an Array of Strings.
> Many of the CSV files I've worked with use short records instead
> of ending a line with a lot of commas. Some of them also have the
> occasional stray comment off to the right, not mentioned in the header.
> I've also found it necessary to skip multiple lines at the beginning
> and/or end. (Really, some government agencies seem to have NO idea
> that anyone might want to do more with a CSV file than eyeball it in
> Excel.)
>
> If there is a benchmark suite I can use to improve CSVDecoder,
> I would like to try it out.
>
> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:
>
> Happy new year to all of you! May 2021 be an increasingly less crazy
> year than 2020...
>
> I have a question that sounds a bit strange, but we have two effects
> with NeoCSVReader related to wrong definitions of the reader.
>
> One effect is that reading a Stream with #upToEnd leads to an endless
> loop; the other is that the Reader produces twice as many objects as
> there are lines in the file that is being read.
>
> In both scenarios, the reason is that the CSV Reader has a wrong number
> of column definitions.
>
> Of course that is my fault: why do I feed a "malformed" CSV file to poor
> NeoCSVReader?
>
> Let me explain: we have a few import interfaces which end users can
> define using a more or less nice assistant in our Application. The CSV
> files they upload to our App come from third parties like payment
> providers, banks and other sources. These change their file structures
> whenever they feel like it and never tell anybody. So a CSV import that
> may have been working for years may one day tear a whole web server
> image down because of a wrong number of fieldAccessors. This is bad on
> many levels.
>
> You can easily try the doubling effect at home: define a working CSV
> Reader and comment out one of the addField: commands before you use the
> NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
> columns each. If you remove one of the fieldAccessors, an #upToEnd will
> yield an Array of 6 objects rather than 3.
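For anyone who wants to try that repro at home, here is a minimal workspace sketch of the misconfiguration Joachim describes. It uses NeoCSV's stock positional API (NeoCSVReader class>>#on:, #addField, #upToEnd); the sample data is made up, and the 6-record outcome is the behaviour reported above, which may vary with the NeoCSV version in your image:

    | input reader records |
    "Hypothetical sample data: 3 records of 4 columns each."
    input := String cr join: #('a1,b1,c1,d1' 'a2,b2,c2,d2' 'a3,b3,c3,d3').
    reader := NeoCSVReader on: input readStream.
    reader
        addField;   "column 1, read as a plain String"
        addField;   "column 2"
        addField.   "column 3; the fourth accessor is deliberately missing"
    records := reader upToEnd.
    records size
    "3 with a matching field count; 6 in the failure mode described above,
    because the reader resynchronises mid-record instead of signalling."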
> I haven't found the reason for the cases where this leads to an endless
> loop, but at least this one is clear...
>
> I *guess* this is due to the way #readEndOfLine is implemented. It seems
> to not peek forward to the end of the line. I have the gut feeling
> #peekChar should peek instead of reading the #next character from the
> input Stream, but #peekChar has too many senders to just go ahead and
> mess with it ;-)
>
> So I wonder if there are any tried approaches to this problem.
>
> One thing I might do is not use #upToEnd, but read each line using
> PositionableStream>>#nextLine and first check whether the number of
> separators in each line matches the number of fieldAccessors minus 1
> (and go through the hoops of handling separators in quoted fields and
> such...). Only if that test succeeds, I would then hand a Stream with
> the whole line to the reader and do a #next.
>
> This will, however, mean a lot of extra cycles for large files. Of
> course I could do this only for some lines, maybe just the first one.
> Whatever.
>
> But somehow I have the feeling I should get an exception telling me the
> line is not compatible with the Reader's definition or such. Or
> #readAtEndOrEndOfLine should just walk the line to the end and ignore
> the rest of the line, returning an incomplete object...
>
> Maybe I am just missing the right setting or switch? What best practices
> did you guys come up with for such problems?
>
> Thanks in advance,
>
> Joachim

--------------------------------------------
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org
03 59 35 87 52
Assistant: Aurore Dalle
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley,
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France