You aren't sure what point I was making? How about the one I actually wrote down: what test data was NeoCSV benchmarked with, and can I get my hands on it? THAT is the point. The data points I showed (and many others I have not) are not satisfactory to me. I have been searching for CSV test collections. One site offered 6 files, of which only one downloaded. I found a "benchmark suite" for CSV containing no actual CSV files. So where *else* should I look for benchmark data than alongside a parser that people in this community are generally happy with and that is described as "efficient"?
Is it so unreasonable to suspect that my results might be a fluke? Is it bad manners to assume that something described as efficient has tests showing that?

On Wed, 6 Jan 2021 at 22:23, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:

> Richard,
>
> I am not sure what point you are trying to make here.
> You have something cooler and faster? Great, how about sharing?
> You could make a faster one if it doesn't convert numbers and stuff? Great. I guess in 95% of the use cases the time will be spent after parsing anyway. It depends. And that is exactly what you are saying: the word "efficient" means nothing without context. How is that related to this thread?
>
> I think this thread mostly shows the strength of a community, especially when there are members who are active, friendly and highly motivated. My problem got solved at blazing speed without me paying anything for it, just because Sven thought my problem could be other people's problem as well.
>
> I am happy with NeoCSV's speed, even if there may be more lightweight and faster solutions. Tbh, my main concern with NeoCSV is not speed, but how well I can understand problems and fix them. I care about data types on parsing. A non-configurable CSV parser gives me a bunch of dictionaries and Strings. That could be a waste of cycles and memory once you need the data as objects.
> My use case is not importing trillions of records all day, and for a few hundred or maybe sometimes thousands it is good/fast enough.
>
>
> Joachim
>
>
> On 06.01.21 at 05:10, Richard O'Keefe wrote:
>
> NeoCSVReader is described as efficient. What is that in comparison to? What benchmark data are used? Here are benchmark results measured today.
>
> (5,000 data line file, 9,145,009 characters)
> method                  time (ms)
> Just read characters          410
> CSVDecoder>>next             3415   astc's CSV reader (defaults)    1.26 x CSVParser
> NeoCSVReader>>next           4798   NeoCSVReader (default state)    1.78 x CSVParser
> CSVParser>>next              2701   pared-to-the-bone CSV reader    1.00   reference
>
> (10,000 data line file, 1,544,836 characters)
> method                  time (ms)
> Just read characters           93
> CSVDecoder>>next              530   astc's CSV reader (defaults)    1.26 x CSVParser
> NeoCSVReader>>next            737   NeoCSVReader (default state)    1.75 x CSVParser
> CSVParser>>next               421   pared-to-the-bone CSV reader    1.00   reference
>
> CSVParser is just 78 lines and is not customisable. It really is stripped to pretty much an absolute minimum. All of the parsers were configured (where that made sense) to return an Array of Strings.
> Many of the CSV files I've worked with use short records instead of ending a line with a lot of commas. Some of them also have the occasional stray comment off to the right, not mentioned in the header. I've also found it necessary to skip multiple lines at the beginning and/or end. (Really, some government agencies seem to have NO idea that anyone might want to do more with a CSV file than eyeball it in Excel.)
>
> If there is a benchmark suite I can use to improve CSVDecoder, I would like to try it out.
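[For reference, one way to reproduce this kind of timing in a Pharo image is sketched below. It is only an illustration, not the harness behind the numbers above; it assumes NeoCSV is loaded and that a file named 'sample.csv' sits next to the image. A NeoCSVReader in its default state answers each record as an Array of Strings.]

    | file ms records |
    file := 'sample.csv' asFileReference.
    ms := Time millisecondsToRun: [
        file readStreamDo: [ :stream |
            records := (NeoCSVReader on: stream) upToEnd ] ].
    Transcript
        show: records size printString , ' records parsed in ' , ms printString , ' ms';
        cr.

[A fair comparison would repeat each run several times over the same file and take the best or median time, since the first run also pays for warming up the image and the file cache.]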
> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de <jtuc...@objektfabrik.de> wrote:
>
>> Happy new year to all of you! May 2021 be an increasingly less crazy year than 2020...
>>
>> I have a question that sounds a bit strange, but we have two effects with NeoCSVReader related to wrong definitions of the reader.
>>
>> One effect is that reading a Stream with #upToEnd leads to an endless loop; the other is that the Reader produces twice as many objects as there are lines in the file being read.
>>
>> In both scenarios, the reason is that the CSV Reader has a wrong number of column definitions.
>>
>> Of course that is my fault: why do I feed a "malformed" CSV file to poor NeoCSVReader?
>>
>> Let me explain: we have a few import interfaces which end users can define using a more or less nice assistant in our application. The CSV files they upload to our app come from third parties like payment providers, banks and other sources. These change their file structures whenever they feel like it and never tell anybody. So a CSV import that may have been working for years may one day tear a whole web server image down because of a wrong number of fieldAccessors. This is bad on many levels.
>>
>> You can easily try the doubling effect at home: define a working CSV Reader and comment out one of the addField: commands before you use the NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4 columns each. If you remove one of the fieldAccessors, an #upToEnd will yield an Array of 6 objects rather than 3.
>>
>> I haven't found the reason for the cases where this leads to an endless loop, but at least this one is clear...
>>
>> I *guess* this is due to the way #readEndOfLine is implemented. It seems not to peek forward to the end of the line. I have the gut feeling #peekChar should peek instead of reading the #next character from the input Stream, but #peekChar has too many senders to just go ahead and mess with it ;-)
>>
>> So I wonder if there are any tried approaches to this problem.
>>
>> One thing I might do is not use #upToEnd, but read each line using PositionableStream>>#nextLine and first check whether the number of separators in the line matches the number of fieldAccessors minus 1 (and go through the hoops of handling separators in quoted fields and such...). Only if that test succeeds would I then hand a Stream with the whole line to the reader and do a #next.
>>
>> This will, however, mean a lot of extra cycles for large files. Of course I could do this only for some lines, maybe just the first one. Whatever.
>>
>> But somehow I have the feeling I should get an exception telling me the line is not compatible with the Reader's definition or such. Or #readAtEndOrEndOfLine should just walk the line to the end and ignore the rest of it, returning an incomplete object....
>>
>> Maybe I am just missing the right setting or switch? What best practices did you guys come up with for such problems?
>>
>> Thanks in advance,
>>
>> Joachim
>>
>
> --
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telephone: +49 7141 56 10 86 0       Fax: +49 7141 56 10 86 1
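[Below is a minimal sketch of the doubling effect Joachim describes above: a 4-column CSV read with only three field definitions. It assumes NeoCSV is loaded and that the unary #addField variant (which builds indexable Array records) is available; the resulting count is only what the thread reports, and behaviour may differ in other NeoCSV versions.]

    | input reader records |
    input := 'h1,h2,h3,h4' , String cr ,
             '1,2,3,4' , String cr ,
             '5,6,7,8' , String cr ,
             '9,10,11,12'.
    reader := NeoCSVReader on: input readStream.
    reader
        skipHeader;
        addField;
        addField;
        addField.    "one field definition too few for a 4-column file"
    records := reader upToEnd.
    records size     "3 data lines; the thread reports 6 records here"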
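[And here is a rough sketch of the per-line pre-check Joachim mentions: read each line with #nextLine, compare its separator count against the number of configured fieldAccessors minus one, and only then hand the line to a reader. As he notes himself, this naive check does not handle separators inside quoted fields; the expected column count is written out by hand here rather than taken from the reader.]

    | raw expectedColumns stream good skipped records |
    expectedColumns := 4.    "the number of fieldAccessors the reader is configured with"
    raw := '1,2,3,4' , String cr ,
           '5,6,7' , String cr ,        "this short line will fail the check"
           '9,10,11,12'.
    good := OrderedCollection new.
    skipped := OrderedCollection new.
    stream := raw readStream.
    [ stream atEnd ] whileFalse: [ | line |
        line := stream nextLine.
        "naive separator count; quoted fields are not considered"
        (line occurrencesOf: $,) = (expectedColumns - 1)
            ifTrue: [ good add: line ]
            ifFalse: [ skipped add: line ] ].
    records := good collect: [ :line |
        (NeoCSVReader on: line readStream) next ].
    { records . skipped }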