> On 7 Jan 2021, at 07:15, Richard O'Keefe <rao...@gmail.com> wrote:
>
> You aren't sure what point I was making?
> How about the one I actually wrote down:
> What test data was NeoCSV benchmarked with
> and can I get my hands on it?
> THAT is the point. The data points I showed (and
> many others I have not shown) do not satisfy me.
> I have been searching for CSV test collections.
> One site offered 6 files of which only one downloaded.
> I found a "benchmark suite" for CSV containing no
> actual CSV files.
> So where *else* should I look for benchmark data than
> the data associated with a parser that people in this
> community are generally happy with and that is described
> as "efficient"?
Did you actually read my email and look at the code?
NeoCSVBenchmark generates its own test data.
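
The data itself is easy to reproduce if you want comparable input without
hunting for files. Something along these lines (just a sketch, not the
actual NeoCSVBenchmark code; the field layout is made up) builds an
N-line CSV string in memory:

   | lineCount out |
   lineCount := 10000.
   out := WriteStream on: (String new: lineCount * 40).
   1 to: lineCount do: [ :i |
       out
           print: i; nextPut: $,;
           nextPutAll: 'label-'; print: i; nextPut: $,;
           print: i * 100; nextPut: $,;
           nextPutAll: '"quoted, value"';
           nextPutAll: String crlf ].
   out contents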
> Is it so unreasonable to suspect that my results might
> be a fluke? Is it bad manners to assume that something
> described as efficient has tests showing that?
>
>
>
> On Wed, 6 Jan 2021 at 22:23, jtuc...@objektfabrik.de
> <jtuc...@objektfabrik.de> wrote:
> Richard,
>
> I am not sure what point you are trying to make here.
> You have something cooler and faster? Great, how about sharing?
> You could make a faster one if it doesn't convert numbers and such? Great.
> I guess that in 95% of the use cases, most of the time is spent after parsing
> anyway. It depends. And that is exactly what you are saying: the word
> "efficient" means nothing without context. How is that related to this thread?
>
> I think this thread mostly shows the strength of a community, especially when
> there are members who are active, friendly and highly motivated. My problem
> got solved at blazing speed without me paying anything for it, just because
> Sven thought my problem could be other people's problem as well.
>
> I am happy with NeoCSV's speed, even if there may be more lightweight and
> faster solutions. To be honest, my main concern with NeoCSV is not speed, but
> how well I can understand problems and fix them. I care about data types on
> parsing. A non-configurable CSV parser gives me a bunch of dictionaries and
> Strings, which can be a waste of cycles and memory once you need the data as
> objects. My use case is not importing trillions of records all day, and for a
> few hundred or sometimes a few thousand records, it is good and fast enough.
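>
> For illustration, something along these lines (just a sketch; Invoice and
> its setters are placeholders for our real domain classes) is what I mean
> by caring about data types on parsing:
>
>    (NeoCSVReader on: 'R-1,4200,true' readStream)
>        recordClass: Invoice;
>        addField: #number:;
>        addIntegerField: #amountInCents:;
>        addField: #paid: converter: [ :s | s = 'true' ];
>        upToEnd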
>
>
> Joachim
>
> On 06.01.21 at 05:10, Richard O'Keefe wrote:
>> NeoCSVReader is described as efficient. What is that
>> in comparison to? What benchmark data are used?
>> Here are benchmark results measured today.
>> (5,000 data line file, 9,145,009 characters).
>> method                 time (ms)   notes
>> Just read characters         410
>> CSVDecoder>>next            3415   astc's CSV reader (defaults), 1.26 x CSVParser
>> NeoCSVReader>>next          4798   NeoCSVReader (default state), 1.78 x CSVParser
>> CSVParser>>next             2701   pared-to-the-bone CSV reader, 1.00 x (reference)
>>
>> (10,000 data line file, 1,544,836 characters).
>> method                 time (ms)   notes
>> Just read characters          93
>> CSVDecoder>>next             530   astc's CSV reader (defaults), 1.26 x CSVParser
>> NeoCSVReader>>next           737   NeoCSVReader (default state), 1.75 x CSVParser
>> CSVParser>>next              421   pared-to-the-bone CSV reader, 1.00 x (reference)
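>>
>> For anyone who wants to reproduce this kind of measurement, a trivial
>> harness along these lines will do (just a sketch; 'input.csv' stands for
>> whichever test file is used):
>>
>>    | data charMs csvMs |
>>    data := 'input.csv' asFileReference contents.
>>    charMs := Time millisecondsToRun: [
>>        | in |
>>        in := data readStream.
>>        [ in atEnd ] whileFalse: [ in next ] ].
>>    csvMs := Time millisecondsToRun: [
>>        (NeoCSVReader on: data readStream) upToEnd ].
>>    Transcript
>>        show: 'chars: ', charMs printString, ' ms, NeoCSVReader: ',
>>            csvMs printString, ' ms';
>>        cr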
>>
>> CSVParser is just 78 lines and is not customisable. It really is
>> stripped to pretty much an absolute minimum. All of the parsers
>> were configured (if that made sense) to return an Array of Strings.
>> Many of the CSV files I've worked with use short records instead
>> of ending a line with a lot of commas. Some of them also have the
>> occasional stray comment off to the right, not mentioned in the header.
>> I've also found it necessary to skip multiple lines at the beginning
>> and/or end. (Really, some government agencies seem to have NO idea
>> that anyone might want to do more with a CSV file than eyeball it in
>> Excel.)
>>
>> If there is a benchmark suite I can use to improve CSVDecoder,
>> I would like to try it out.
>>
>> On Tue, 5 Jan 2021 at 02:36, jtuc...@objektfabrik.de
>> <jtuc...@objektfabrik.de> wrote:
>> Happy new year to all of you! May 2021 be an increasingly less crazy
>> year than 2020...
>>
>>
>> I have a question that sounds a bit strange, but we have two effects
>> with NeoCSVReader related to wrong definitions of the reader.
>>
>> One effect is that reading a Stream with #upToEnd leads to an endless loop;
>> the other is that the Reader produces twice as many objects as there are
>> lines in the file being read.
>>
>> In both scenarios, the reason is that the CSV Reader has the wrong number
>> of column definitions.
>>
>> Of course that is my fault: why do I feed a "malformed" CSV file to poor
>> NeoCSVReader?
>>
>> Let me explain: we have a few import interfaces which end users can
>> define using a more or less nice assistant in our Application. The CSV
>> files they upload to our App come from third parties like payment
>> providers, banks and other sources. These change their file structures
>> whenever they feel like it and never tell anybody. So a CSV import that
>> may have been working for years may one day tear a whole web server
>> image down because of a wrong number of fieldAccessors. This is bad on
>> many levels.
>>
>> You can easily try the doubling effect at home: define a working CSV
>> Reader and comment out one of the addField: sends before you use the
>> NeoCSVReader to parse a CSV file. Say your CSV file has 3 lines with 4
>> columns each. If you remove one of the fieldAccessors, #upToEnd will
>> yield an Array of 6 objects rather than 3.
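>>
>> A concrete sketch (Contact and its setters #name: #email: #phone: are
>> made up for illustration):
>>
>>    | input reader |
>>    input := String crlf join: #('n1,e1,p1,x1' 'n2,e2,p2,x2' 'n3,e3,p3,x3').
>>    reader := (NeoCSVReader on: input readStream)
>>        recordClass: Contact;
>>        addField: #name:;
>>        addField: #email:;
>>        addField: #phone:;
>>        yourself.
>>    "4 columns but only 3 addField: definitions: as described above,
>>     #upToEnd answers 6 objects instead of 3"
>>    reader upToEnd size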
>>
>> I haven't found the reason for the cases where this leads to an endless
>> loop, but at least this one is clear...
>>
>> I *guess* this is due to the way #readEndOfLine is implemented. It seems
>> not to peek forward to the end of the line. I have a gut feeling that
>> #peekChar should peek instead of reading the #next character from the
>> input Stream, but #peekChar has too many senders to just go ahead and
>> mess with it ;-)
>>
>> So I wonder if there are any tried approaches to this problem.
>>
>> One thing I might do is not use #upToEnd, but read each line using
>> PositionableStream>>#nextLine and first check, for each line, whether the
>> number of separators matches the number of fieldAccessors minus 1 (and go
>> through the hoops of handling separators in quoted fields and such...).
>> Only if that test succeeds would I hand a Stream with the whole line to
>> the reader and do a #next.
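>>
>> Roughly along these lines (only a sketch: it ignores separators inside
>> quoted fields, and csvStream / #configuredReaderOn: are placeholders for
>> the upload stream and for however the NeoCSVReader gets built from the
>> user's import definition):
>>
>>    | expectedColumns records |
>>    expectedColumns := 4.
>>    records := csvStream upToEnd lines collect: [ :line |
>>        (line occurrencesOf: $,) = (expectedColumns - 1)
>>            ifTrue: [ (self configuredReaderOn: line readStream) next ]
>>            ifFalse: [ self error: 'unexpected column count in: ', line ] ].
>>    records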
>>
>> This will, however, mean a lot of extra cycles for large files. Of
>> course I could do this only for some lines, maybe just the first one.
>> Whatever.
>>
>>
>> But somehow I have the feeling I should get an exception telling me the
>> line is not compatible with the Reader's definition, or something like that.
>> Or #readAtEndOrEndOfLine should just walk to the end of the line, ignore
>> the rest of it, and return an incomplete object...
>>
>>
>> Maybe I am just missing the right setting or switch? What best practices
>> did you guys come up with for such problems?
>>
>>
>> Thanks in advance,
>>
>>
>> Joachim
>>
>
>
> --
> -----------------------------------------------------------------------
> Objektfabrik Joachim Tuchel          mailto:jtuc...@objektfabrik.de
> Fliederweg 1                         http://www.objektfabrik.de
> D-71640 Ludwigsburg                  http://joachimtuchel.wordpress.com
> Telefon: +49 7141 56 10 86 0         Fax: +49 7141 56 10 86 1