FYI, I recently revisited state-of-the-art CSV parsing libraries for Emma. I think this comparison might be useful:
https://github.com/uniVocity/csv-parsers-comparison

The uniVocity parsers library seems to be dominating the benchmarks and is
feature-complete. As far as I can tell, uniVocity is currently also the only
library backing Spark's DataFrame / Dataset CSV support, which for some time
supported multiple parsing-library backends.

Regards,
Alexander

On Fri, Mar 10, 2017 at 11:17 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> If you already have an idea on how to proceed, maybe I can try to take care
> of it and issue a PR using commons-csv or whatever library you prefer.
>
> On 10 Mar 2017 22:07, "Fabian Hueske" <fhue...@gmail.com> wrote:
>
> Hi Flavio,
>
> Flink's CsvInputFormat was originally meant to be an efficient way to parse
> structured text files and dates back to the very early days of the project
> (probably 2011 or so).
> It was never meant to be compliant with the RFC specification and initially
> didn't support many features like quoting, quote escaping, etc. Some of
> these were added later, but others were not.
>
> I agree that the requirements for the CsvInputFormat have changed as more
> people are using the project, and that a standard-compliant parser would be
> desirable.
> We could definitely look into using an existing library for the parsing,
> but it would still need to be integrated with the way that Flink's
> InputFormats work. For instance, your approach isn't standard-compliant
> either, because TextInputFormat is not aware of quotes and would break
> records with quoted record delimiters (FLINK-6016 [1]).
>
> I would be OK with having a less efficient format which is not based on the
> current implementation but which is standard-compliant.
> IMO that would be a very useful contribution.
>
> Best, Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-6016
>
> 2017-03-10 11:28 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
> > Hi to all,
> > I want to discuss with the dev group something about CSV parsing.
> > Since I started using Flink with CSVs, I have always faced some little
> > problems here and there, and the new tickets about CSV parsing seem to
> > confirm that this part is still problematic.
> > In my production jobs I gave up using Flink's CSV parsing in favour of
> > Apache commons-csv and it works great. It's perfectly configurable and
> > robust. A working example is available at [1].
> >
> > Thus, why not use that library directly and contribute back (if needed)
> > to another Apache library if improvements are required to speed up the
> > parsing? Have you ever tried to compare the performance of the two
> > parsers?
> >
> > Best,
> > Flavio
> >
> > [1] https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
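
P.S. For anyone following along, below is a minimal, self-contained sketch of
the behaviour being discussed (a quoted field containing the record delimiter,
i.e. the FLINK-6016 case) using Apache commons-csv. The class name and sample
input are made up for illustration and are not taken from the linked
Csv2RowExample; treat it as a sketch of the approach, not the actual code.

import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedDelimiterSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input: the "comment" field of the first record
        // contains a quoted line break, i.e. the case that breaks a
        // line-oriented reader that is not aware of quotes.
        String input = "id,comment\r\n"
                + "1,\"first line\r\nsecond line\"\r\n"
                + "2,plain\r\n";

        // RFC 4180 rules, treating the first record as the header row.
        CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();

        try (CSVParser parser = format.parse(new StringReader(input))) {
            for (CSVRecord record : parser) {
                // The quoted CRLF stays inside the "comment" field, so this
                // loop sees exactly two records, not three.
                System.out.println(record.get("id") + " -> " + record.get("comment"));
            }
        }
    }
}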