FYI, I recently revisited state-of-the-art CSV parsing libraries for Emma. I think this comparison might be useful:
https://github.com/uniVocity/csv-parsers-comparison

The uniVocity parsers library seems to be dominating the benchmarks and is
feature-complete. As far as I can tell, uniVocity is currently also the only
library backing Spark's DataFrame / Dataset CSV support, which for some time
supported multiple parsing-library backends.

Regards,
Alexander

On Fri, Mar 10, 2017 at 11:17 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> If you already have an idea on how to proceed, maybe I can try to take care
> of it and issue a PR using commons-csv or whatever library you prefer.
>
> On 10 Mar 2017 22:07, "Fabian Hueske" <fhue...@gmail.com> wrote:
>
> Hi Flavio,
>
> Flink's CsvInputFormat was originally meant to be an efficient way to parse
> structured text files and dates back to the very early days of the project
> (probably 2011 or so).
> It was never meant to be compliant with the RFC specification and initially
> didn't support many features like quoting, quote escaping, etc. Some of
> these were added later, but others were not.
>
> I agree that the requirements for the CsvInputFormat have changed as more
> people are using the project, and that a standard-compliant parser would be
> desirable.
> We could definitely look into using an existing library for the parsing,
> but it would still need to be integrated with the way that Flink's
> InputFormats work. For instance, your approach isn't standard-compliant
> either, because TextInputFormat is not aware of quotes and would break
> records with quoted record delimiters (FLINK-6016 [1]).
>
> I would be OK with having a less efficient format which is not based on the
> current implementation but which is standard-compliant.
> IMO that would be a very useful contribution.
>
> Best, Fabian
>
> [1] https://issues.apache.org/jira/browse/FLINK-6016
>
> 2017-03-10 11:28 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
> > Hi to all,
> > I want to discuss with the dev group something about CSV parsing.
> > Since I started using Flink with CSVs, I have always faced some little
> > problems here and there, and the new tickets about CSV parsing seem to
> > confirm that this part is still problematic.
> > In my production jobs I gave up using Flink's CSV parsing in favour of
> > Apache commons-csv and it works great. It's perfectly configurable and
> > robust. A working example is available at [1].
> >
> > Thus, why not use that library directly and contribute back (if needed)
> > to another Apache library if improvements are required to speed up the
> > parsing? Have you ever tried to compare the performance of the two
> > parsers?
> >
> > Best,
> > Flavio
> >
> > [1] https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
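
P.S. For anyone following along, below is a minimal, self-contained sketch of
the behaviour being discussed (a quoted field containing the record delimiter,
i.e. the FLINK-6016 case) using Apache commons-csv. The class name and sample
input are made up for illustration and are not taken from the linked
Csv2RowExample; treat it as a sketch of the approach, not the actual code.

import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class QuotedDelimiterSketch {

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input: the "comment" field of the first record
        // contains a quoted line break, i.e. the case that breaks a
        // line-oriented reader that is not aware of quotes.
        String input = "id,comment\r\n"
                + "1,\"first line\r\nsecond line\"\r\n"
                + "2,plain\r\n";

        // RFC 4180 rules, treating the first record as the header row.
        CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();

        try (CSVParser parser = format.parse(new StringReader(input))) {
            for (CSVRecord record : parser) {
                // The quoted CRLF stays inside the "comment" field, so this
                // loop sees exactly two records, not three.
                System.out.println(record.get("id") + " -> " + record.get("comment"));
            }
        }
    }
}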