Hi, the easiest approach is to read the CSV files line-wise as regular text files (ExecutionEnvironment.readTextFile()) and apply custom parse logic in a MapFunction. Then you have all the freedom to deal with records of different schemas.
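For illustration, here is a minimal sketch of that idea in Scala. The record types, field layouts, and file path below are hypothetical; the point is that one parse function can dispatch lines from CSV files with different schemas (sharing only a timestamp column) into different record types:

```scala
// Hypothetical record types for two different CSV schemas that share a
// leading timestamp column.
sealed trait Record { def timestamp: Long }
case class SensorRecord(timestamp: Long, sensorId: String, value: Double) extends Record
case class EventRecord(timestamp: Long, eventName: String) extends Record
case class Unparsed(timestamp: Long, raw: String) extends Record

// Custom parse logic: inspect each line and decide which schema it belongs to.
// In a Flink job this would be the body of a MapFunction applied to the
// text-file source, e.g. (assuming a path of your own):
//   env.readTextFile("data/*.csv").map(parseLine _)
def parseLine(line: String): Record = {
  val fields = line.split(",").map(_.trim)
  fields match {
    // Three fields with a numeric last column -> sensor schema
    case Array(ts, id, v) if v.matches("""-?\d+(\.\d+)?""") =>
      SensorRecord(ts.toLong, id, v.toDouble)
    // Two fields -> event schema
    case Array(ts, name) =>
      EventRecord(ts.toLong, name)
    // Anything else is kept as raw text instead of failing the job
    case _ =>
      Unparsed(0L, line)
  }
}
```

The sealed trait lets downstream operators pattern-match on the concrete record type, so a single DataSet[Record] can carry rows from all the differently-shaped files.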
Best, Fabian

2018-05-08 12:35 GMT+02:00 Esa Heikkinen <esa.heikki...@student.tut.fi>:

> Hi
>
> At this moment a batch query is ok.
>
> Do you know any good (Scala) examples of how to query batches (different
> types of csv-files) in parallel?
>
> Or do you have an example of a custom source function that reads
> csv-files in parallel?
>
> Best, Esa
>
> *From:* Fabian Hueske <fhue...@gmail.com>
> *Sent:* Monday, May 7, 2018 3:48 PM
> *To:* Esa Heikkinen <esa.heikki...@student.tut.fi>
> *Cc:* user@flink.apache.org
> *Subject:* Re: Reading csv-files in parallel
>
> Hi Esa,
>
> you can certainly read CSV files in parallel. This works very well in a
> batch query.
>
> For streaming queries that expect data to be ingested in timestamp order,
> this is much more challenging, because you need to 1) read the files in
> the right order and 2) avoid splitting files (unless you guarantee that
> splits are read in the right order).
>
> The CsvTableSource does not guarantee to read files in timestamp order
> (it would have to know the timestamps in each file for that).
>
> Having files with different schemas is another problem. The SQL / Table
> API requires a fixed schema per table (source).
>
> The only recommendation when reading files in parallel for a streaming
> use case is to implement a custom source function and to be careful when
> generating watermarks.
>
> Best, Fabian
>
> 2018-05-07 12:44 GMT+02:00 Esa Heikkinen <esa.heikki...@student.tut.fi>:
>
> Hi
>
> I would like to read many different types of csv-files (time series data)
> in parallel using CsvTableSource. Is that possible in a Flink
> application? If yes, do there exist any examples of that?
>
> If not, do you have any advice on how to do it?
>
> Should I combine all csv-files into one csv-file in a pre-processing
> phase? But this has a little problem, because the files are not of the
> same type (the columns are different, except for the timestamp column).
>
> Best, Esa