Re: Reading separate files in parallel tasks as input

Márton Balassi Sun, 14 Jun 2015 10:04:17 -0700

Hi Dani,

The batch API does not expose an addSource-like method, but you can always
write your own InputFormat and pass it directly to the constructor of
DataSource. DataSource extends DataSet, so you will get all the usual
methods in the end. For an example, have a look at [1].
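
To make the "write your own InputFormat" idea concrete, here is a minimal
plain-Java sketch of the reading half of such a format. It deliberately does
not depend on Flink: RecordReaderSketch, LocalLineFormat and
LocalFileFormatDemo are hypothetical names, and the interface only mirrors
the open / reachedEnd / nextRecord / close shape of Flink's InputFormat. A
real format would also have to create input splits, and which split lands on
which machine depends on the input split assigner, so treat the "one local
file per instance" behaviour as something to verify on your cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT Flink's interface: it only mirrors the
// open/reachedEnd/nextRecord/close lifecycle of an input format.
interface RecordReaderSketch<T> {
    void open() throws IOException;
    boolean reachedEnd() throws IOException;
    T nextRecord() throws IOException;
    void close() throws IOException;
}

// Reads lines from a path on the local file system of whichever machine
// calls open(). The path itself is an assumption supplied by the caller.
class LocalLineFormat implements RecordReaderSketch<String> {
    private final String localPath;
    private BufferedReader reader;
    private String lookahead; // next line to hand out, null once exhausted

    LocalLineFormat(String localPath) {
        this.localPath = localPath;
    }

    @Override
    public void open() throws IOException {
        reader = Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8);
        lookahead = reader.readLine();
    }

    @Override
    public boolean reachedEnd() {
        return lookahead == null;
    }

    @Override
    public String nextRecord() throws IOException {
        String current = lookahead;
        lookahead = reader.readLine();
        return current;
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}

class LocalFileFormatDemo {
    // Drives the lifecycle the way a parallel task would: open, drain, close.
    static <T> List<T> readAll(RecordReaderSketch<T> format) throws IOException {
        List<T> records = new ArrayList<>();
        format.open();
        try {
            while (!format.reachedEnd()) {
                records.add(format.nextRecord());
            }
        } finally {
            format.close();
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LocalFileFormatDemo <local-file>");
            return;
        }
        for (String line : readAll(new LocalLineFormat(args[0]))) {
            System.out.println(line);
        }
    }
}
```

In an actual job the same reading logic would sit inside a class implementing
Flink's InputFormat, and the resulting format would be handed to the
DataSource as described above.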


[1]
https://github.com/dataArtisans/flink-dataflow/blob/master/src/main/java/com/dataartisans/flink/dataflow/translation/FlinkTransformTranslators.java#L133

Best,

Marton

On Sun, Jun 14, 2015 at 4:34 PM, Dániel Bali <balijanosdan...@gmail.com>
wrote:

> Hello!
>
> We are running an experiment on a cluster and we have a large input split
> into multiple files. We'd like to run a Flink job that reads the local file
> on each instance and processes that. Is there a way to do this in the batch
> environment? `readTextFile` wants to read the file on the JobManager and
> split that right there, which is not what we want.
>
> We solved it in the streaming environment by using `addSource`, but there
> is no similar function in the batch version. Does anybody know how this
> could be done?
>
> Thanks!
> Daniel
>
