Hi Márton,

Thanks for the reply! I suppose I have to implement `createInputSplits` too
then. I tried looking at the documentation for the InputFormat interface,
but I can't see how to force it to read separate files on separate
TaskManagers instead of one file on the JobManager. Where is this
behavior decided? Or am I misunderstanding something about how this all
works?
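
(For what it's worth, here is a toy model of how I currently understand the
locality decision: the InputSplitAssigner that an InputFormat hands back is
what matches splits to hosts, roughly the way Flink's
`LocatableInputSplitAssigner` matches a split's hostname hints against the
requesting TaskManager. This is just a self-contained sketch of that
matching logic, not actual Flink code, and the host names are made up:)

```java
import java.util.*;

// Toy model of locality-aware split assignment: each split carries
// hostname hints, and the assigner hands a requesting host a local
// split when one exists, falling back to any remaining split otherwise.
public class SplitAssignmentDemo {
    static class Split {
        final int id;
        final String[] hosts; // locality hints, e.g. where the file lives
        Split(int id, String... hosts) { this.id = id; this.hosts = hosts; }
    }

    static class Assigner {
        private final List<Split> unassigned;
        Assigner(List<Split> splits) { this.unassigned = new ArrayList<>(splits); }

        // Called whenever a TaskManager on `requestingHost` asks for work.
        Split nextSplit(String requestingHost) {
            for (Iterator<Split> it = unassigned.iterator(); it.hasNext(); ) {
                Split s = it.next();
                if (Arrays.asList(s.hosts).contains(requestingHost)) {
                    it.remove();
                    return s; // a split local to this host
                }
            }
            // no local split left: hand out any remaining one (remote read)
            return unassigned.isEmpty() ? null : unassigned.remove(0);
        }
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
                new Split(0, "worker1"), new Split(1, "worker2"));
        Assigner assigner = new Assigner(splits);
        // each host is handed the split whose hints name it
        System.out.println(assigner.nextSplit("worker2").id); // 1
        System.out.println(assigner.nextSplit("worker1").id); // 0
    }
}
```

If that is roughly right, then the place to control which TaskManager reads
which file would be the hostname hints on the splits plus the assigner
returned by the InputFormat.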

Cheers,
Daniel

On Sun, Jun 14, 2015 at 7:02 PM, Márton Balassi <balassi.mar...@gmail.com>
wrote:

> Hi Dani,
>
> The batch API does not expose an addSource-like method, but you can always
> write your own InputFormat and pass it directly to the constructor of the
> DataSource. DataSource extends DataSet, so you will get all the usual
> methods in the end. For an example, you can have a look here. [1]
>
> [1]
> https://github.com/dataArtisans/flink-dataflow/blob/master/src/main/java/com/dataartisans/flink/dataflow/translation/FlinkTransformTranslators.java#L133
>
> Best,
>
> Marton
>
> On Sun, Jun 14, 2015 at 4:34 PM, Dániel Bali <balijanosdan...@gmail.com>
> wrote:
>
>> Hello!
>>
>> We are running an experiment on a cluster and we have a large input split
>> into multiple files. We'd like to run a Flink job that reads the local file
>> on each instance and processes that. Is there a way to do this in the batch
>> environment? `readTextFile` wants to read the file on the JobManager and
>> split that right there, which is not what we want.
>>
>> We solved it in the streaming environment by using `addSource`, but there
>> is no similar function in the batch version. Does anybody know how this
>> could be done?
>>
>> Thanks!
>> Daniel
>>
>
>
