Re: Best way to process data in many files? (FLINK-BATCH)

Tim Conrad Tue, 23 Feb 2016 06:45:03 -0800

Hi Till (and others).

Thank you very much for your helpful answer.


On 23.02.2016 14:20, Till Rohrmann wrote:

[...] In contrast, if you had a parallel data source which would consist of multiple source tasks, then these tasks would be independent and spread out across your cluster [...]

Could you please point me to an example, or to the relevant Flink API documentation, that shows which data sources are parallel and how to create one with multiple source tasks?

A simple Google search did not turn up an answer (maybe I used the wrong keywords, though...).



Cheers
Tim




On 23.02.2016 14:20, Till Rohrmann wrote:

Hi Tim,
depending on how you create the |DataSource<String> fileList|, Flink will schedule the downstream operators differently. If you used the |ExecutionEnvironment.fromCollection| method, then it will create a |DataSource| with a |CollectionInputFormat|. This kind of |DataSource| will only be executed with a degree of parallelism of 1. The source will send its collection elements in a round-robin fashion to the downstream operators, which are executed with a higher parallelism. When Flink schedules the downstream operators, it will try to place them close to their inputs. Since all flat map operators have the single data source task as an input, they will be deployed on the same machine if possible.
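To make the round-robin hand-off concrete, here is a minimal plain-Java sketch. It does not use Flink at all; the class |RoundRobinDemo|, the method |distribute| and the file names are made up for illustration, and the code only mimics how a single source task could deal its elements out to several downstream tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: NOT Flink code. It imitates a single (parallelism 1) source
// handing its elements to downstream tasks in round-robin order.
public class RoundRobinDemo {

    // Element i goes to downstream task (i % parallelism).
    public static <T> List<List<T>> distribute(List<T> elements, int parallelism) {
        List<List<T>> tasks = new ArrayList<>();
        for (int t = 0; t < parallelism; t++) {
            tasks.add(new ArrayList<>());
        }
        for (int i = 0; i < elements.size(); i++) {
            tasks.get(i % parallelism).add(elements.get(i));
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Hypothetical file names standing in for the fileList collection.
        List<String> fileList = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        List<List<String>> perTask = distribute(fileList, 2);
        for (int t = 0; t < perTask.size(); t++) {
            System.out.println("flat map task " + t + " receives " + perTask.get(t));
        }
    }
}
```

With two downstream tasks, task 0 receives a.txt, c.txt and e.txt, and task 1 receives b.txt and d.txt. The work is shared evenly, but every element still originates from the one source task.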
In contrast, if you had a parallel data source which would consist of multiple source tasks, then these tasks would be independent and spread out across your cluster. In this case, every flat map task would have a single distinct source task as input. When the flat map tasks are deployed, they would be deployed on the machine where their corresponding source is running. Since the source tasks are spread out across the cluster, the flat map tasks would be spread out as well.
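As far as I know, sources in the DataSet API such as |ExecutionEnvironment.readTextFile| (which reads input splits in parallel) and |ExecutionEnvironment.fromParallelCollection| (which takes a |SplittableIterator|) can run with a parallelism greater than 1, unlike |fromCollection|. The effect on placement can be sketched with a small plain-Java toy model. It is not Flink's scheduler: the class |PlacementDemo| and both methods are hypothetical, sources are simply assumed to be spread over machines in order, and slot limits are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Flink's scheduler. It assumes each downstream task is
// placed on the machine of the source task it reads from, and ignores slots.
public class PlacementDemo {

    // Assumption: source task i is placed on machine (i % machines).
    public static List<Integer> placeSources(int sourceParallelism, int machines) {
        List<Integer> placement = new ArrayList<>();
        for (int i = 0; i < sourceParallelism; i++) {
            placement.add(i % machines);
        }
        return placement;
    }

    // Downstream task i reads from source (i % number of sources) and is
    // placed on that source's machine.
    public static List<Integer> placeDownstream(List<Integer> sourceMachines,
                                                int downstreamParallelism) {
        List<Integer> placement = new ArrayList<>();
        for (int i = 0; i < downstreamParallelism; i++) {
            placement.add(sourceMachines.get(i % sourceMachines.size()));
        }
        return placement;
    }

    public static void main(String[] args) {
        int machines = 3;
        int flatMapParallelism = 4;
        // One source task: every flat map task follows it to machine 0.
        System.out.println("single source:   "
                + placeDownstream(placeSources(1, machines), flatMapParallelism));
        // Four source tasks: the flat map tasks spread out with them.
        System.out.println("parallel source: "
                + placeDownstream(placeSources(4, machines), flatMapParallelism));
    }
}
```

In this model the single-source case puts all four flat map tasks on machine 0, while the parallel-source case yields machines 0, 1, 2 and 0.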
What you could do to mitigate your problem is to start the cluster with as many slots as your maximum degree of parallelism. That way, you’ll utilize all cluster resources.
I hope this clarifies a bit why you observe that tasks tend to cluster on a single machine.
Cheers,
Till
