Hi Till (and others).

Thank you very much for your helpful answer.

On 23.02.2016 14:20, Till Rohrmann wrote:
[...] In contrast, if you had a parallel data source which would consist of multiple source task, then these tasks would be independent and spread out across your cluster [...]

Can you please send me a link to an example or to the respective Flink API doc, where I can see which is a parallel data source and how to create it with multiple source tasks?

A simple Google search did not provide me with an answer (maybe I used the wrong key words, though...).


Cheers
Tim




On 23.02.2016 14:20, Till Rohrmann wrote:

Hi Tim,

depending on how you create the |DataSource<String> fileList|, Flink will schedule the downstream operators differently. If you used the |ExecutionEnvironment.fromCollection| method, then it will create a |DataSource| with a |CollectionInputFormat|. This kind of |DataSource| will only be executed with a degree of parallelism of 1. The source will send it’s collection elements in a round robin fashion to the downstream operators which are executed with a higher parallelism. So when Flink schedules the downstream operators, it will try to place them close to their inputs. Since all flat map operators have the single data source task as an input, they will be deployed on the same machine if possible.

In contrast, if you had a parallel data source which would consist of multiple source task, then these tasks would be independent and spread out across your cluster. In this case, every flat map task would have a single distinct source task as input. When the flat map tasks are deployed they would be deployed on the machine where their corresponding source is running. Since the source tasks are spread out across the cluster, the flat map tasks would be spread out as well.

What you could do to mitigate your problem is to start the cluster with as many slots as your maximum degree of parallelism is. That way, you’ll utilize all cluster resources.

I hope this clarifies a bit why you observe that tasks tend to cluster on a single machine.

Cheers,
Till

​


Reply via email to