Hi Till (and others).
Thank you very much for your helpful answer.
On 23.02.2016 14:20, Till Rohrmann wrote:
[...] In contrast, if you had a parallel data source which would
consist of multiple source task, then these tasks would be independent
and spread out across your cluster [...]
Can you please send me a link to an example or to the respective Flink
API doc, where I can see which is a parallel data source and how to
create it with multiple source tasks?
A simple Google search did not provide me with an answer (maybe I used
the wrong key words, though...).
Cheers
Tim
On 23.02.2016 14:20, Till Rohrmann wrote:
Hi Tim,
depending on how you create the |DataSource<String> fileList|, Flink
will schedule the downstream operators differently. If you used the
|ExecutionEnvironment.fromCollection| method, then it will create a
|DataSource| with a |CollectionInputFormat|. This kind of |DataSource|
will only be executed with a degree of parallelism of 1. The source
will send it’s collection elements in a round robin fashion to the
downstream operators which are executed with a higher parallelism. So
when Flink schedules the downstream operators, it will try to place
them close to their inputs. Since all flat map operators have the
single data source task as an input, they will be deployed on the same
machine if possible.
In contrast, if you had a parallel data source which would consist of
multiple source task, then these tasks would be independent and spread
out across your cluster. In this case, every flat map task would have
a single distinct source task as input. When the flat map tasks are
deployed they would be deployed on the machine where their
corresponding source is running. Since the source tasks are spread out
across the cluster, the flat map tasks would be spread out as well.
What you could do to mitigate your problem is to start the cluster
with as many slots as your maximum degree of parallelism is. That way,
you’ll utilize all cluster resources.
I hope this clarifies a bit why you observe that tasks tend to cluster
on a single machine.
Cheers,
Till