Hello All,

Considering the following streaming dataflow of the example WordCount, I
want to understand how Sink is parallelised.


Source --> flatMap --> groupBy(), sum() --> Sink

If I set the paralellism at runtime using -p, as shown here
https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots

I want to understand how Sink is done parallelly and how the global result
is distributed.

As far as I understood groupBy(0) is applied to the tuples<String, Integer>
emitted from the flatMap funtion, which groupes by the String value and
sum(1) aggregates the Integer value getting the count.

That means streams will be redistributed so that tuples grouped by the same
String value be sent to one taskmanager and the Sink step should be writing
the results to the specified path. When Sink step is also parallelised then
each taskmanager should emit a chunk. These chunks put together must be the
global result.

But when I see the pictorial representation it seems that each task slot
will run a copy of the streaming dataflow and will be performing the
operations on the chunk of data it gets and outputs the result. But if this
is the case the global result would have duplicates of strings and would be
wrong.

Could one of you kindly clarify what exactly happens?

Kind Regards,
Ravinder Kaur

Reply via email to