Hello, a newbie question about how Flink's WordCount will actually work at scale.
I've read and watched quite a few high-level presentations but haven't found clear answers to the following.

Use case:
--------------
There is a huge text stream with a very variable set of words, say 1 billion unique words. Storing them just as raw text, with no supplementary data, would already take roughly 16 GB of RAM (assuming ~16 bytes per word). How does Flink approach this internally? Here I'm mostly interested in the following:

1. How are individual words spread across the nodes of a Flink cluster? Will each word land on exactly one node and be counted there, or does it work some other way? I'm not sure about the variants. (A minimal sketch of the job I have in mind is at the bottom of this mail.)

2. As far as I understand, while the job is running all of its intermediate aggregation results are kept in memory across the cluster nodes (and may be partially written to local disk). Wild guess: what size of cluster would be required to run the task above efficiently?

And two functional questions on top of this:

1. Since the intermediate results are in memory, I guess it should be possible to get the "current" counter for any word being processed. Is this possible?

2. After I've streamed 100 TB of text, what is the right way to save the result to HDFS? For example, I want to save the list of words ordered by key, in portions of 10 million per file, compressed with bzip2. Which APIs should I use? And since Flink uses intermediate snapshots for fault tolerance, is it possible to save the whole "current" state without stopping the stream?

Thanks.
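P.S. For reference, here is a minimal sketch of the kind of job I mean (standard DataStream API; the socket source and class names are just placeholders, and the comments mark where I'm guessing rather than stating how Flink works):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)  // placeholder source; the real one is a 100 TB text stream
           .flatMap(new Tokenizer())
           // keyBy hash-partitions the stream by the word, so (I assume) every
           // occurrence of a given word is routed to the same parallel instance;
           // this is what my question 1 above is about
           .keyBy(0)
           // running count per word, kept in Flink's keyed state; these are the
           // "intermediate results" my memory/query/save questions refer to
           .sum(1)
           .print();  // placeholder sink; the second functional question is what to use here for HDFS instead

        env.execute("Streaming WordCount");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}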