Hello, a newbie question about how Flink's WordCount will actually work at scale.
I've read and watched quite a few high-level presentations but haven't found clear answers to the following.

Use case:
--------------
There is a huge text stream with a very variable set of words, say 1 billion unique words. Storing them just as raw text, with no supplementary data, would already take roughly 16 GB of RAM (assuming ~16 bytes per word). How does Flink approach this internally? Here I'm mostly interested in the following:

1. How are individual words spread across the nodes of a Flink cluster? Will each word land on exactly one node and be counted there, or does it work some other way? I'm not sure about the variants. (A minimal sketch of the job I have in mind is at the bottom of this mail.)

2. As far as I understand, while the job is running all of its intermediate aggregation results are kept in memory across the cluster nodes (and may be partially written to local disk). Wild guess: what size of cluster would be required to run the task above efficiently?

And two functional questions on top of this:

1. Since the intermediate results are in memory, I guess it should be possible to get the "current" counter for any word being processed. Is this possible?

2. After I've streamed 100 TB of text, what is the right way to save the result to HDFS? For example, I want to save the list of words ordered by key, in portions of 10 million per file, compressed with bzip2. Which APIs should I use? And since Flink uses intermediate snapshots for fault tolerance, is it possible to save the whole "current" state without stopping the stream?

Thanks.
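P.S. For reference, here is a minimal sketch of the kind of job I mean (standard DataStream API; the socket source and class names are just placeholders, and the comments mark where I'm guessing rather than stating how Flink works):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)  // placeholder source; the real one is a 100 TB text stream
           .flatMap(new Tokenizer())
           // keyBy hash-partitions the stream by the word, so (I assume) every
           // occurrence of a given word is routed to the same parallel instance;
           // this is what my question 1 above is about
           .keyBy(0)
           // running count per word, kept in Flink's keyed state; these are the
           // "intermediate results" my memory/query/save questions refer to
           .sum(1)
           .print();  // placeholder sink; the second functional question is what to use here for HDFS instead

        env.execute("Streaming WordCount");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}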