Re: Flink's WordCount at scale of 1BLN of unique words

Xtra Coder Tue, 24 May 2016 15:15:07 -0700

Mentioning 100TB "in my context" is more like "saving current state" at
some point of time to "backup" or "direct access" storage and continue with
next 100TB/hours/days of streamed data.
So - no, it is not about a finite data set.


On Mon, May 23, 2016 at 11:13 AM, Matthias J. Sax <mj...@apache.org> wrote:

> Are you talking about a streaming or a batch job?
>
> You are mentioning a "text stream" but also say you want to stream 100TB
> -- indicating you have a finite data set using DataSet API.
>
> -Matthias
>
> On 05/22/2016 09:50 PM, Xtra Coder wrote:
> > Hello,
> >
> > Question from newbie about how Flink's WordCount will actually work at
> > scale.
> >
> > I've read/seen rather many high-level presentations and do not see
> > more-or-less clear answers for following …
> >
> > Use-case:
> > --------------
> > there is huuuge text stream with very variable set of words – let's say
> > 1BLN of unique words. Storing them just as raw text, without
> > supplementary data, will take roughly 16TB of RAM. How Flink is
> > approaching this internally.
> >
> > Here I'm more interested in following:
> > 1.  How individual words are spread in cluster of Flink nodes?
> > Will each word appear exactly in one node and will be counted there or
> > ... I'm not sure about the variants
> >
> > 2.  As far as I understand – while job is running all its intermediate
> > aggregation results are stored in-memory across cluster nodes (which may
> > be partially written to local drive).
> > Wild guess - what size of cluster is required to run above mentioned
> > tasks efficiently?
> >
> > And two functional question on top of this  ...
> >
> > 1. Since intermediate results are in memory – I guess it should be
> > possible to get “current” counter for any word being processed.
> > Is this possible?
> >
> > 2. After I've streamed 100TB of text – what will be the right way to
> > save result to HDFS. For example I want to save list of words ordered by
> > key with portions of 10mln per file compressed with bzip2.
> > What APIs I should use?
> > Since Flink uses intermediate snapshots for falt-tolerance - is it
> > possible to save whole "current" state without stopping the stream?
> >
> > Thanks.
>
>

Re: Flink's WordCount at scale of 1BLN of unique words

Reply via email to