Mentioning 100TB "in my context" is more like "saving current state" at some point of time to "backup" or "direct access" storage and continue with next 100TB/hours/days of streamed data. So - no, it is not about a finite data set.
On Mon, May 23, 2016 at 11:13 AM, Matthias J. Sax <mj...@apache.org> wrote: > Are you talking about a streaming or a batch job? > > You are mentioning a "text stream" but also say you want to stream 100TB > -- indicating you have a finite data set using DataSet API. > > -Matthias > > On 05/22/2016 09:50 PM, Xtra Coder wrote: > > Hello, > > > > Question from newbie about how Flink's WordCount will actually work at > > scale. > > > > I've read/seen rather many high-level presentations and do not see > > more-or-less clear answers for following … > > > > Use-case: > > -------------- > > there is huuuge text stream with very variable set of words – let's say > > 1BLN of unique words. Storing them just as raw text, without > > supplementary data, will take roughly 16TB of RAM. How Flink is > > approaching this internally. > > > > Here I'm more interested in following: > > 1. How individual words are spread in cluster of Flink nodes? > > Will each word appear exactly in one node and will be counted there or > > ... I'm not sure about the variants > > > > 2. As far as I understand – while job is running all its intermediate > > aggregation results are stored in-memory across cluster nodes (which may > > be partially written to local drive). > > Wild guess - what size of cluster is required to run above mentioned > > tasks efficiently? > > > > And two functional question on top of this ... > > > > 1. Since intermediate results are in memory – I guess it should be > > possible to get “current” counter for any word being processed. > > Is this possible? > > > > 2. After I've streamed 100TB of text – what will be the right way to > > save result to HDFS. For example I want to save list of words ordered by > > key with portions of 10mln per file compressed with bzip2. > > What APIs I should use? > > Since Flink uses intermediate snapshots for falt-tolerance - is it > > possible to save whole "current" state without stopping the stream? > > > > Thanks. > >