Re: Does flink support groupByKey([numTasks])

Fabian Hueske Sun, 13 Sep 2015 00:58:26 -0700

Hi Liang,

Martin gave a very good answer.


I'd like to add a few words regarding the different stream processing
models of Spark and Flink. Spark Streaming is based on mini-batches and
each mini-batch is processed like a batch program. Hence a Spark Streaming
program is implemented like a batch program.

In contrast, Flink pipelines data streams and processes them an infinite
sequence of data. Because data streams are infinite, you cannot simply
group their data (a group would never be closed). This issue is addressed
by using windows which discretize a stream by time or count.

Although Spark's mini-batches can be interpreted as a global time windows,
it can be tricky to implement exactly the same semantics in Flink because
the internal implementations of Spark and Flink differ.

Best, Fabian

2015-09-12 12:33 GMT+02:00 Martin Liesenberg <martin.liesenb...@gmail.com>:

> Hi,
>
> as far as I can tell there is no direct equivalent, which is probably due
> to the underlying execution models.
>
> I think the desired behaviour can be expressed by something along the lines
> of:
> stream.groupBy(0).window(Count.of(<size>))
> where:
> stream is a DataStream<Tuple2<K, V>> and <size> would be the batch size of
> your SparkStreaming job.
>
> The window can also be expressed in terms of time which would look
> something like this: .window(Time.of(<time>, <time_unit>))
>
> You can find slides on the streaming API at [1] and there is a number of
> examples at [2]
>
> best regards,
> martin
>
> [1]
> http://dataartisans.github.io/flink-training/dataStreamBasics/intro.html
> [2]
>
> https://github.com/dataArtisans/flink-training-exercises/tree/master/src/main/java/com/dataArtisans/flinkTraining/exercises/dataStreamJava
>
>
> Liang Chen <chenliang...@huawei.com> schrieb am Sa., 12. Sep. 2015 um
> 05:53 Uhr:
>
> > Hi
> >
> > Now i am considering migrate Sparkstreaming case to Flink for comparing
> > performance.
> >
> > Does flink support groupByKey([numTasks]) ,When called on a dataset of
> (K,
> > V) pairs, returns a dataset of (K, Iterable<V>) pairs.
> > If it is not exist,  how to use groupBy() to implement the same function?
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Does-flink-support-groupByKey-numTasks-tp7973.html
> > Sent from the Apache Flink Mailing List archive. mailing list archive at
> > Nabble.com.
> >
>

Re: Does flink support groupByKey([numTasks])

Reply via email to