In Kafka Streams, data is partitioned according to the keys of the key-value records, and operations such as countByKey operate on these stream partitions. When reading data from Kafka, these stream partitions map to the partitions of the Kafka input topic(s), but this mapping may change once you add processing operations.
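To make this concrete, here is a minimal word-count sketch in the Java DSL, loosely following the WordCount example that ships with Kafka (0.10.0-era API; the topic names "TextLinesTopic" and "WordCounts" are placeholders):

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class WordCountSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KStreamBuilder builder = new KStreamBuilder();

        // Stream partitions initially map 1:1 to the input topic's partitions.
        KStream<String, String> textLines = builder.stream("TextLinesTopic");

        KTable<String, Long> wordCounts = textLines
            // Split each line of text into words.
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            // Select the key to count by. Because this changes the key,
            // the data may need to be shuffled (re-partitioned) over the network.
            .map((key, word) -> new KeyValue<>(word, word))
            // Count per stream partition, similar to Hadoop's sort-and-reduce phase.
            .countByKey("Counts");

        // Write the continuously updated counts to an output topic.
        wordCounts.to(Serdes.String(), Serdes.Long(), "WordCounts");

        new KafkaStreams(builder, props).start();
    }
}

Note that if the incoming records were already keyed by word, the map() step (and hence the network shuffle) would be unnecessary.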
To your question: The first step, if the data isn't already keyed as needed, is to select the key you want to count by, which results in one or more output stream partitions. Here, data may get shuffled across the network (but it won't be if there's no need to, e.g. when the data is already keyed as needed). Then the count operation is performed for each stream partition, which is similar to the sort-and-reduce phase in Hadoop; the sketch above shows how these two steps look in the DSL.

On Mon, Aug 29, 2016 at 5:31 PM, Tommy Go <deeplam...@gmail.com> wrote:

> Hi,
>
> For the "word count" example in Hadoop, there are shuffle, sort, and
> reduce phases that handle the outputs from different mappers. How does
> it work in KStream?
>