In Kafka Streams, data is partitioned according to the keys of the key-value records, and operations such as countByKey operate on these stream partitions. When reading data from Kafka, these stream partitions map to the partitions of the Kafka input topic(s), but this mapping may change once you add processing operations.
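To make this concrete, here is a minimal word-count sketch in the Java DSL, loosely following the WordCount example that ships with Kafka (0.10.0-era API; the topic names "TextLinesTopic" and "WordCounts" are placeholders):

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class WordCountSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KStreamBuilder builder = new KStreamBuilder();

        // Stream partitions initially map 1:1 to the input topic's partitions.
        KStream<String, String> textLines = builder.stream("TextLinesTopic");

        KTable<String, Long> wordCounts = textLines
            // Split each line of text into words.
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            // Select the key to count by. Because this changes the key,
            // the data may need to be shuffled (re-partitioned) over the network.
            .map((key, word) -> new KeyValue<>(word, word))
            // Count per stream partition, similar to Hadoop's sort-and-reduce phase.
            .countByKey("Counts");

        // Write the continuously updated counts to an output topic.
        wordCounts.to(Serdes.String(), Serdes.Long(), "WordCounts");

        new KafkaStreams(builder, props).start();
    }
}

Note that if the incoming records were already keyed by word, the map() step (and hence the network shuffle) would be unnecessary.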
To your question: The first step, if the data isn't already keyed as needed, is to select the key you want to count by, which results in one or more output stream partitions. Here, data may get shuffled across the network (but it won't be if there's no need to, e.g. when the data is already keyed as needed). Then the count operation is performed for each stream partition, which is similar to the sort-and-reduce phase in Hadoop; the sketch above shows how these two steps look in the DSL.

On Mon, Aug 29, 2016 at 5:31 PM, Tommy Go <deeplam...@gmail.com> wrote:

> Hi,
>
> For the "word count" example in Hadoop, there are shuffle, sort, and
> reduce phases that handle the outputs from different mappers. How does
> it work in KStream?
>