Hello Kafka Users, (I apologize if this landed in your inbox twice, I sent it yesterday afternoon but it never showed up in the archive so I'm sending again just in case.)
I've been floating this question around the #apache-kafka IRC channel on Freenode for the last week or two and I still haven't reached a satisfying answer. The basic question is: How does Kafka Steams merge partial results? So let me expand on that a bit... Consider the following word count example in the official Kafka Streams repo (Github mirror): https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46 Now suppose we run this topology against a Kafka topic with 2 partitions. Into the first partition, we insert the sentence "hello world". Into the second partition, we insert the sentence "goodbye world". We know a priori that the result of this computation is something like: { hello: 1, goodbye: 1, world: 2 } # made up syntax for a compacted log/KTable state And indeed we'd probably see precisely that result from a *single* consumer process that sees *all* the data. However, my question is, what if I have 1 consumer per topic partition (2 consumers total in the above hypothetical)? Under that scenario, consumer 1 would emit { hello: 1, world: 1 } and consumer 2 would emit { goodbye: 1, world: 1 }... But the true answer requires and additional reduction of duplicate keys (in this case with a sum operator, but that needn't be the case for arbitrary aggregations). So again my question is, how are the partial results that each consumer generates merged into a final result? A simple way to accomplish this would be to produce an intermediate topic that is keyed by the word, then aggregate that (since each consumer would see all the data for a given key), but if that's happening it's not explicit anywhere in the example. So what mechanism is Kafka Streams using internally to aggregate the results of a partitioned stream across multiple consumers? (Perhaps groupByKey creating an anonymous intermediate topic?) I know that's a bit wordy, but I want to make sure my question is extremely clear. If I've still fumbled on that, let me know and I will try to be even more explicit. :) Cheers, Michael-Keith Bernard P.S. Kafka is awesome and Kafka Streams look even awesome-er!