Hello Kafka Users,

(I apologize if this landed in your inbox twice, I sent it yesterday afternoon 
but it never showed up in the archive so I'm sending again just in case.)

I've been floating this question around the #apache-kafka IRC channel on 
Freenode for the last week or two and I still haven't reached a satisfying 
answer. The basic question is: How does Kafka Steams merge partial results? So 
let me expand on that a bit...

Consider the following word count example in the official Kafka Streams repo 
(Github mirror): 
https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46

Now suppose we run this topology against a Kafka topic with 2 partitions. Into 
the first partition, we insert the sentence "hello world". Into the second 
partition, we insert the sentence "goodbye world". We know a priori that the 
result of this computation is something like:

{ hello: 1, goodbye: 1, world: 2 } # made up syntax for a compacted log/KTable 
state

And indeed we'd probably see precisely that result from a *single* consumer 
process that sees *all* the data. However, my question is, what if I have 1 
consumer per topic partition (2 consumers total in the above hypothetical)? 
Under that scenario, consumer 1 would emit { hello: 1, world: 1 } and consumer 
2 would emit { goodbye: 1, world: 1 }... But the true answer requires and 
additional reduction of duplicate keys (in this case with a sum operator, but 
that needn't be the case for arbitrary aggregations).

So again my question is, how are the partial results that each consumer 
generates merged into a final result? A simple way to accomplish this would be 
to produce an intermediate topic that is keyed by the word, then aggregate that 
(since each consumer would see all the data for a given key), but if that's 
happening it's not explicit anywhere in the example. So what mechanism is Kafka 
Streams using internally to aggregate the results of a partitioned stream 
across multiple consumers? (Perhaps groupByKey creating an anonymous 
intermediate topic?)

I know that's a bit wordy, but I want to make sure my question is extremely 
clear. If I've still fumbled on that, let me know and I will try to be even 
more explicit. :)

Cheers,
Michael-Keith Bernard

P.S. Kafka is awesome and Kafka Streams look even awesome-er!

Reply via email to