Kafka Streams: Merging of partial results

Michael-Keith Bernard Thu, 21 Jul 2016 07:24:38 -0700

Hello Kafka Users,

(I apologize if this landed in your inbox twice, I sent it yesterday afternoon 
but it never showed up in the archive so I'm sending again just in case.)

I've been floating this question around the #apache-kafka IRC channel on
Freenode for the last week or two and I still haven't reached a satisfying
answer. The basic question is: How does Kafka Steams merge partial results? So
let me expand on that a bit...

Consider the following word count example in the official Kafka Streams repo
(Github mirror):
https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46

Now suppose we run this topology against a Kafka topic with 2 partitions. Into
the first partition, we insert the sentence "hello world". Into the second
partition, we insert the sentence "goodbye world". We know a priori that the
result of this computation is something like:

{ hello: 1, goodbye: 1, world: 2 } # made up syntax for a compacted log/KTable
state

And indeed we'd probably see precisely that result from a *single* consumer
process that sees *all* the data. However, my question is, what if I have 1
consumer per topic partition (2 consumers total in the above hypothetical)?
Under that scenario, consumer 1 would emit { hello: 1, world: 1 } and consumer
2 would emit { goodbye: 1, world: 1 }... But the true answer requires and
additional reduction of duplicate keys (in this case with a sum operator, but
that needn't be the case for arbitrary aggregations).

So again my question is, how are the partial results that each consumer
generates merged into a final result? A simple way to accomplish this would be
to produce an intermediate topic that is keyed by the word, then aggregate that
(since each consumer would see all the data for a given key), but if that's
happening it's not explicit anywhere in the example. So what mechanism is Kafka
Streams using internally to aggregate the results of a partitioned stream
across multiple consumers? (Perhaps groupByKey creating an anonymous
intermediate topic?)

I know that's a bit wordy, but I want to make sure my question is extremely
clear. If I've still fumbled on that, let me know and I will try to be even
more explicit. :)

Cheers,
Michael-Keith Bernard

P.S. Kafka is awesome and Kafka Streams look even awesome-er!

Kafka Streams: Merging of partial results

Reply via email to