Hi, you answered your question correctly yourself: groupBy() creates an internal topic to repartition the data on the words, so all occurrences of the same word end up in the same partition. I cannot add much more to explain how it works.
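To make the mechanism concrete, here is a minimal, self-contained sketch in plain Java (no Kafka dependencies). It simulates the internal repartition topic: words are "written" to a partition chosen by hashing the word, mimicking Kafka's default behavior of sending equal keys to the same partition. The class and method names are made up for illustration; only the partitioning-by-key idea reflects what Kafka Streams actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RepartitionSketch {
    // Mimics the default partitioner: the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        // Input topic: sentences may land in arbitrary partitions.
        List<String> sentences = List.of("hello world", "goodbye world");

        // Step 1: split into words and "write" each word to the internal
        // repartition topic, partitioned by the word itself.
        List<List<String>> repartitionTopic = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            repartitionTopic.add(new ArrayList<>());
        }
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                repartitionTopic.get(partitionFor(word, numPartitions)).add(word);
            }
        }

        // Step 2: each consumer counts only its own partition. Because every
        // occurrence of a word sits in exactly one partition, the per-partition
        // results have disjoint key sets, and no cross-consumer merge is needed.
        Map<String, Long> combined = new TreeMap<>();
        for (List<String> partition : repartitionTopic) {
            Map<String, Long> localCounts = new HashMap<>();
            for (String word : partition) {
                localCounts.merge(word, 1L, Long::sum);
            }
            combined.putAll(localCounts); // disjoint keys: putAll never overwrites
        }
        System.out.println(combined); // {goodbye=1, hello=1, world=2}
    }
}
```

The `combined` map here exists only to display the result in one place; in the real system the counts simply live in each consumer's local state store, already complete for the keys it owns.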
You might want to have a look here for more details about Kafka Streams in general: http://docs.confluent.io/3.0.0/streams/index.html

-Matthias

On 07/21/2016 04:16 PM, Michael-Keith Bernard wrote:
> Hello Kafka Users,
>
> (I apologize if this landed in your inbox twice; I sent it yesterday
> afternoon, but it never showed up in the archive, so I'm sending it again
> just in case.)
>
> I've been floating this question around the #apache-kafka IRC channel on
> Freenode for the last week or two, and I still haven't reached a
> satisfying answer. The basic question is: how does Kafka Streams merge
> partial results? Let me expand on that a bit.
>
> Consider the following word count example in the official Kafka Streams
> repo (GitHub mirror):
> https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46
>
> Now suppose we run this topology against a Kafka topic with 2 partitions.
> Into the first partition, we insert the sentence "hello world". Into the
> second partition, we insert the sentence "goodbye world". We know a
> priori that the result of this computation is something like:
>
> { hello: 1, goodbye: 1, world: 2 } # made-up syntax for a compacted
> log/KTable state
>
> And indeed we'd probably see precisely that result from a *single*
> consumer process that sees *all* the data. However, my question is: what
> if I have 1 consumer per topic partition (2 consumers total in the above
> hypothetical)? Under that scenario, consumer 1 would emit
> { hello: 1, world: 1 } and consumer 2 would emit { goodbye: 1, world: 1 }...
> But the true answer requires an additional reduction of duplicate keys
> (in this case with a sum operator, but that needn't be the case for
> arbitrary aggregations).
>
> So again my question is: how are the partial results that each consumer
> generates merged into a final result?
> A simple way to accomplish this would be to produce an intermediate
> topic that is keyed by the word and then aggregate that (since each
> consumer would see all the data for a given key), but if that's
> happening, it's not explicit anywhere in the example. So what mechanism
> is Kafka Streams using internally to aggregate the results of a
> partitioned stream across multiple consumers? (Perhaps groupByKey
> creating an anonymous intermediate topic?)
>
> I know that's a bit wordy, but I want to make sure my question is
> extremely clear. If I've still fumbled on that, let me know and I will
> try to be even more explicit. :)
>
> Cheers,
> Michael-Keith Bernard
>
> P.S. Kafka is awesome and Kafka Streams looks even awesome-er!