Hi, you answered your question correctly yourself: groupBy() creates an internal topic to repartition the data on the words, so all occurrences of the same word end up in the same partition. I cannot add much more to explain how it works.
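To make the mechanism concrete, here is a minimal, self-contained sketch in plain Java (no Kafka dependencies). It simulates the internal repartition topic: words are "written" to a partition chosen by hashing the word, mimicking Kafka's default behavior of sending equal keys to the same partition. The class and method names are made up for illustration; only the partitioning-by-key idea reflects what Kafka Streams actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RepartitionSketch {
    // Mimics the default partitioner: the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        // Input topic: sentences may land in arbitrary partitions.
        List<String> sentences = List.of("hello world", "goodbye world");

        // Step 1: split into words and "write" each word to the internal
        // repartition topic, partitioned by the word itself.
        List<List<String>> repartitionTopic = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            repartitionTopic.add(new ArrayList<>());
        }
        for (String sentence : sentences) {
            for (String word : sentence.split(" ")) {
                repartitionTopic.get(partitionFor(word, numPartitions)).add(word);
            }
        }

        // Step 2: each consumer counts only its own partition. Because every
        // occurrence of a word sits in exactly one partition, the per-partition
        // results have disjoint key sets, and no cross-consumer merge is needed.
        Map<String, Long> combined = new TreeMap<>();
        for (List<String> partition : repartitionTopic) {
            Map<String, Long> localCounts = new HashMap<>();
            for (String word : partition) {
                localCounts.merge(word, 1L, Long::sum);
            }
            combined.putAll(localCounts); // disjoint keys: putAll never overwrites
        }
        System.out.println(combined); // {goodbye=1, hello=1, world=2}
    }
}
```

The `combined` map here exists only to display the result in one place; in the real system the counts simply live in each consumer's local state store, already complete for the keys it owns.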
You might want to have a look here for more details about Kafka Streams in general: http://docs.confluent.io/3.0.0/streams/index.html

-Matthias

On 07/21/2016 04:16 PM, Michael-Keith Bernard wrote:
> Hello Kafka Users,
>
> (I apologize if this landed in your inbox twice; I sent it yesterday
> afternoon, but it never showed up in the archive, so I'm sending it again
> just in case.)
>
> I've been floating this question around the #apache-kafka IRC channel on
> Freenode for the last week or two, and I still haven't reached a
> satisfying answer. The basic question is: how does Kafka Streams merge
> partial results? Let me expand on that a bit.
>
> Consider the following word count example in the official Kafka Streams
> repo (GitHub mirror):
> https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46
>
> Now suppose we run this topology against a Kafka topic with 2 partitions.
> Into the first partition, we insert the sentence "hello world". Into the
> second partition, we insert the sentence "goodbye world". We know a
> priori that the result of this computation is something like:
>
> { hello: 1, goodbye: 1, world: 2 } # made-up syntax for a compacted
> log/KTable state
>
> And indeed we'd probably see precisely that result from a *single*
> consumer process that sees *all* the data. However, my question is: what
> if I have 1 consumer per topic partition (2 consumers total in the above
> hypothetical)? Under that scenario, consumer 1 would emit
> { hello: 1, world: 1 } and consumer 2 would emit { goodbye: 1, world: 1 }...
> But the true answer requires an additional reduction of duplicate keys
> (in this case with a sum operator, but that needn't be the case for
> arbitrary aggregations).
>
> So again my question is: how are the partial results that each consumer
> generates merged into a final result?
> A simple way to accomplish this would be to produce an intermediate
> topic that is keyed by the word and then aggregate that (since each
> consumer would see all the data for a given key), but if that's
> happening, it's not explicit anywhere in the example. So what mechanism
> is Kafka Streams using internally to aggregate the results of a
> partitioned stream across multiple consumers? (Perhaps groupByKey
> creating an anonymous intermediate topic?)
>
> I know that's a bit wordy, but I want to make sure my question is
> extremely clear. If I've still fumbled on that, let me know and I will
> try to be even more explicit. :)
>
> Cheers,
> Michael-Keith Bernard
>
> P.S. Kafka is awesome and Kafka Streams looks even awesome-er!