Hello everyone,

I've read the Kafka Streams docs available at
http://docs.confluent.io/2.1.0-alpha1/streams/index.html. Since I'm coming
from the world of Spark, Dataflow and friends, I couldn't help having some
mind-bending questions about how Kafka Streams handles its parallelism. In
Spark, when using the word-count example with the typical map(/* split by
word */).reduceByKey(), I know that every transformation returns an RDD and
will therefore be executed in parallel by a number of pre-configured
workers. However, when I was reading the docs for Kafka Streams and
stumbled upon this very same example in
https://github.com/confluentinc/examples/blob/master/kafka-streams/src/main/java/io/confluent/examples/streams/WordCountLambdaExample.java,
I couldn't see how it could be parallelized by multiple workers (different
instances or threads running the same code).
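Just to be clear about what I expect the semantics to be, here is a plain-Java sketch (not actual Spark code; the class and method names are mine) of what map-then-reduceByKey computes: every occurrence of a word, no matter which line it came from, contributes to a single global count per key.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java analogue of map(split by word).reduceByKey(sum):
// all occurrences of a word are folded into ONE global count,
// which is what I expect word count to produce.
public class GlobalWordCount {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // "kafka" appears in both lines, so it must be counted as 2.
        System.out.println(count(List.of("Hello this is Kafka", "Kafka is great!")));
    }
}
```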

Here are my main concerns:

 1 - If Kafka Streams routes messages to different partitions according to
the key of the message, how would that work in this example, where each
message contains a line with multiple words?

 2 - The docs say that instances or threads running the same Kafka Streams
application do not share any state between them. How could I parallelize
the word-count example given that two different workers could end up
holding different counts for the same key? This question is obviously
related to question 1. Suppose we have two instances running against
exactly two partitions (one instance per partition), and there are three
messages in the source topic: "Hello this is Kafka" && "Kafka is great!"
&& "Hello this is Luis". The first and third messages go to the first
instance, while the second message goes to the second instance. After
every message is processed, and taking into account that the sink topic
receives KStreams and not KTables, that topic would contain the following
messages (I may be wrong):

 - Hello -> 1 || this -> 1 || is -> 1 || kafka -> 1 || kafka -> 1 (the
second instance doesn't share state, so our counts are WRONG) || is -> 1
(again, same problem) || etc etc
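To make the concern concrete, here is a plain-Java simulation of that scenario (no Kafka involved; the class name and the per-instance assignment are just my assumptions from the example above). Each "instance" keeps its own private store, so "kafka" and "is" each get counted separately on both instances instead of globally.

```java
import java.util.*;

// Simulates two Kafka Streams instances, each with a PRIVATE state
// store, consuming whole lines that were partitioned by line (not by
// word). Words appearing on both instances get two partial counts,
// neither of which is the true global count.
public class SplitStateSim {
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> store = new HashMap<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\W+"))
                if (!word.isEmpty())
                    store.merge(word, 1, Integer::sum);
        return store;
    }

    public static void main(String[] args) {
        // Messages 1 and 3 land on instance 1, message 2 on instance 2.
        Map<String, Integer> instance1 =
                countWords(List.of("Hello this is Kafka", "Hello this is Luis"));
        Map<String, Integer> instance2 =
                countWords(List.of("Kafka is great!"));
        // Both stores hold kafka -> 1, instead of one global kafka -> 2.
        System.out.println("instance1: " + instance1);
        System.out.println("instance2: " + instance2);
    }
}
```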

One solution would be to split the current example into two parts. First,
I have a Kafka Streams application with 1 partition and 1 instance (this
stage can be parallelized, since it is stateless), receiving the messages
corresponding to lines as before, applying the map function that splits
lines into words, and writing them immediately to a topic called X.
Afterwards, this topic is read by N instances of a second Kafka Streams
application whose source is topic X with its N partitions. This second app
receives messages of the form (word, word), and since routing is done by
key, each instance receives a subset of the key's domain and the counts
will be computed correctly.
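The two-stage idea can be simulated in plain Java as well (again, no Kafka; the hashing scheme below is just an illustration of key-based routing, not what Kafka's partitioner actually uses). Stage one splits lines into per-word records; stage two routes each record to one of N instances by hashing the key, so every occurrence of a given word lands on the same instance and its count comes out right.

```java
import java.util.*;

// Sketch of the proposed two-stage pipeline: lines are split into
// (word, word) records, then each record is routed to one of n
// "instances" by hashing its key. Every occurrence of a word lands
// on the same instance, so each per-instance count is globally correct.
public class TwoStageSim {
    static List<Map<String, Integer>> run(List<String> lines, int n) {
        List<Map<String, Integer>> stores = new ArrayList<>();
        for (int i = 0; i < n; i++) stores.add(new HashMap<>());
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\W+"))
                if (!word.isEmpty()) {
                    // Key-based routing: same word -> same instance.
                    int partition = Math.floorMod(word.hashCode(), n);
                    stores.get(partition).merge(word, 1, Integer::sum);
                }
        return stores;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> stores = run(
                List.of("Hello this is Kafka", "Kafka is great!", "Hello this is Luis"), 2);
        // Each word lives in exactly one store, e.g. "kafka" is counted
        // as 2 on whichever instance it landed on.
        System.out.println(stores);
    }
}
```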

Am I missing something? I'm not saying the solution is bad (on the
contrary), but after seeing the examples I noticed I couldn't simply start
more instances and expect the results to stay correct; in some scenarios
it takes multiple apps to build more complex applications. If so, I think
the examples should carry a note clearly stating that each one is only a
quick start and is not ready to be parallelized while still producing
correct results.

I know I just wrote a huge question and I hope everyone fully understands
the issues I'm currently having after reading the docs. In any case, Kafka
Streams is an amazing library and I'm sure I'll be using it in the future :)

Thanks for your time!
