How about this? https://groups.google.com/forum/#!topic/spark-users/ntPQUZFJt4M

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Sat, Jun 28, 2014 at 10:19 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Hi,
>
> I have a number of questions using the Kafka receiver of Spark
> Streaming. Maybe someone has some more experience with that and can
> help me out.
>
> I have set up an environment for getting to know Spark, consisting of
> - a Mesos cluster with 3 only-slaves and 3 master-and-slaves,
> - 2 Kafka nodes,
> - 3 Zookeeper nodes providing service to both Kafka and Mesos.
>
> My Kafka cluster has only one topic with one partition (replicated to
> both nodes). When I start my Kafka receiver, it successfully connects
> to Kafka and does the processing, but it seems as if the (expensive)
> function in the final foreachRDD(...) is only executed on one node of
> my cluster, which is not what I had in mind when setting up the
> cluster ;-)
>
> So first, I was wondering about the parameter `topics: Map[String,
> Int]` to KafkaUtils.createStream(). Apparently it controls how many
> connections are made from my cluster nodes to Kafka. The Kafka doc at
> https://kafka.apache.org/documentation.html#introduction says "each
> message published to a topic is delivered to one consumer instance
> within each subscribing consumer group" and "If all the consumer
> instances have the same consumer group, then this works just like a
> traditional queue balancing load over the consumers."
>
> The Kafka docs *also* say: "Note however that there cannot be more
> consumer instances than partitions." This seems to imply that with
> only one partition, increasing the number in my Map should have no
> effect.
>
> However, if I increase the number of streams for my one topic in my
> `topics` Map, I actually *do* see that the task in my foreachRDD(...)
> call is now executed on multiple nodes. Maybe it's more of a Kafka
> question than a Spark one, but can anyone explain this to me? Should I
> always have more Kafka partitions than Mesos cluster nodes?
>
> So, assuming that changing the number in that Map is not what I want
> (although I don't know if it is), I tried to use
> .repartition(numOfClusterNodes) (which doesn't seem right if I want to
> add and remove Mesos nodes on demand). This *also* did spread the
> foreachRDD(...) action evenly – however, the function never seems to
> terminate, so I never get to process the next interval in the stream.
> A similar behavior can be observed when running locally, not on the
> cluster, then the program will not exit but instead hang after
> everything else has shut down. Any hints concerning this issue?
>
> Thanks
> Tobias
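
For what it's worth, here is a minimal sketch of the `.repartition(...)` route described above, not a definitive implementation. It assumes the Spark 1.x `spark-streaming-kafka` artifact is on the classpath. The ZooKeeper quorum string, group id, topic name, batch interval and `expensiveFunction` are hypothetical placeholders. As I read the Spark Streaming Kafka integration guide, the Int in the `topics` Map sets the number of consumer threads inside the single receiver rather than the parallelism of later processing, so the sketch keeps it at 1 and spreads the work with `repartition` instead.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaRepartitionSketch {

  // Hypothetical stand-in for the expensive per-message work.
  def expensiveFunction(msg: String): Unit = println(msg.length)

  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit / the environment.
    val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // batch interval is an assumption

    val zkQuorum = "zk1:2181,zk2:2181,zk3:2181" // hypothetical ZooKeeper hosts
    val groupId  = "spark-test-group"           // hypothetical consumer group
    val topics   = Map("mytopic" -> 1)          // 1 consumer thread for the 1-partition topic

    // 3 only-slaves + 3 master-and-slaves, as in the setup described above.
    val numOfClusterNodes = 6

    // Receiver-based stream of (key, message) pairs.
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)

    // Redistribute each batch so the expensive work is not tied to the
    // node that hosts the receiver.
    val spread = messages.map(_._2).repartition(numOfClusterNodes)

    spread.foreachRDD { rdd =>
      // Runs on the executors, one task per partition.
      rdd.foreachPartition(records => records.foreach(expensiveFunction))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

One thing that may be worth checking for the hang: each receiver permanently occupies one core, so the application needs more cores than it has receivers (for example `local[2]` or more when running locally). Otherwise no core is left to process the received batches. Whether that is the cause here is only a guess.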