The coarsest level at which you can parallelize is the topic. Topics are unrelated to each other, so they can be consumed independently. But you can also parallelize within a single topic.
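For example, something like this with the 1.2 receiver-based API (the topic names, group IDs, and ZooKeeper quorum below are made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Made-up app name, topics, group IDs, and ZooKeeper quorum.
val ssc = new StreamingContext(new SparkConf().setAppName("per-topic-streams"), Seconds(10))
val zkQuorum = "zk1:2181"

// One receiver per topic, each in its own consumer group, so the two topics
// are pulled and processed independently of each other.
val orders = KafkaUtils.createStream(ssc, zkQuorum, "orders-group", Map("orders" -> 1))
val clicks = KafkaUtils.createStream(ssc, zkQuorum, "clicks-group", Map("clicks" -> 1))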
A Kafka group ID defines a consumer group. Each message published to a topic is delivered to exactly one consumer in each group subscribed to that topic. Topics can also be split into partitions. So if a topic has N partitions, you can run N consumers in one group and each will effectively be assigned one partition. Yes, my understanding is that multiple receivers in one group are the way to consume a topic's partitions in parallel. There's a rough sketch of that pattern below the quoted message.

On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet <cjno...@gmail.com> wrote:

> Looking at [1], it seems to recommend pulling from multiple Kafka topics in
> order to parallelize data received from Kafka over multiple nodes. I notice
> in [2], however, that one of the createStream() functions takes a groupId.
> So am I understanding correctly that creating multiple DStreams with the
> same groupId will allow data to be partitioned across many nodes on a single
> topic?
>
> [1]
> http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
> [2]
> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
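As promised above, a rough sketch of the receiver-per-partition pattern, again with made-up names (topic, group ID, ZooKeeper quorum) and assuming the 1.2 receiver-based KafkaUtils.createStream API:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SingleTopicParallelReceive {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("single-topic-parallel"), Seconds(10))

    // Made-up values: a topic with 4 partitions and one shared group ID.
    val zkQuorum      = "zk1:2181,zk2:2181"
    val groupId       = "my-consumer-group"
    val topic         = "my-topic"
    val numPartitions = 4

    // One receiver (DStream) per partition, all with the same group ID,
    // so Kafka hands each receiver a share of the topic's partitions.
    val kafkaStreams = (1 to numPartitions).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> 1))
    }

    // Union the per-receiver streams back into one DStream for processing.
    val unified = ssc.union(kafkaStreams)
    unified.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The union step is just so the per-receiver streams can be processed as a single DStream downstream; each receiver still runs as its own long-running task on a worker and pulls its own share of the partitions.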