Hi All,

I was running the following test:

*Setup*
- 9 VMs running Spark workers, with 1 Spark executor each.
- 1 VM running Kafka and the Spark master.
- Spark version is 1.6.0; Kafka version is 0.9.0.1.
- Spark is using its own (standalone) resource manager and is not running over YARN.

*Test*
I created a Kafka topic with 3 partitions. Next, I used "KafkaUtils.createDirectStream" to get a DStream:

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(…);
    JavaDStream stream1 = stream.map(func1);
    stream1.print();

where func1 just sleeps and then returns the value.

*Observation*
The RDD partition corresponding to partition 1 of Kafka was processed on one of the Spark executors. Once that processing finished, the RDD partitions corresponding to the remaining two Kafka partitions were processed in parallel on different Spark executors.

I expected all three RDD partitions to be processed in parallel, since there were Spark executors available that were lying idle.

I re-ran the test after increasing the number of partitions of the Kafka topic to 5. This time, too, the RDD partition corresponding to partition 1 of Kafka was processed on one Spark executor first; once processing finished for that RDD partition, the RDD partitions corresponding to the remaining four Kafka partitions were processed in parallel on different Spark executors.

I am not clear on why Spark waits for operations on the first RDD partition to finish, while it could process the remaining partitions in parallel. Am I missing any configuration? Any help is appreciated.

Thanks,
Mukul
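For reference, here is a minimal sketch of the kind of driver the test describes, using the Spark 1.6 direct-stream API. This is an assumption-laden reconstruction, not the original code: the broker address, topic name, batch interval, and the sleep duration inside func1 are all hypothetical placeholders, and it assumes spark-core, spark-streaming, and spark-streaming-kafka 1.6.0 on the classpath plus a reachable Kafka broker.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Hedged sketch of the test described above (Spark 1.6.0, Kafka 0.9.0.1).
// Broker address, topic name, batch interval and sleep time are placeholders.
public class DirectStreamTest {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("direct-stream-test");
    JavaStreamingContext jssc =
        new JavaStreamingContext(conf, Durations.seconds(10)); // placeholder batch interval

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "kafka-host:9092"); // placeholder broker

    // Direct stream: one RDD partition per Kafka partition.
    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, Collections.singleton("test-topic")); // placeholder topic

    // func1: sleep for a while, then return the record value unchanged.
    JavaDStream<String> stream1 = stream.map(tuple -> {
      Thread.sleep(5000); // placeholder sleep
      return tuple._2();
    });
    stream1.print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```

Launched with spark-submit against the standalone master, each batch should produce one map task per Kafka partition, which is where the per-partition scheduling behavior described above can be observed.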
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-RDD-partitions-not-processed-in-parallel-tp26457.html Sent from the Apache Spark User List mailing list archive at Nabble.com.