Hi All,

I was running the following test:

*Setup*
- 9 VMs running Spark workers, with 1 Spark executor each.
- 1 VM running Kafka and the Spark master.
- Spark version is 1.6.0; Kafka version is 0.9.0.1.
- Spark is using its own (standalone) resource manager and is not running over YARN.

*Test*
I created a Kafka topic with 3 partitions. Next, I used "KafkaUtils.createDirectStream" to get a DStream:

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(…);
    JavaDStream stream1 = stream.map(func1);
    stream1.print();

where func1 just sleeps and then returns the value.

*Observation*
The RDD partition corresponding to partition 1 of Kafka was processed on one of the Spark executors. Once that processing finished, the RDD partitions corresponding to the remaining two Kafka partitions were processed in parallel on different Spark executors.

I expected all three RDD partitions to be processed in parallel, since there were Spark executors available that were lying idle.

I re-ran the test after increasing the number of partitions of the Kafka topic to 5. This time, too, the RDD partition corresponding to partition 1 of Kafka was processed on one Spark executor first; once processing finished for that RDD partition, the RDD partitions corresponding to the remaining four Kafka partitions were processed in parallel on different Spark executors.

I am not clear on why Spark waits for operations on the first RDD partition to finish, while it could process the remaining partitions in parallel. Am I missing any configuration? Any help is appreciated.

Thanks,
Mukul
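For reference, here is a minimal sketch of the kind of driver the test describes, using the Spark 1.6 direct-stream API. This is an assumption-laden reconstruction, not the original code: the broker address, topic name, batch interval, and the sleep duration inside func1 are all hypothetical placeholders, and it assumes spark-core, spark-streaming, and spark-streaming-kafka 1.6.0 on the classpath plus a reachable Kafka broker.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Hedged sketch of the test described above (Spark 1.6.0, Kafka 0.9.0.1).
// Broker address, topic name, batch interval and sleep time are placeholders.
public class DirectStreamTest {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("direct-stream-test");
    JavaStreamingContext jssc =
        new JavaStreamingContext(conf, Durations.seconds(10)); // placeholder batch interval

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "kafka-host:9092"); // placeholder broker

    // Direct stream: one RDD partition per Kafka partition.
    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, Collections.singleton("test-topic")); // placeholder topic

    // func1: sleep for a while, then return the record value unchanged.
    JavaDStream<String> stream1 = stream.map(tuple -> {
      Thread.sleep(5000); // placeholder sleep
      return tuple._2();
    });
    stream1.print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```

Launched with spark-submit against the standalone master, each batch should produce one map task per Kafka partition, which is where the per-partition scheduling behavior described above can be observed.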
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-streaming-RDD-partitions-not-processed-in-parallel-tp26457.html Sent from the Apache Spark User List mailing list archive at Nabble.com.