Slow Shuffle Operation on Empty Batch

Erwan ALLAIN Mon, 26 Sep 2016 14:11:01 -0700

Hi

I'm working with:
- Kafka 0.8.2
- Spark Streaming 2.0 (direct input stream)
- Cassandra 3.0


My batch interval is 1s.

When I use only map, filter, or even saveToCassandra, the processing
time is around 50 ms on empty batches.
 => This is fine.

As soon as I add reduceByKey, the processing time jumps to
between 3 and 4 s for 3 calls to reduceByKey on empty batches.
=> Not good.

I've found a workaround: calling foreachRDD on the DStream and checking
whether the RDD is empty before executing reduceByKey, but I find this quite ugly.
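
For reference, a minimal sketch of that workaround, written in Python against the PySpark RDD API (the post does not say which language the job uses, so that choice is an assumption). The helper name `reduce_if_non_empty` and the names `pairs` and `handle_batch` in the usage comment are hypothetical; only `isEmpty()`, `reduceByKey()` and `foreachRDD()` are real Spark calls.

```python
def reduce_if_non_empty(rdd, func):
    """Apply reduceByKey only when the batch actually contains data.

    `rdd` is expected to be a key/value RDD (or anything exposing
    isEmpty() and reduceByKey()); `func` merges two values of one key.
    """
    if rdd.isEmpty():
        # Empty batch: hand the RDD back untouched so no shuffle is scheduled.
        return rdd
    return rdd.reduceByKey(func)


# Hypothetical usage inside a streaming job, where `pairs` is a DStream
# of (key, value) tuples and `handle_batch` is whatever consumes the result:
#
#   pairs.foreachRDD(
#       lambda rdd: handle_batch(reduce_if_non_empty(rdd, lambda a, b: a + b))
#   )
```

Note that `isEmpty()` may itself launch a small job to look for a first element, so the check is cheap but not free.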

Do I need to check whether the RDD is empty before every shuffle operation?

Thanks for any insight.
