Hello,

I am doing performance testing with Spark Streaming. I want to know whether the
throughput numbers I am seeing are reasonable given the size of my cluster and
Spark's performance characteristics.

My job has the following processing steps (a simplified code sketch follows the
list):

1.      Read 600-byte JSON strings from a 7-broker / 48-partition Kafka cluster
via the Kafka Direct API

2.      Parse the JSON with play-json or lift-json (no significant performance 
difference)

3.      Read one integer value out of the JSON

4.      Compute the average of this integer value across all records in the 
batch with DoubleRDD.mean

5.      Write the average for the batch back to a different Kafka topic
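
For reference, here is a simplified sketch of what the job looks like. The
broker addresses, topic names, and the JSON field name "value" are
placeholders, this version uses play-json, and the producer that writes the
result back to Kafka is omitted:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import play.api.libs.json.Json

object ThroughputTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaAverageTest")
    // I have tried 2, 4, and 10 second batch intervals
    val ssc = new StreamingContext(conf, Seconds(4))

    // placeholder broker list and input topic
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("input-topic")

    // Kafka Direct API: one RDD partition per Kafka partition (48 here)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder](ssc, kafkaParams, topics)

    // parse each ~600-byte JSON string and pull out the single integer field
    val values = messages.map { case (_, json) =>
      (Json.parse(json) \ "value").as[Int].toDouble
    }

    values.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // this is where the bulk of the "Executor Computing Time" shows up
        val avg = rdd.mean()
        // write avg back out to a different Kafka topic (producer setup omitted)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}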

I have tried 2, 4, and 10 second batch intervals. The best throughput I can 
sustain is about 75,000 records/second for the whole cluster.

The Spark cluster is in a VM environment with 3 VMs. Each VM has 32 GB of RAM
and 16 cores, and the systems are networked with 10 GbE NICs. I started testing
with Spark 1.3.1 and switched to Spark 1.5 to see if there was any improvement
(nothing significant). When I look at the event timeline in the Web UI, I see
that the majority of the processing time for each batch is "Executor Computing
Time" in the foreachRDD that computes the average, not in the transform that
does the JSON parsing.

CPU utilization hovers around 40% across the cluster, and there is plenty of
free RAM as well. The network comes nowhere close to being saturated.
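(At 600 bytes per record, 75,000 records/second is only about 45 MB/s, or
roughly 360 Mbit/s, so a 10 GbE link should not be the limiting factor.)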

My colleague implementing similar functionality in Storm is able to exceed a 
quarter million records per second with the same hardware.

Is 75K records/second reasonable for a cluster of this size? What kind of
performance would you expect for this job?


Thanks,

-- Matthew
