Hello, I am doing performance testing with Spark Streaming. I want to know whether the throughput numbers I am seeing are reasonable given the size of my cluster and Spark's performance characteristics.
My job has the following processing steps (a stripped-down sketch of the job is appended at the end of this message):

1. Read 600-byte JSON strings from a 7-broker / 48-partition Kafka cluster via the Kafka Direct API
2. Parse the JSON with play-json or lift-json (no significant performance difference between the two)
3. Read one integer value out of the JSON
4. Compute the average of this integer value across all records in the batch with DoubleRDD.mean
5. Write the average for the batch back to a different Kafka topic

I have tried 2, 4, and 10 second batch intervals. The best throughput I can sustain is about 75,000 records/second for the whole cluster.

The Spark cluster runs in a VM environment with 3 VMs. Each VM has 32 GB of RAM and 16 cores, and the systems are networked with 10 GbE NICs. I started testing with Spark 1.3.1 and switched to Spark 1.5 to see if there was any improvement (nothing significant).

When I look at the event timeline in the WebUI, I see that the majority of the processing time for each batch is "Executor Computing Time" in the foreachRDD that computes the average, not in the transform that does the JSON parsing. CPU utilization hovers around 40% across the cluster, RAM has plenty of free space remaining, and the network comes nowhere close to being saturated.

A colleague implementing similar functionality in Storm is able to exceed a quarter million records per second on the same hardware.

Is 75K records/second reasonable for a cluster of this size? What kind of performance would you expect for this job?

Thanks,
-- Matthew
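
P.S. Here is roughly what the job looks like, stripped down to the essentials. Broker addresses, topic names, and the JSON field name are placeholders, and the Kafka producer used to write the per-batch average back out is omitted; the real job differs in those details but follows this shape.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import play.api.libs.json.Json

object AverageJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaAverage")
    // Batch intervals of 2, 4, and 10 seconds were tried; 2s shown here.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Placeholder broker list; the real cluster has 7 brokers / 48 partitions.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("input-topic") // placeholder topic name

    // Step 1: read the 600-byte JSON strings via the Kafka Direct API.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Steps 2-3: parse the JSON and pull out the integer field
    // ("value" is a placeholder field name).
    val values = stream.map { case (_, json) =>
      (Json.parse(json) \ "value").as[Int].toDouble
    }

    // Steps 4-5: compute the per-batch mean and write it back to another topic.
    values.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val avg = rdd.mean() // DoubleRDDFunctions.mean via implicit conversion
        // Kafka producer setup omitted; the average is sent to "output-topic" here.
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}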