Hi All,

I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes). The RDD is split into 2048 roughly equal partitions and is entirely cached in RAM. I evaluated performance on several cluster sizes and am seeing a non-linear (power-law) improvement as the cluster grows (plot below). Each node has 4 cores, and each worker is configured with 10 GB of RAM.
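For reference, here is a minimal sketch of the kind of job I'm benchmarking. The input path and the count() action are placeholders; my real job does more work per record, but the caching and partitioning setup is the same:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheScalingBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-scaling-benchmark")
    val sc = new SparkContext(conf)

    // Load the data set, repartition into 2048 roughly equal partitions,
    // and pin it entirely in RAM, matching the setup described above.
    val data = sc.textFile("hdfs:///path/to/dataset")  // placeholder path
      .repartition(2048)
      .persist(StorageLevel.MEMORY_ONLY)

    // Materialize the cache first, so the timed run measures
    // computation over cached data rather than the initial load.
    data.count()

    // Time a simple action over the fully cached RDD.
    val start = System.nanoTime()
    data.count()
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"count over cached RDD took $elapsedMs ms")

    sc.stop()
  }
}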

[Plot: Spark job performance vs. cluster size, showing sub-linear (power-law) scaling]

Given the number of partitions and the fact that all of the data is cached, I would expect scaling much closer to linear.
Can anyone suggest what I should tweak to improve performance?
Or perhaps explain the behavior I'm seeing?

Yadid
