Some relevant information I left out:
I'm running a transformation with no shuffles, and at the end I'm
performing a lookup of 4 partitions on the driver.
On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several
gigabytes of data). The RDD is split into 2048 partitions, which
are more or less equal and entirely cached in RAM.
I evaluated the performance on several cluster sizes, and am
witnessing a nonlinear (power-law) performance improvement as the
cluster size increases (plot below). Each node has 4 cores and each
worker is configured to use 10 GB of RAM.
[Plot: Spark performance]
I would expect a more linear response given the number of partitions
and the fact that all of the data is cached.
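As a rough baseline for what a linear response would look like, here
is a minimal Python sketch (not part of the measurements above). It
assumes one task per core, equal task durations, and no scheduling or
driver overhead. Only the partition count and cores per node come from
this message; the cluster sizes in the loop are hypothetical.

```python
import math

PARTITIONS = 2048     # partition count stated in the message
CORES_PER_NODE = 4    # cores per node stated in the message


def task_waves(nodes, partitions=PARTITIONS, cores_per_node=CORES_PER_NODE):
    """Sequential rounds of tasks needed to touch every partition once,
    assuming each core runs exactly one task at a time."""
    slots = nodes * cores_per_node
    return math.ceil(partitions / slots)


def ideal_speedup(nodes, baseline_nodes=1):
    """Speedup relative to the baseline cluster if run time were
    proportional to the number of task waves (an idealization)."""
    return task_waves(baseline_nodes) / task_waves(nodes)


if __name__ == "__main__":
    # Hypothetical cluster sizes, chosen only for illustration.
    for nodes in (1, 2, 4, 8, 16):
        print(nodes, task_waves(nodes), ideal_speedup(nodes))
```

Under these assumptions the wave count halves each time the node count
doubles, so measured run times can be compared against this curve to
see where the real behavior departs from the ideal.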
Can anyone suggest what I should tweak in order to improve the
performance?
Or perhaps provide an explanation for the behavior I'm witnessing?
Yadid