Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes
of data). The RDD is split into 2048 roughly equal partitions and is
entirely cached in RAM.
I evaluated the performance on several cluster sizes and am seeing a
non-linear (power-law) performance improvement as the cluster size
increases (plot below). Each node has 4 cores, and each worker is
configured to use 10 GB of RAM.
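
For reference, my setup looks roughly like the following (the input
path and app name are placeholders, and the memory setting mirrors the
worker configuration above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachedRddBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CachedRddBenchmark")
      .set("spark.executor.memory", "10g") // each worker uses 10 GB of RAM

    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///path/to/data") // placeholder input path
      .repartition(2048)                           // 2048 roughly equal partitions
      .persist(StorageLevel.MEMORY_ONLY)           // keep the whole RDD in RAM

    data.count() // force materialization so the cache is fully populated

    // ... analysis job whose runtime is measured per cluster size ...

    sc.stop()
  }
}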
[Plot: Spark performance vs. cluster size]
I would expect closer-to-linear scaling, given the number of partitions
and the fact that all of the data is cached.
Can anyone suggest what I should tweak in order to improve performance?
Or perhaps offer an explanation for the behavior I'm witnessing?
Yadid