I've noticed this as well and am curious whether anyone can say more about it.
My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then splitting that across 50 nodes means you'll have a ton of tiny partitions all finishing very quickly, and thus creating a lot of communication overhead per amount of data processed. Just a theory, though.

On Wednesday, October 7, 2015, Yadid Ayzenberg <ya...@media.mit.edu> wrote:

> Additional missing relevant information:
>
> I'm running a transformation, there are no shuffles occurring, and at the
> end I'm performing a lookup of 4 partitions on the driver.
>
> On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
>
> Hi All,
>
> I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes
> of data). The RDD is partitioned into 2048 partitions which are more or
> less equal and entirely cached in RAM.
> I evaluated the performance on several cluster sizes, and am witnessing a
> non-linear (power-law) performance improvement as the cluster size
> increases (plot below). Each node has 4 cores and each worker is
> configured to use 10 GB of RAM.
>
> [image: Spark performance plot]
>
> I would expect a more linear response given the number of partitions and
> the fact that all of the data is cached.
> Can anyone suggest what I should tweak in order to improve the
> performance? Or perhaps provide an explanation for the behavior I'm
> witnessing?
>
> Yadid
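One way to test that theory would be to sweep the partition count relative to the total core count and time the same cached job at each setting; if per-task scheduling/communication overhead dominates, runtime should stop improving (or get worse) once partitions far outnumber cores. Below is a rough sketch of such an experiment, not code from the original thread: the input path, the doubled-value transformation, and the sweep factors are all assumptions made for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PartitionSweep {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sweep"))

    // Hypothetical input; the original thread does not show how the RDD is built.
    val base = sc.textFile("hdfs:///data/input").map(_.length.toLong)

    // Total cores available to the app, e.g. 50 nodes x 4 cores = 200.
    val totalCores = sc.defaultParallelism

    // Try partition counts equal to, and at small multiples of, the core count.
    for (factor <- Seq(1, 2, 4, 8)) {
      val numParts = totalCores * factor
      val rdd = base.repartition(numParts).persist(StorageLevel.MEMORY_ONLY)
      rdd.count()                       // materialize the cache before timing

      val t0 = System.nanoTime()
      rdd.map(_ * 2).count()            // stand-in for the real transformation
      val elapsedMs = (System.nanoTime() - t0) / 1e6
      println(s"partitions=$numParts  elapsed=$elapsedMs ms")

      rdd.unpersist()
    }
    sc.stop()
  }
}

If the per-partition data is tiny, the runtime at factor 8 would be expected to be no better (and possibly worse) than at factor 1 or 2, since each extra task adds fixed scheduling and result-collection cost on the driver.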