I've noticed this as well and am curious whether anyone can shed more light
on it.

My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions, all finishing very quickly, and
thus a lot of communication overhead per unit of data processed.
Just a theory, though.
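
If that is what is going on, one cheap experiment is to shrink the partition
count so that each task has a meaningful amount of work, say a couple of
partitions per core rather than thousands of tiny ones. A rough Scala sketch,
assuming a spark-shell style "sc"; the RDD name and the multiplier are
placeholders, not anything taken from your job:

    // Shrink a small, cached RDD so each task amortizes its scheduling and
    // communication cost. coalesce() avoids a shuffle when only reducing
    // the partition count.
    val targetPartitions = sc.defaultParallelism * 2   // roughly 2 tasks per core
    val resized = smallRdd.coalesce(targetPartitions)
    resized.cache()
    resized.count()                                    // materialize the cache once

If the timings then scale more linearly, it was per-task overhead; if not,
something else is the bottleneck.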

On Wednesday, October 7, 2015, Yadid Ayzenberg <ya...@media.mit.edu>
wrote:

> Some additional relevant information that I left out:
>
> I'm running a transformation, there are no shuffles occurring, and at the
> end I'm performing a lookup of 4 partitions on the driver.
>
>
>
> On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
>
> Hi All,
>
> I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes of
> data). The RDD is partitioned into 2048 partitions which are more or less
> equal in size and entirely cached in RAM.
> I evaluated the performance on several cluster sizes, and am witnessing a
> non-linear (power-law) performance improvement as the cluster size
> increases (plot below). Each node has 4 cores and each worker is
> configured to use 10GB of RAM.
>
> [image: Spark performance]
>
> I would expect a more linear response given the number of partitions and
> the fact that all of the data is cached.
> Can anyone suggest what I should tweak in order to improve the performance?
> Or perhaps provide an explanation for the behavior I'm witnessing?
>
> Yadid
>
>
>
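
For reference, my reading of the job described above is roughly the shape
below. The class, path, and key are invented placeholders, and I'm guessing
the lookup means PairRDDFunctions.lookup on a keyed RDD, so please correct me
if the real job differs:

    // A map-only transformation over a cached, 2048-partition keyed RDD,
    // followed by a driver-side lookup.
    case class Record(id: Long, payload: String)

    val data = sc.textFile("hdfs://namenode/path/to/input")   // placeholder path
      .map { line => val f = line.split(','); Record(f(0).toLong, f(1)) }
      .keyBy(_.id)
      .repartition(2048)
    data.cache()
    data.count()                                               // materialize the cache

    val transformed = data.mapValues(r => r.payload.length)    // no shuffle
    val hits = transformed.lookup(42L)                         // pulls matches to the driver

If that is roughly right, the driver-side lookup is a cost that does not
shrink as you add nodes, which could also flatten the scaling curve.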
