I've noticed this as well and am curious whether anyone can say more
about it.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions all finishing very quickly, and
the per-task scheduling and communication overhead ends up dominating the
wall-clock time.
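
If that's what's happening, it should show up when you compare the same
action on the same cached data with many tiny partitions versus a few
larger ones. A rough spark-shell sketch (the path and partition counts
are made up):

  val fine = sc.textFile("hdfs:///path/to/data", 2048).cache()
  fine.count()  // materialize the cache before timing anything

  def time(body: => Unit): Double = {
    val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e9
  }

  println(f"2048 partitions: ${time(fine.count())}%.2f s")

  val coarse = fine.coalesce(64).cache()
  coarse.count()  // materialize the coarse copy too
  println(f"  64 partitions: ${time(coarse.count())}%.2f s")

If the 64-partition copy is much faster for the same action, per-task
overhead is a big share of your wall-clock time.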
OK, next question then: if this is wall-clock time for the whole
process, I wonder whether you are just measuring the time taken by the
longest single task. I'd expect the time taken by the longest straggler
task to follow a distribution like this. That is, how balanced are the
partitions?
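
One cheap way to check is to count records per partition and look at the
spread. A sketch, where `rdd` stands for your cached RDD:

  val counts = rdd
    .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
    .collect()
    .map(_._2)

  println(s"partitions=${counts.length} min=${counts.min} " +
    s"max=${counts.max} mean=${counts.sum.toDouble / counts.length}")

If the max is far above the mean, a handful of straggler tasks could
dominate your wall-clock numbers.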
Some additional relevant information I left out:
I'm running a transformation, there are no shuffles occurring, and at
the end I'm performing a lookup of 4 partitions on the driver.
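
To rule that step out, I can time the driver-side lookup separately from
the transformation, since collecting results funnels through one machine.
Roughly, with a made-up pair RDD standing in for the real data:

  val pairs = sc.parallelize(1 to 1000000).map(i => (i % 2048, i)).cache()
  pairs.count()  // force the transformation so only the lookup is timed

  val t0 = System.nanoTime()
  val hits = pairs.lookup(42)  // pulls the matching values back to the driver
  println(f"lookup: ${(System.nanoTime() - t0) / 1e9}%.2f s (${hits.size} values)")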
On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes
of data). The RDD is partitioned into 2048 partitions which are more or
less equal in size and entirely cached in RAM.
I evaluated the performance on several cluster sizes, and am witnessing
a non-linear (power) performance curve as the cluster grows.
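
For concreteness, each measurement is taken with something like the loop
below, with the cache fully populated before timing (the path is a
placeholder):

  val rdd = sc.textFile("hdfs:///path/to/data", 2048).cache()
  rdd.count()  // first pass fills the cache

  val secs = (1 to 5).map { _ =>
    val t0 = System.nanoTime()
    rdd.count()  // the measured action runs entirely against cached data
    (System.nanoTime() - t0) / 1e9
  }
  println(f"median wall-clock: ${secs.sorted.apply(secs.size / 2)}%.2f s")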