Clearly, your input size isn't changing. And depending on how the data blocks
are distributed across the nodes, there could be DataNode/disk contention.
A better way to model this is to scale the input data linearly as well: more
nodes should process proportionally more data in the same amount of time.
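For instance, here is a toy Python sketch contrasting the two setups. The
runtime model and every constant in it are made-up illustrations, not
measurements from your cluster:

# Hypothetical model: time = serial setup + parallel work + per-node overhead.
def strong_scaling_time(nodes, total_data_gb=100.0, rate_gb_per_node_s=0.5,
                        setup_s=20.0, overhead_s_per_node=1.0):
    # Fixed input split across more nodes: work shrinks, overheads don't.
    return (setup_s + total_data_gb / (rate_gb_per_node_s * nodes)
            + overhead_s_per_node * nodes)

def weak_scaling_time(nodes, data_gb_per_node=50.0, rate_gb_per_node_s=0.5,
                      setup_s=20.0, overhead_s_per_node=1.0):
    # Input grows with the cluster: ideal time stays flat as nodes increase.
    total_gb = data_gb_per_node * nodes
    return (setup_s + total_gb / (rate_gb_per_node_s * nodes)
            + overhead_s_per_node * nodes)

for n in (2, 4, 8):
    print(f"{n} nodes: strong={strong_scaling_time(n):6.1f}s"
          f"  weak={weak_scaling_time(n):6.1f}s")

Under weak scaling, a flat curve means the cluster is scaling well; any growth
in the weak-scaling time is pure overhead.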
Thanks,
+Vinod
But I still want to find the most efficient assignment while scaling both data
and nodes as you said. For example, in my results, 2 nodes gives the best
result, and 8 is better than 4.
Why is the scaling sub-linear from 2 to 4 nodes but super-linear from 4 to 8?
I find it hard to model this result. Can you give me some hints about this?
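To make the question concrete, here is a small Python calculation over
hypothetical runtimes (the numbers below are placeholders, not the actual
measurements):

# Hypothetical node-count -> job runtime (seconds); placeholder values only.
runtimes = {2: 100.0, 4: 60.0, 8: 25.0}

nodes_sorted = sorted(runtimes)
base_nodes, base_time = nodes_sorted[0], runtimes[nodes_sorted[0]]

for n in nodes_sorted:
    speedup = base_time / runtimes[n]        # relative to the smallest run
    efficiency = speedup / (n / base_nodes)  # 1.0 would be perfectly linear
    print(f"{n} nodes: speedup={speedup:.2f}, efficiency={efficiency:.2f}")

for prev, cur in zip(nodes_sorted, nodes_sorted[1:]):
    step = runtimes[prev] / runtimes[cur]    # doubling nodes: 2.00 is linear
    print(f"{prev}->{cur} nodes: step speedup={step:.2f}"
          " (>2.00 super-linear, <2.00 sub-linear)")

A step speedup below 2 when doubling nodes points at overheads such as the
DataNode/disk contention mentioned above; a step above 2 often means the
per-node working set shrank enough to fit in memory or the OS page cache,
cutting disk I/O.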