Each node can have any number of partitions. For best performance, Spark
tries to schedule each task on a node that already holds that partition's
data (in the list of tasks in the UI, see the Locality Level column).
As a rule of thumb, you probably want 2-3 times as many partitions as you
have cores in the cluster.
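
If it helps, here is a minimal sketch of applying that rule of thumb (the
app name, local master, and input path are made up, and this assumes a
Spark 1.x-style SparkContext):

    import org.apache.spark.{SparkConf, SparkContext}

    // local[*] is just for a quick local test; drop it under spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-check").setMaster("local[*]"))

    // defaultParallelism is roughly the total cores available to the app,
    // so 2-3x that is a reasonable starting partition count.
    val targetPartitions = sc.defaultParallelism * 3

    // "hdfs:///data/input" is a hypothetical path.
    val rdd = sc.textFile("hdfs:///data/input").repartition(targetPartitions)
    println(s"partitions = ${rdd.partitions.length}")

You can then watch the Locality Level column in the UI while the job runs.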
Thanks very much for Yong's help.
Sorry, one more issue: do different partitions have to be on different
nodes? That is, would each node hold only one partition in cluster mode?
On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" wrote:
Shuffling large amounts of data over the network is expensive, yes. The cost is
lower if you are just using a single node, where no networking is needed to do
the repartition (using Spark as a multithreading engine).
In general you need to do performance testing to see whether a repartition is
worth the cost for your workload.
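
For what it's worth, one rough way to test that is to time a cheap action
against both repartition (which always does a full shuffle) and coalesce
(which merges partitions without a shuffle when decreasing). A sketch with
made-up data sizes and partition counts:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-cost").setMaster("local[*]"))

    // Made-up data: 10 million integers across 200 partitions.
    val rdd = sc.parallelize(1 to 10000000, 200)

    // Small helper to time an action by name.
    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - t0) / 1e6} ms")
      result
    }

    // repartition shuffles everything, even when shrinking to 50 partitions;
    // coalesce just merges existing partitions, so it is usually cheaper.
    time("repartition to 50")(rdd.repartition(50).count())
    time("coalesce to 50")(rdd.coalesce(50).count())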