Re: is repartition very costly

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node process partitions that are already on that node for best performance (in the list of tasks in the UI, see the locality level column). As a rule of thumb, you probably want 2-3 times as many partitions as you have cores in your cluster.
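
A minimal sketch of that rule of thumb in Scala (Spark RDD API); the input path and the 3x multiplier are illustrative assumptions, not from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSizing {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sizing"))

    val totalCores = sc.defaultParallelism      // roughly the cores available to the app
    val targetPartitions = totalCores * 3       // 2-3 tasks per core, per the rule of thumb above

    val lines = sc.textFile("hdfs:///data/input.txt")   // placeholder path
    println(s"partitions before: ${lines.partitions.length}")

    // repartition triggers a full shuffle; the new layout has to pay for itself in later stages
    val rebalanced = lines.repartition(targetPartitions)
    println(s"partitions after: ${rebalanced.partitions.length}")

    sc.stop()
  }
}

Whether to use defaultParallelism or a hand-tuned number depends on the cluster; the point is only that several partitions per node (and per core) is normal and expected.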

Re: is repartition very costly

2015-12-08 Thread Zhiliang Zhu
Thanks very much for Young's help. Sorry, one more question: must different partitions be on different nodes? That is, in cluster mode, would each node only have one partition? On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" wrote:

RE: is repartition very costly

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are only using a single node, where no networking needs to be involved in the repartition (using Spark as a multithreading engine). In general you need to do performance testing to see if a repartition is worth the cost for your workload.
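
A hedged sketch of the kind of performance test Matthew suggests, in Scala: time a downstream action with and without the repartition, and compare against coalesce when you are only reducing the partition count. The path and partition counts are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionCost {
  // Small timing helper; prints elapsed wall-clock seconds for a block of work.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-cost"))
    val data = sc.textFile("hdfs:///data/input.txt")   // placeholder path

    // Full shuffle: every record may move across the network.
    time("count after repartition(200)") {
      data.repartition(200).count()
    }

    // coalesce without shuffle merges partitions locally, so it is much cheaper
    // when the goal is simply fewer partitions.
    time("count after coalesce(10)") {
      data.coalesce(10).count()
    }

    sc.stop()
  }
}

On a single node both variants avoid network transfer, which is why the penalty Matthew describes mostly shows up in cluster mode.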