Re: about partition number

2014-09-29 Thread Liquan Pei
Hi Anny,

Having many more partitions is not recommended in general, as that creates a lot of small tasks. All the tasks need to be sent to worker nodes for execution, so too many partitions increases task-scheduling overhead. Spark uses a synchronous, stage-by-stage execution model, which means that all tasks in a stage need to complete before the next stage can start.
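For context, the default number of partitions used for shuffles in the RDD API is controlled by the `spark.default.parallelism` setting. A minimal sketch of capping it at submit time (the class, master URL, and jar names here are placeholders):

```shell
# Sketch only: keep the default shuffle parallelism modest so stages
# don't fan out into thousands of tiny tasks.
spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --conf spark.default.parallelism=48 \
  myapp.jar
```

Individual RDDs can also be reshaped with `repartition(n)` or `coalesce(n)` if only one step needs a different partition count.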

Re: about partition number

2014-09-29 Thread Daniel Siegmann
A "task" is the work to be done on a partition for a given stage - you should expect the number of tasks to equal the number of partitions in each stage, though a task may need to be rerun (due to a failure, or the need to recompute some data). 2-4 times the number of cores in your cluster should be a good rule of thumb.
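Daniel's rule of thumb can be expressed as a tiny helper (a sketch; the function name is made up here, and the 2-4x multiplier range comes from the advice above):

```python
def suggested_partitions(total_cores: int, factor: int = 3) -> int:
    """Rule-of-thumb partition count: 2-4x the total cores in the cluster.

    `factor` defaults to 3, the midpoint of the suggested 2-4x range.
    """
    if not 2 <= factor <= 4:
        raise ValueError("factor should stay within the suggested 2-4x range")
    return total_cores * factor

# e.g. a 16-node cluster with 8 cores per node:
print(suggested_partitions(16 * 8))  # -> 384
```

With enough partitions per core, a straggling or failed task costs only a small slice of the stage when it is rerun, without creating so many tasks that scheduling overhead dominates.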