Clusters will not be fully utilized unless you set the level of parallelism
for each operation high enough. Spark automatically sets the number of
“map” tasks to run on each file according to its size. You can pass the
level of parallelism as a second argument to operations that shuffle data
(such as reduceByKey), or set the config property *spark.default.parallelism*
to change the default. In general, we recommend
2-3 tasks per CPU core in your cluster.
For example, the following code sets the number of partitions of the data
to 10, so that subsequent operations on it run in parallel across those partitions:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, 10)
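
As a rough sketch of the two alternatives mentioned above (assuming an
existing SparkContext named sc, as in spark-shell; the app name and the
value 24 below are only illustrative, roughly 2-3 tasks per core on an
8-core cluster):

// Pass the number of partitions directly to a shuffle operation
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.reduceByKey(_ + _, 10)   // result has 10 partitions

// Or set the cluster-wide default when constructing the context
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "24")
// val sc = new SparkContext(conf)

// Check how many partitions an RDD actually has, and repartition it if
// there are too few
println(distData.partitions.length)
// val moreParts = distData.repartition(20)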




2014-07-18 23:00 GMT+08:00 Shannon Quinn <squ...@gatech.edu>:

>  The default # of partitions is the # of cores, correct?
>
>
> On 7/18/14, 10:53 AM, Yanbo Liang wrote:
>
> Check how many partitions you have in your program.
> If there is only one, changing it to more partitions will make the
> execution parallel.
>
>
> 2014-07-18 20:57 GMT+08:00 Madhura <das.madhur...@gmail.com>:
>
>> I am running my program on a Spark cluster, but when I look at the UI
>> while the job is running I see that only one worker does most of the
>> tasks. My cluster has one master and 4 workers, where the master is also
>> a worker.
>>
>> I want my job to complete as quickly as possible, and I believe that if
>> the tasks were divided equally among the workers, the job would finish
>> faster.
>>
>> Is there any way I can customize the number of tasks on each worker?
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n10160/Question.png>
>>
