Re: Repartition and Worker Instances

Sameer Farooqui Mon, 23 Feb 2015 09:40:00 -0800

In Standalone mode, a Worker JVM starts an Executor. Inside the Executor
there are slots for task threads. The slot count is configured by the
num_cores setting. It is generally fine to oversubscribe this, so if you
have 10 free CPU cores, set num_cores to 20.
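
To make the arithmetic concrete, here is a rough sketch in plain Python (no
Spark needed). It only encodes the rules of thumb from this thread: 2x
oversubscription of cores, roughly 1x to 3x as many partitions as slots, and
tasks that run between about 200ms and 20 seconds. The function names and the
default factor are made up for illustration and are not Spark settings or APIs.

```python
from typing import Tuple


def task_slots(free_cores: int, oversubscribe: float = 2.0) -> int:
    """Task slots you get if the core setting is free_cores * oversubscribe.

    The 2.0 default is the rule of thumb from this thread, not a Spark default.
    """
    return int(free_cores * oversubscribe)


def partition_range(total_slots: int) -> Tuple[int, int, int]:
    """Return (low, target, high) partition counts: 1x, 2x and 3x the slots."""
    return total_slots, 2 * total_slots, 3 * total_slots


def task_duration_ok(seconds: float) -> bool:
    """True if a task's runtime falls inside the suggested 200ms to 20s window."""
    return 0.2 <= seconds <= 20.0


if __name__ == "__main__":
    print(task_slots(10))         # 10 free cores, oversubscribed 2x
    print(partition_range(100))   # 100 slots in the cluster
    print(task_duration_ok(0.05)) # a 50ms task is too short
```

With an actual RDD you would pass the chosen count to rdd.repartition(n) and
then check per-task durations in the web UI, since the right number also
depends on how much data each partition holds.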


On Monday, February 23, 2015, Deep Pradhan <[email protected]>
wrote:

> How is a task slot different from the # of Workers?
>
>
> >> so don't read into any performance metrics you've collected to
> extrapolate what may happen at scale.
> I did not understand this part.
>
> Thank You
>
> On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui <[email protected]> wrote:
>
>> In general you should first figure out how many task slots are in the
>> cluster and then repartition the RDD to maybe 2x that #. So if you have
>> 100 slots, then RDDs with a partition count of 100-300 would be normal.
>>
>> The size of each partition also matters. You want a task to operate on
>> a partition for at least 200ms, but no longer than around 20 seconds.
>>
>> Even if you have 100 slots, it could be okay to have an RDD with 10,000
>> partitions if you've read in a large file.
>>
>> So don't repartition your RDD to match the # of Worker JVMs, but rather
>> align it to the total # of task slots in the Executors.
>>
>> If you're running on a single node, shuffle operations become almost free
>> (because there's no network movement), so don't read into any
>> performance metrics you've collected to extrapolate what may happen at
>> scale.
>>
>>
>> On Monday, February 23, 2015, Deep Pradhan <[email protected]> wrote:
>>
>>> Hi,
>>> If I repartition my data by a factor equal to the number of worker
>>> instances, will the performance be better or worse?
>>> As far as I understand, the performance should be better, but in my case
>>> it is becoming worse.
>>> I have a single-node standalone cluster. Is that the reason?
>>> Am I guaranteed to have a better performance if I do the same thing in a
>>> multi-node cluster?
>>>
>>> Thank You
>>>
>>
>
