Yep, exactly!  I’m not sure how complicated it would be to pull off.  If someone 
wouldn’t mind pointing me in the right direction, I’d be happy to look into it 
and contribute this functionality.  I imagine it would be implemented in the 
scheduler codebase, possibly with some sort of rebalance configuration property 
to enable it?
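
Just to make the idea concrete, I’d picture something along these lines from the 
application side.  The rebalance property name below is completely made up (it 
does not exist in Spark today); spark.locality.wait is the existing knob Mayur 
mentions further down:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical: "spark.scheduler.rebalance.enabled" is only a sketch of
    // what an opt-in property for the proposed feature could look like.
    val conf = new SparkConf()
      .setAppName("balanced-job")
      .set("spark.scheduler.rebalance.enabled", "true") // made-up, illustrative only
      .set("spark.locality.wait", "3000")               // existing setting (ms)

    val sc = new SparkContext(conf)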

Does anyone else have any thoughts on this?
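
For anyone trying the current workaround in the quoted code below: ModPartitioner 
is just application code, not something in Spark.  A minimal sketch of what such a 
mod-based partitioner might look like, assuming the integer ids are used as keys:

    import org.apache.spark.Partitioner

    // Sketch only: routes each integer key to partition (key mod numParts).
    class ModPartitioner(val numParts: Int) extends Partitioner {
      override def numPartitions: Int = numParts

      override def getPartition(key: Any): Int = key match {
        case i: Int => ((i % numParts) + numParts) % numParts // non-negative modulo
        case _      => 0                                      // fallback for other key types
      }

      override def equals(other: Any): Boolean = other match {
        case m: ModPartitioner => m.numParts == numParts
        case _                 => false
      }

      override def hashCode: Int = numParts
    }

With parallelism set to the number of nodes, each id gets its own partition, which 
is what gives the one-task-per-node layout as long as nothing moves.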

Cheers,

Sean


On Jun 24, 2014, at 4:41 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> This would be really useful, especially for Shark, where a shift in
> partitioning affects all subsequent queries unless the task scheduling time
> beats spark.locality.wait. That can cause overall low performance for all
> subsequent tasks.
> 
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
> 
> 
> 
> On Tue, Jun 24, 2014 at 4:10 AM, Sean McNamara <sean.mcnam...@webtrends.com>
> wrote:
> 
>> We have a use case where we’d like something to execute once on each node
>> and I thought it would be good to ask here.
>> 
>> Currently we achieve this by setting the parallelism to the number of
>> nodes and use a mod partitioner:
>> 
>> val balancedRdd = sc.parallelize(
>>     (0 until Settings.parallelism)
>>       .map(id => id -> Settings.settings)
>>   )
>>   .partitionBy(new ModPartitioner(Settings.parallelism))
>>   .cache()
>> 
>> 
>> This works great except in two instances where it can become unbalanced:
>> 
>> 1. if a worker is restarted or dies, the partition will move to a
>> different node (one of the nodes will run two tasks).  When the worker
>> rejoins, is there a way to have a partition move back over to the newly
>> restarted worker so that it’s balanced again?
>> 
>> 2. drivers need to be started in a staggered fashion, otherwise one driver
>> can launch two tasks on one set of workers, and the other driver will do
>> the same with the other set.  Are there any scheduler/config semantics so
>> that each driver will take one (and only one) core from *each* node?
>> 
>> 
>> Thanks
>> 
>> Sean
