balancing RDDs

Sean McNamara Mon, 23 Jun 2014 15:41:29 -0700

We have a use case where we’d like something to execute once on each node and I 
thought it would be good to ask here.


Currently we achieve this by setting the parallelism to the number of nodes and 
use a mod partitioner:

val balancedRdd = sc.parallelize(
        (0 until Settings.parallelism)
        .map(id => id -> Settings.settings)
      ).partitionBy(new ModPartitioner(Settings.parallelism))
      .cache()


This works great except in two instances where it can become unbalanced:

1. if a worker is restarted or dies, the partition will move to a different 
node (one of the nodes will run two tasks).  When the worker rejoins, is there 
a way to have a partition move back over to the newly restarted worker so that 
it’s balanced again?

2. drivers need to be started in a staggered fashion, otherwise one driver can 
launch two tasks on one set of workers, and the other driver will do the same 
with the other set.  Are there any scheduler/config semantics so that each 
driver will take one (and only one) core from *each* node?


Thanks

Sean

balancing RDDs

Reply via email to