Oops - the correct pull request link is https://github.com/apache/spark/pull/1452
On Wed, Jul 16, 2014 at 10:06 PM, Reynold Xin <r...@databricks.com> wrote:
> Hi Spark devs,
>
> Want to give you a heads up that I'm working on a small (but major)
> change to how task dispatching works. Currently (as of Spark 1.0.1),
> Spark sends the RDD object and closures over Akka along with the task
> itself to the executors. This is inefficient because all tasks in the
> same stage use the same RDDs and closures, yet we send those closures
> and RDDs to the executors multiple times. This is especially bad when a
> closure references a very large variable; the current design forces
> users to explicitly broadcast such variables.
>
> The patch uses broadcast to send the RDD objects and closures to the
> executors, and uses Akka to send only a reference to the broadcast
> RDD/closure along with the partition-specific information for each
> task. For those of you familiar with the internals, Spark already
> relies on broadcast to send the Hadoop JobConf every time it uses a
> Hadoop input, because the JobConf is large.
>
> The user-facing impact of the change includes:
>
> 1. Users won't need to decide what to broadcast anymore.
> 2. Task sizes will get smaller, resulting in faster scheduling and
> higher task dispatch throughput.
>
> In addition, the change simplifies some internals of Spark, removing
> the need to maintain task caches and the complex logic to broadcast the
> JobConf (which also led to a deadlock recently).
>
> Pull request attached: https://github.com/apache/spark/pull/1450
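
For anyone who wants a concrete picture of point 1, here is a minimal Scala sketch of the usage pattern it refers to. The names (lookupTable, bcTable) are illustrative only, not from the patch; the point is that the second map no longer re-ships the captured variable with every task once closures are sent via broadcast, though an explicit broadcast can still make sense when the same variable is reused across many stages.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

        // A large, read-only lookup table referenced from a closure.
        val lookupTable: Map[Int, String] =
          (1 to 100000).map(i => i -> s"value-$i").toMap

        val data = sc.parallelize(1 to 10000)

        // Before the change: users were encouraged to broadcast large
        // variables explicitly so the data is shipped to each executor once.
        val bcTable = sc.broadcast(lookupTable)
        val withExplicitBroadcast =
          data.map(i => bcTable.value.getOrElse(i, "missing")).count()

        // After the change: the closure (and the RDD object) is itself sent
        // via broadcast, so capturing the variable directly no longer
        // re-sends it with every task in the stage.
        val withoutExplicitBroadcast =
          data.map(i => lookupTable.getOrElse(i, "missing")).count()

        sc.stop()
      }
    }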