Yup - that is correct. Thanks for clarifying.
On Wed, Jul 16, 2014 at 10:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Hey Reynold, just to clarify, users will still have to manually broadcast
> objects that they want to use *across* operations (e.g. in multiple
> iterations of an algorithm, or multiple map functions, or stuff like that).
> But they won't have to broadcast something they only use once.
>
> Matei
>
> On Jul 16, 2014, at 10:07 PM, Reynold Xin <r...@databricks.com> wrote:
>
> > Oops - the pull request should be https://github.com/apache/spark/pull/1452
> >
> > On Wed, Jul 16, 2014 at 10:06 PM, Reynold Xin <r...@databricks.com> wrote:
> >
> >> Hi Spark devs,
> >>
> >> I want to give you a heads-up that I'm working on a small (but major)
> >> change to how task dispatching works. Currently (as of Spark 1.0.1),
> >> Spark sends the RDD object and closures over Akka along with each task
> >> to the executors. This is inefficient because all tasks in the same
> >> stage use the same RDDs and closures, yet we send those RDDs and
> >> closures to the executors once per task. It is especially bad when a
> >> closure references a very large variable; this design is what forced
> >> users to explicitly broadcast large variables.
> >>
> >> The patch instead uses broadcast to send RDD objects and closures to
> >> the executors, and uses Akka to send only a reference to the broadcast
> >> RDD/closure along with the partition-specific information for each
> >> task. For those of you familiar with the internals, Spark already
> >> relies on broadcast to send the Hadoop JobConf whenever it uses a
> >> Hadoop input, because the JobConf is large.
> >>
> >> The user-facing impacts of the change are:
> >>
> >> 1. Users won't need to decide what to broadcast anymore.
> >> 2. Tasks will get smaller, resulting in faster scheduling and higher
> >> task dispatch throughput.
> >>
> >> In addition, the change simplifies some Spark internals, removing the
> >> need to maintain task caches and the complex logic for broadcasting
> >> the JobConf (which also led to a deadlock recently).
> >>
> >> Pull request attached: https://github.com/apache/spark/pull/1450
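
To make Matei's distinction concrete, here is a minimal Scala sketch
(illustrative only; the lookup table, its size, and the object name are
made up for the example, not code from the patch). The first map captures
a large variable directly in its closure, which is the single-use case
this change makes cheap; the explicit broadcast below it remains the right
tool when the same data is reused across several operations:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

        // Hypothetical large lookup table captured by the closures below.
        val lookup: Map[Int, String] =
          (1 to 1000000).map(i => i -> ("v" + i)).toMap

        val rdd = sc.parallelize(1 to 100)

        // Single use: the closure captures `lookup` directly. With the patch,
        // the serialized closure (including `lookup`) is broadcast once per
        // stage instead of being re-sent with every task over Akka.
        val once = rdd.map(i => lookup.getOrElse(i, "missing")).count()

        // Reuse across operations: still broadcast explicitly, so the data
        // is shipped to each executor once, not once per stage.
        val bcLookup = sc.broadcast(lookup)
        val hits = rdd.map(i => bcLookup.value.getOrElse(i, "missing")).count()
        val kept = rdd.filter(i => bcLookup.value.contains(i)).count()

        println(s"once=$once hits=$hits kept=$kept")
        sc.stop()
      }
    }

Both patterns compute the same results; the difference is only in how many
times `lookup` crosses the wire to the executors.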