Re: small (yet major) change going in: broadcasting RDD to reduce task size

Reynold Xin Wed, 16 Jul 2014 22:15:46 -0700

Yup - that is correct.  Thanks for clarifying.


On Wed, Jul 16, 2014 at 10:12 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hey Reynold, just to clarify, users will still have to manually broadcast
> objects that they want to use *across* operations (e.g. in multiple
> iterations of an algorithm, or multiple map functions, or stuff like that).
> But they won't have to broadcast something they only use once.
>
> Matei
>
> On Jul 16, 2014, at 10:07 PM, Reynold Xin <r...@databricks.com> wrote:
>
> > Oops - the pull request should be
> https://github.com/apache/spark/pull/1452
> >
> >
> > On Wed, Jul 16, 2014 at 10:06 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >
> >> Hi Spark devs,
> >>
> >> Want to give you guys a heads up that I'm working on a small (but major)
> >> change with respect to how task dispatching works. Currently (as of
> Spark
> >> 1.0.1), Spark sends RDD object and closures using Akka along with the
> task
> >> itself to the executors. This is however inefficient because all tasks
> in
> >> the same stage use the same RDDs and closures, but we have to send these
> >> closures and RDDs multiple times to the executors. This is especially
> bad
> >> when some closure references some variable that is very large. The
> current
> >> design led to users having to explicitly broadcast large variables.
> >>
> >> The patch uses broadcast to send RDD objects and the closures to
> >> executors, and use Akka to only send a reference to the broadcast
> >> RDD/closure along with the partition specific information for the task.
> For
> >> those of you who know more about the internals, Spark already relies on
> >> broadcast to send the Hadoop JobConf every time it uses the Hadoop
> input,
> >> because the JobConf is large.
> >>
> >> The user-facing impact of the change include:
> >>
> >> 1. Users won't need to decide what to broadcast anymore
> >> 2. Task size will get smaller, resulting in faster scheduling and higher
> >> task dispatch throughput.
> >>
> >> In addition, the change will simplify some internals of Spark, removing
> >> the need to maintain task caches and the complex logic to broadcast
> JobConf
> >> (which also led to a deadlock recently).
> >>
> >>
> >> Pull request attached: https://github.com/apache/spark/pull/1450
> >>
> >>
> >>
> >>
> >>
>
>

Re: small (yet major) change going in: broadcasting RDD to reduce task size

Reply via email to