Oops - the correct pull request link is https://github.com/apache/spark/pull/1452
On Wed, Jul 16, 2014 at 10:06 PM, Reynold Xin <r...@databricks.com> wrote:
> Hi Spark devs,
>
> Want to give you a heads up that I'm working on a small (but major)
> change to how task dispatching works. Currently (as of Spark 1.0.1),
> Spark sends the RDD object and closures over Akka along with the task
> itself to the executors. This is inefficient because all tasks in the
> same stage use the same RDDs and closures, yet we send those closures
> and RDDs to the executors multiple times. This is especially bad when a
> closure references a very large variable; the current design forces
> users to explicitly broadcast such variables.
>
> The patch uses broadcast to send the RDD objects and closures to the
> executors, and uses Akka to send only a reference to the broadcast
> RDD/closure along with the partition-specific information for each
> task. For those of you familiar with the internals, Spark already
> relies on broadcast to send the Hadoop JobConf every time it uses a
> Hadoop input, because the JobConf is large.
>
> The user-facing impact of the change includes:
>
> 1. Users won't need to decide what to broadcast anymore.
> 2. Task sizes will get smaller, resulting in faster scheduling and
> higher task dispatch throughput.
>
> In addition, the change simplifies some internals of Spark, removing
> the need to maintain task caches and the complex logic to broadcast the
> JobConf (which also led to a deadlock recently).
>
> Pull request attached: https://github.com/apache/spark/pull/1450
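
For anyone who wants a concrete picture of point 1, here is a minimal Scala sketch of the usage pattern it refers to. The names (lookupTable, bcTable) are illustrative only, not from the patch; the point is that the second map no longer re-ships the captured variable with every task once closures are sent via broadcast, though an explicit broadcast can still make sense when the same variable is reused across many stages.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))

        // A large, read-only lookup table referenced from a closure.
        val lookupTable: Map[Int, String] =
          (1 to 100000).map(i => i -> s"value-$i").toMap

        val data = sc.parallelize(1 to 10000)

        // Before the change: users were encouraged to broadcast large
        // variables explicitly so the data is shipped to each executor once.
        val bcTable = sc.broadcast(lookupTable)
        val withExplicitBroadcast =
          data.map(i => bcTable.value.getOrElse(i, "missing")).count()

        // After the change: the closure (and the RDD object) is itself sent
        // via broadcast, so capturing the variable directly no longer
        // re-sends it with every task in the stage.
        val withoutExplicitBroadcast =
          data.map(i => lookupTable.getOrElse(i, "missing")).count()

        sc.stop()
      }
    }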