At Netflix, we disable the broadcast timeout in our defaults.
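
Concretely, that's just spark.sql.broadcastTimeout. A minimal sketch of the
session-level equivalent of what goes in spark-defaults.conf (assuming a
negative value is still treated as "no timeout" on your version; a very
large value works as well):

  // Disable the SQL broadcast timeout for this session.
  // spark.sql.broadcastTimeout is in seconds; a negative value has meant
  // "no timeout", but double-check that on your Spark version.
  spark.conf.set("spark.sql.broadcastTimeout", "-1")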

I found that it never helped catch problems. With lazy evaluation, I think
it is reasonable for a table that should be broadcast to take a long time
to build. Just because the broadcast side is a subset or aggregation of a
large table, or requires a join itself to produce, doesn't mean that
broadcasting that data isn't the better choice for the final plan.
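
For example (the table names here are made up), this is a perfectly
reasonable broadcast candidate even though the broadcast side takes a
while to produce:

  import org.apache.spark.sql.functions.{broadcast, sum}

  // The broadcast side is a heavy aggregation over a large table: slow to
  // compute, but the result is small, so broadcasting it is still the
  // right plan.
  val totals = spark.table("big_fact_table")
    .groupBy("customer_id")
    .agg(sum("amount").as("total"))

  val joined = spark.table("events")
    .join(broadcast(totals), "customer_id")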

I'm not sure that a timeout for `sparkContext.broadcast` would be helpful
either. What bad behavior would this catch?

Let's just remove the timeout entirely, or disable it by default.

On Wed, Jan 30, 2019 at 9:27 AM Justin Uang <justin.u...@gmail.com> wrote:

> Hi all,
>
> We have noticed a lot of broadcast timeouts on our pipelines, and from
> some inspection, it seems that they happen when I have two threads trying
> to save two different DataFrames. We use the FIFO scheduler, so if I launch
> a job that needs all the executors, the second DataFrame's collect on the
> broadcast side is guaranteed to take longer than 5 minutes, and will throw.
>
> My question is why do we have a timeout on a collect when broadcasting? It
> seems silly that we have a small default timeout on something that is
> influenced by contention on the cluster. We are basically saying that all
> broadcast jobs need to finish in 5 minutes, regardless of our scheduling
> policy on the cluster.
>
> I'm curious about the original intention of the broadcast timeout. Is the
> broadcast timeout perhaps really meant to be a timeout on
> sparkContext.broadcast, instead of the child.executeCollectIterator()? In
> that case, would it make sense to move the timeout to wrap only
> sparkContext.broadcast?
>
> Best,
>
> Justin
>


-- 
Ryan Blue
Software Engineer
Netflix
