Hi all,

We have noticed a lot of broadcast timeouts in our pipelines, and from some inspection it seems they happen when we have two threads trying to save two different DataFrames. We use the FIFO scheduler, so if we launch a job that needs all the executors, the collect for the second DataFrame's broadcast is guaranteed to take longer than 5 minutes, and will throw.
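For context, the workaround we know of is raising spark.sql.broadcastTimeout (which defaults to 300 seconds), roughly like the sketch below; the value chosen is just illustrative:

```scala
// Stopgap sketch (value is illustrative): raise spark.sql.broadcastTimeout
// so a contended broadcast collect isn't killed after the default 300s.
spark.conf.set("spark.sql.broadcastTimeout", "1800") // seconds
```

But tuning this per cluster load feels like treating the symptom rather than the cause.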
My question is: why do we have a timeout on a collect when broadcasting? It seems silly to have a small default timeout on something that is influenced by contention on the cluster. We are basically saying that all broadcast jobs need to finish in 5 minutes, regardless of the scheduling policy on the cluster. I'm curious about the original intention of the broadcast timeout. Is it perhaps really meant to be a timeout on sparkContext.broadcast, rather than on child.executeCollectIterator()? If so, would it make sense to move the timeout so it wraps only sparkContext.broadcast?

Best,
Justin