Hi all,

We have been seeing a lot of broadcast timeouts in our pipelines, and from
some inspection, it seems they happen when two threads try to save two
different DataFrames concurrently. We use the FIFO scheduler, so if one
thread launches a job that needs all the executors, the other DataFrame's
collect on the broadcast side is all but guaranteed to take longer than 5
minutes, and it will throw.
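The failure mode can be reproduced in miniature with plain Python (a sketch, not Spark itself): a single-worker pool stands in for a fully occupied FIFO cluster, and a fixed timeout on the second task fires purely because of queueing, not because the task itself is slow.

```python
import concurrent.futures
import time

# Hypothetical stand-in for a FIFO cluster: one worker, so the second
# submitted "job" queues behind the first.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def big_job():
    # Stand-in for the job that occupies all the executors.
    time.sleep(0.5)
    return "big job done"

def broadcast_collect():
    # Stand-in for the broadcast side's collect; the actual work is short.
    time.sleep(0.1)
    return "collected"

f1 = executor.submit(big_job)
f2 = executor.submit(broadcast_collect)

# A fixed timeout (like the 5-minute broadcast timeout) measures queueing
# delay as well as real work, so the second task times out even though its
# own work takes a fraction of the allotted time.
try:
    f2.result(timeout=0.2)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True

print(timed_out)  # True: the "collect" timed out purely due to contention
```

The numbers are arbitrary; the point is that the timeout bounds queue wait plus execution, not execution alone.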

My question is: why do we have a timeout on a collect when broadcasting?
It seems odd to put a small default timeout on something whose duration is
driven by contention on the cluster. We are effectively saying that every
broadcast job must finish in 5 minutes, regardless of the scheduling
policy on the cluster.
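For anyone hitting the same thing: the timeout in question appears to be `spark.sql.broadcastTimeout` (300 seconds by default), and as a stopgap it can be raised per session. A PySpark sketch (the app name and 30-minute value are just illustrative):

```python
from pyspark.sql import SparkSession

# Stopgap, not a fix: raise spark.sql.broadcastTimeout (default 300s) so
# broadcast collects queued behind other FIFO jobs don't throw.
spark = (
    SparkSession.builder
    .appName("broadcast-timeout-workaround")          # hypothetical name
    .config("spark.sql.broadcastTimeout", "1800")     # 30 minutes
    .getOrCreate()
)
```

This only papers over the contention, of course; it doesn't answer the question of what the timeout should be measuring.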

I'm curious about the original intent of the broadcast timeout. Was it
really meant to be a timeout on sparkContext.broadcast, rather than on
child.executeCollectIterator()? If so, would it make sense to move the
timeout so that it wraps only sparkContext.broadcast?

Best,

Justin
