Hi all,

We have noticed a lot of broadcast timeouts in our pipelines, and from some inspection it seems they happen when we have two threads trying to save two different DataFrames. We use the FIFO scheduler, so if we launch a job that needs all the executors, the collect for the second DataFrame's broadcast is guaranteed to take longer than 5 minutes, and will throw.
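For context, the workaround we know of is raising spark.sql.broadcastTimeout (which defaults to 300 seconds), roughly like the sketch below; the value chosen is just illustrative:

```scala
// Stopgap sketch (value is illustrative): raise spark.sql.broadcastTimeout
// so a contended broadcast collect isn't killed after the default 300s.
spark.conf.set("spark.sql.broadcastTimeout", "1800") // seconds
```

But tuning this per cluster load feels like treating the symptom rather than the cause.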
My question is: why do we have a timeout on a collect when broadcasting? It seems silly to have a small default timeout on something that is influenced by contention on the cluster. We are basically saying that all broadcast jobs need to finish in 5 minutes, regardless of the scheduling policy on the cluster. I'm curious about the original intention of the broadcast timeout. Is it perhaps really meant to be a timeout on sparkContext.broadcast, rather than on child.executeCollectIterator()? If so, would it make sense to move the timeout so it wraps only sparkContext.broadcast?

Best,
Justin