Thank you for sharing, Emil.
> I am willing to help develop a fix, but might need some guidance on
> how this case could be handled better.
Could you file an official Apache JIRA for your finding and
propose a PR for that too with the test case? We can continue
our discussion on your PR.
Dong
As noted in SPARK-34939, there is a race when using broadcast for map
output statuses. Explanation from SPARK-34939:
> After map statuses are broadcasted and the executors obtain
serialized broadcasted map statuses. If any fetch failure happens after,
Spark scheduler invalidates cached map statuses