ivoson opened a new pull request, #52335: URL: https://github.com/apache/spark/pull/52335
### What changes were proposed in this pull request?

Currently there are a few blocking RPC requests used in `DAGSchedulerEventProcessLoop`:

1. [blockManagerMaster.getLocations(blockIds)](https://github.com/apache/spark/blob/fbdad297f54200b686f437a6c25fd1c387d1aaa0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L451), used in `getCacheLocs` to get RDD cache locations.
2. [blockManagerMaster.removeShufflePushMergerLocation](https://github.com/apache/spark/blob/fbdad297f54200b686f437a6c25fd1c387d1aaa0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2754), used in `removeExecutorAndUnregisterOutputs`.
3. [blockManagerMaster.removeExecutor(execId)](https://github.com/apache/spark/blob/fbdad297f54200b686f437a6c25fd1c387d1aaa0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2765), used in `removeExecutorAndUnregisterOutputs`.

An `RpcTimeoutException` can be thrown if slow events are blocking the `BlockManagerMasterEndpoint`, and the exception is not handled in the `DAGSchedulerEventProcessLoop`. Once this happens, the DAGScheduler will exit.

This PR proposes to catch and handle the `RpcTimeoutException` properly instead of crashing the application. There are 2 scenarios:

1. Change the requests in `removeExecutorAndUnregisterOutputs` to be async, since we don't rely on the response, and let `BlockManagerMasterEndpoint` deal with the potential errors.
2. Abort the stage if an RPC timeout happens during `submitStage`.

### Why are the changes needed?

To avoid an RPC timeout crashing the Spark application.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests added.

### Was this patch authored or co-authored using generative AI tooling?

No

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
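For readers who want a concrete picture of the two scenarios in the pull request description, here is a minimal, self-contained sketch. It is not the code from the PR and does not use Spark's classes: `RpcTimeoutSketch`, `removeExecutorAsync`, `submitStage`, `SubmitResult` and the `send`/`onError` parameters are hypothetical names, and a plain `java.util.concurrent.TimeoutException` from `Await.result` stands in for `RpcTimeoutException`.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.FiniteDuration
import scala.util.{Failure, Success}

object RpcTimeoutSketch {
  private implicit val ec: ExecutionContext = ExecutionContext.global

  sealed trait SubmitResult
  case object Submitted extends SubmitResult
  final case class Aborted(reason: String) extends SubmitResult

  // Scenario 1 (hypothetical shape): send the removal request without blocking
  // the event loop. A failed reply is only reported through onError and is
  // never rethrown to the caller.
  def removeExecutorAsync(
      execId: String,
      send: String => Future[Boolean],
      onError: Throwable => Unit): Unit =
    send(execId).onComplete {
      case Failure(e) => onError(e)
      case Success(_) => ()
    }

  // Scenario 2 (hypothetical shape): the blocking lookup may time out. The
  // timeout is caught and turned into an abort of this one stage instead of
  // escaping from the event loop.
  def submitStage(
      stageId: Int,
      cacheLocs: Future[Seq[String]],
      timeout: FiniteDuration): SubmitResult =
    try {
      Await.result(cacheLocs, timeout) // throws TimeoutException if no reply arrives in time
      Submitted
    } catch {
      case e: TimeoutException =>
        Aborted(s"Stage $stageId aborted after RPC timeout: ${e.getMessage}")
    }
}
```

In this sketch a slow endpoint affects only the stage being submitted, and the executor-removal path never waits on a reply at all, which mirrors the intent stated in the description.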
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org