ivoson opened a new pull request, #52335:
URL: https://github.com/apache/spark/pull/52335

   ### What changes were proposed in this pull request?
   Currently there are a few blocking RPC requests used in `DAGSchedulerEventProcessLoop`:
   1. [blockManagerMaster.getLocations(blockIds)](https://github.com/apache/spark/blob/fbdad297f54200b686f437a6c25fd1c387d1aaa0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L451) used in `getCacheLocs` to get RDD cache locations.
   2. [blockManagerMaster.removeShufflePushMergerLocation](https://github.com/apache/spark/blob/fbdad297f54200b686f437a6c25fd1c387d1aaa0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2754) used in `removeExecutorAndUnregisterOutputs`.
   3. [blockManagerMaster.removeExecutor(execId)](https://github.com/apache/spark/blob/fbdad297f54200b686f437a6c25fd1c387d1aaa0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2765) used in `removeExecutorAndUnregisterOutputs`.
   
   An `RpcTimeoutException` can be thrown if slow events are blocking the `BlockManagerMasterEndpoint`, and the exception is not handled in `DAGSchedulerEventProcessLoop`. Once this happens, the DAGScheduler exits.
   
   This PR proposes to catch and handle the `RpcTimeoutException` properly instead of crashing the application. There are 2 scenarios:
   1. Make the requests in `removeExecutorAndUnregisterOutputs` asynchronous, since we don't rely on the response, and let `BlockManagerMasterEndpoint` deal with any potential errors.
   2. Abort the stage if an RPC timeout happens during `submitStage`.
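
   The two scenarios can be sketched roughly as follows. This is an illustration using only the Scala standard library, not the code in this PR: `removeExecutorAsync`, `submitStage`, and `StageResult` are hypothetical names, plain `Future`s stand in for the RPC calls, and `java.util.concurrent.TimeoutException` stands in for `RpcTimeoutException`.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object TimeoutHandlingSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical outcome type for a stage submission.
  sealed trait StageResult
  case object Submitted extends StageResult
  final case class Aborted(reason: String) extends StageResult

  // Scenario 1: fire-and-forget. The caller never waits for the reply;
  // a failure is only reported through the callback.
  def removeExecutorAsync(request: Future[Boolean])(onError: Throwable => Unit): Unit =
    request.onComplete {
      case Failure(e) => onError(e)
      case Success(_) => ()
    }

  // Scenario 2: the reply is needed, so the call stays blocking, but a
  // timeout aborts only the stage being submitted instead of propagating.
  def submitStage(cacheLocations: Future[Seq[String]], timeout: FiniteDuration): StageResult =
    try {
      Await.result(cacheLocations, timeout)
      Submitted
    } catch {
      case e: TimeoutException => Aborted(s"RPC timed out: ${e.getMessage}")
    }
}
```

   The design choice in this sketch is to block only where the response is actually consumed, and to turn a timeout there into a scoped failure rather than letting it escape the event loop.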
   
   ### Why are the changes needed?
   To prevent an RPC timeout from crashing the Spark application.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

