ivoson opened a new pull request, #52336: URL: https://github.com/apache/spark/pull/52336
### What changes were proposed in this pull request?

This PR proposes to retry all tasks of the consumer stages when checksum mismatches are detected on their producer stages. If we cannot roll back and retry all tasks of a consumer stage, we have to abort the stage (and thus the job).

How nondeterminism was detected and handled before:
- Stages are labeled as indeterminate at planning time, prior to query execution.
- When a task completes and a `FetchFailed` is detected, we abort all unrollbackable succeeding stages of the map stage and resubmit the failed stages.
- In `submitMissingTasks()`, if a stage itself `isIndeterminate`, we call `unregisterAllMapAndMergeOutput()` and retry all tasks of the stage.

How nondeterminism is detected and handled now:
- During query execution, we keep track of the checksums produced by each map task.
- When a task completes and a checksum mismatch is detected, we abort the unrollbackable succeeding stages of the stage with checksum mismatches. The resubmission of failed stages still happens in the same places as before.
- In `submitMissingTasks()`, if the parent of a stage has checksum mismatches, we call `unregisterAllMapAndMergeOutput()` and retry all tasks of the stage.

Note that:
1. If a stage `isReliablyCheckpointed`, its consumer stages do not need a whole-stage retry.
2. When mismatches are detected for a stage in a chain (e.g., the first stage in stage_i -> stage_i+1 -> stage_i+2 -> ...), the direct consumer of that stage (e.g., stage_i+1) gets a whole-stage retry, and an indirect consumer (e.g., stage_i+2) gets a whole-stage retry only when its own parent detects checksum mismatches.

### Why are the changes needed?

To handle nondeterminism issues caused by the retry of shuffle map tasks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UTs added.

### Was this patch authored or co-authored using generative AI tooling?
No
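
As a rough illustration of the behavior described in this PR, the toy model below sketches the decision in Python. It is not Spark's scheduler code: `Stage`, `record_task_output`, `needs_whole_stage_retry`, and `consumer_action` are hypothetical names, the checksum values are made up, and the handling of `reliably_checkpointed` reflects one reading of the note in the description (the flag is taken to belong to the producer stage).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Stage:
    """Toy stand-in for a shuffle map stage (hypothetical, not Spark's Stage)."""
    stage_id: int
    parents: list[Stage] = field(default_factory=list)
    reliably_checkpointed: bool = False
    # partition id -> checksum of the most recent successful map task output
    checksums: dict[int, int] = field(default_factory=dict)
    checksum_mismatch: bool = False

    def record_task_output(self, partition: int, checksum: int) -> None:
        """Track a map task's checksum; flag a mismatch if a rerun differs."""
        previous = self.checksums.get(partition)
        if previous is not None and previous != checksum:
            self.checksum_mismatch = True
        self.checksums[partition] = checksum


def needs_whole_stage_retry(stage: Stage) -> bool:
    """True if any direct parent saw a mismatch and is not reliably checkpointed."""
    return any(
        p.checksum_mismatch and not p.reliably_checkpointed
        for p in stage.parents
    )


def consumer_action(stage: Stage, can_rollback: bool) -> str:
    """Decide what happens to a consumer stage (assumed labels, for illustration)."""
    if not needs_whole_stage_retry(stage):
        return "retry_missing_tasks"
    return "retry_all_tasks" if can_rollback else "abort_stage"


# Chain from the description: stage_i -> stage_i+1 -> stage_i+2
stage_i = Stage(0)
stage_i1 = Stage(1, parents=[stage_i])
stage_i2 = Stage(2, parents=[stage_i1])

stage_i.record_task_output(partition=0, checksum=111)
stage_i.record_task_output(partition=0, checksum=222)  # rerun produced different data

print(consumer_action(stage_i1, can_rollback=True))   # direct consumer
print(consumer_action(stage_i2, can_rollback=True))   # indirect consumer
```

In this sketch only the direct consumer is marked for a full rerun. The indirect consumer is left alone until its own parent records a mismatch, which matches point 2 of the note in the description.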
