ivoson opened a new pull request, #52336:
URL: https://github.com/apache/spark/pull/52336

   ### What changes were proposed in this pull request?
   This PR proposes retrying all tasks of a consumer stage when checksum 
mismatches are detected on its producer stage. If we cannot roll back and 
retry all tasks of a consumer stage, we have to abort the stage (and thus the 
job).
   
   How nondeterminism was detected and handled before:
   - Stages are labeled as indeterminate at planning time, prior to query 
execution.
   - When a task completes and `FetchFailed` is detected, we abort all 
succeeding stages of the map stage that cannot be rolled back, and resubmit 
the failed stages.
   - In `submitMissingTasks()`, if a stage itself is indeterminate 
(`isIndeterminate`), we call `unregisterAllMapAndMergeOutput()` and retry all 
tasks of the stage.
   
   How nondeterminism is detected and handled now:
   - During query execution, we keep track of the checksums produced by each 
map task.
   - When a task completes and a checksum mismatch is detected, we abort the 
succeeding stages of the mismatching stage that cannot be rolled back. Failed 
stages are still resubmitted in the same places as before.
   - In `submitMissingTasks()`, if the parent of a stage has checksum 
mismatches, we call `unregisterAllMapAndMergeOutput()` and retry all tasks of 
the stage.
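
   The checksum tracking described in the list above can be sketched roughly 
as follows. This is a minimal illustration, not the PR's code: 
`ChecksumTracker`, `onTaskCompleted`, and `hasMismatch` are hypothetical names, 
and the sketch assumes one `Long` checksum per map task output.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark's DAGScheduler.
// Assumption: each completed map task reports a single Long checksum.
final class ChecksumTracker {
  // (stageId, partitionId) -> checksum of the most recent successful attempt
  private val recorded = mutable.Map.empty[(Int, Int), Long]
  // Stages for which at least one partition produced differing checksums
  private val mismatched = mutable.Set.empty[Int]

  /** Records a checksum; returns true if it differs from an earlier attempt. */
  def onTaskCompleted(stageId: Int, partitionId: Int, checksum: Long): Boolean = {
    val key = (stageId, partitionId)
    recorded.get(key) match {
      case Some(previous) if previous != checksum =>
        // A retried task produced different output: remember the mismatch
        // and keep the latest checksum.
        mismatched += stageId
        recorded(key) = checksum
        true
      case _ =>
        // First attempt, or a retry that reproduced the same output.
        recorded(key) = checksum
        false
    }
  }

  /** True if any partition of the stage has shown a checksum mismatch. */
  def hasMismatch(stageId: Int): Boolean = mismatched.contains(stageId)
}
```

   In this sketch, a scheduler would consult `hasMismatch` for a stage's 
parents before deciding whether to retry only the missing tasks or all of them.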
   
   Note that (1) if a stage `isReliablyCheckpointed`, its consumer stages 
don't need a whole-stage retry, and (2) when mismatches are detected for a 
stage in a chain (e.g., the first stage in stage_i -> stage_i+1 -> stage_i+2 -> 
...), the direct consumer (e.g., stage_i+1) gets a whole-stage retry, and an 
indirect consumer (e.g., stage_i+2) gets a whole-stage retry when its own 
parent detects checksum mismatches.
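
   The two notes above can be illustrated with a small decision function. 
This is a sketch under assumptions: `StageInfo` and `needsWholeStageRetry` are 
hypothetical names, and each stage is modeled with only the fields relevant 
here.

```scala
// Hypothetical, simplified stage model; not Spark's Stage class.
final case class StageInfo(
    id: Int,
    parentIds: Seq[Int],
    hasChecksumMismatch: Boolean,
    isReliablyCheckpointed: Boolean)

object RetryDecision {
  /**
   * A stage needs a whole-stage retry if any direct parent has checksum
   * mismatches and is not reliably checkpointed. Indirect ancestors are
   * deliberately not inspected.
   */
  def needsWholeStageRetry(stage: StageInfo, stages: Map[Int, StageInfo]): Boolean =
    stage.parentIds
      .flatMap(stages.get)
      .exists(p => p.hasChecksumMismatch && !p.isReliablyCheckpointed)
}
```

   For a chain 0 -> 1 -> 2 where only stage 0 has mismatches, this returns 
true for stage 1 and false for stage 2. Stage 2 qualifies once stage 1 itself 
reports mismatches.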
   
   ### Why are the changes needed?
   Handle nondeterminism issues caused by retries of shuffle map tasks.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests added.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

