attilapiros opened a new pull request, #50946:
URL: https://github.com/apache/spark/pull/50946

   What changes were proposed in this pull request?
   This PR aborts an indeterminate result stage that has partially completed, 
instead of resubmitting it.
   
   Why are the changes needed?
   Compared to a shuffle map stage, a result stage has more output and more 
intermediate state:
   
   It can use a FileOutputCommitter where each task does a Hadoop task commit. 
On a resubmit, that Hadoop task would be committed again (possibly 
with different content).
   In the case of a JDBC write, it may already have inserted all rows of a 
partition into the target schema.
   Ignoring the resubmit when a recalculation is needed would cause data 
corruption: the partial result is based on the previous indeterminate 
computation, while continuing means finishing the stage with the newly 
recomputed data.
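
   To make the corruption concrete, here is a toy Python model (not Spark 
code). It assumes an indeterminate parent whose row order depends on the 
attempt number, and a result stage whose partition 0 was committed from the 
first attempt while partition 1 is finished from the recomputed attempt. All 
names (`partitions_for_attempt`, `mixed_attempt_output`, `audit`) are 
hypothetical and exist only for this illustration.

```python
from collections import Counter


def partitions_for_attempt(rows, num_partitions, attempt):
    """Toy stand-in for an indeterminate stage: the row order (and so the
    round-robin partition assignment) changes with the attempt number."""
    shift = attempt % len(rows)
    ordered = rows[shift:] + rows[:shift]
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(ordered):
        partitions[i % num_partitions].append(row)
    return partitions


def mixed_attempt_output(rows, num_partitions, committed):
    """Partitions in `committed` keep the output of attempt 0; the rest
    are written from the recomputed attempt 1."""
    first = partitions_for_attempt(rows, num_partitions, attempt=0)
    second = partitions_for_attempt(rows, num_partitions, attempt=1)
    output = []
    for p in range(num_partitions):
        source = first if p in committed else second
        output.extend(source[p])
    return output


def audit(rows, output):
    """Return (lost rows, duplicated rows) of the output versus the input."""
    counts = Counter(output)
    lost = sorted(r for r in rows if counts[r] == 0)
    duplicated = sorted(r for r in rows if counts[r] > 1)
    return lost, duplicated


rows = [0, 1, 2, 3, 4, 5]
output = mixed_attempt_output(rows, num_partitions=2, committed={0})
lost, duplicated = audit(rows, output)
print(output)      # [0, 2, 4, 2, 4, 0]
print(lost)        # [1, 3, 5]
print(duplicated)  # [0, 2, 4]
```

   In this model every partition is individually valid, yet the combined 
output loses half of the rows and duplicates the other half, which is the 
kind of silent corruption that aborting the stage avoids.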
   
   As long as rollback of a result stage is not supported 
(https://issues.apache.org/jira/browse/SPARK-25342) the best we can do when a 
recalculation is needed is aborting the stage.
   
   Before this PR, the code already tried to address a similar 
situation in the handling of FetchFailed, when the fetch comes from an 
indeterminate shuffle map stage: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2178-L2182
   
   But this is not enough, as a FetchFailed from a determinate stage can lead to 
an executor loss and a recompute of the indeterminate parent of the result 
stage, as shown in the attached unit test.
   
   Moreover, ResubmitFailedStages can race with a successful 
CompletionEvent. This is why this PR detects the partial execution at the 
resubmit of the indeterminate result stage.
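
   The decision described above can be sketched as a small Python model. 
This is an illustration under stated assumptions, not the code in 
DAGScheduler.scala: `ResultStageState`, `Action` and `decide_on_resubmit` are 
hypothetical names, and "partially completed" is assumed to mean that some 
but not all partitions have finished.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RESUBMIT = "resubmit"
    ABORT = "abort"


@dataclass
class ResultStageState:
    """Hypothetical, simplified view of a result stage."""
    num_partitions: int
    is_indeterminate: bool
    finished_partitions: set = field(default_factory=set)

    def is_partially_completed(self) -> bool:
        # Some, but not all, partitions have already produced output.
        return 0 < len(self.finished_partitions) < self.num_partitions


def decide_on_resubmit(stage: ResultStageState) -> Action:
    """Evaluate the stage at resubmit time, so that completions which
    arrived after the failure was first noticed are taken into account."""
    if stage.is_indeterminate and stage.is_partially_completed():
        return Action.ABORT
    return Action.RESUBMIT


stage = ResultStageState(num_partitions=2, is_indeterminate=True)
print(decide_on_resubmit(stage))   # Action.RESUBMIT: nothing has finished yet

# A successful completion of partition 0 races in before the resubmit.
stage.finished_partitions.add(0)
print(decide_on_resubmit(stage))   # Action.ABORT
```

   Checking at resubmit time rather than when the failure is first handled is 
what lets this sketch observe a completion that arrived in between.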
   
   Does this PR introduce any user-facing change?
   No.
   
   How was this patch tested?
   New unit tests were added to illustrate the situation above.
   
   Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

