ahshahid commented on PR #50033: URL: https://github.com/apache/spark/pull/50033#issuecomment-2784622753
@attilapiros: I don't quite understand how the situation you have described can arise in this PR. In `handleTaskCompletion`, the first check is:

```scala
val isIndeterministicZombie = event.reason match {
  case Success if stageOption.isDefined =>
    val stage = stageOption.get
    (task.stageAttemptId < stage.latestInfo.attemptNumber() && stage.isIndeterminate) ||
      stage.shouldDiscardResult(task.stageAttemptId)
  case _ => false
}
```

So let's walk through the first case:

1. The first partition result is a failure.
2. Before the asynchronous `ResubmitFailure` message is sent, the flag is set on the stage via a call to `markAllPartitionsMissing()` (a sketch of this flag's lifecycle follows at the end of this comment).
3. Now, before the resubmit (and hence before the stage's attempt number is increased), a successful result task for the same result stage gets processed. The check at the start of `handleTaskCompletion` marks it as an `isIndeterministicZombie`, so its output is discarded without committing anything to file.
4. When the resubmit increases the attempt id, the flag to discard result tasks is reset to false.

As for the second case, where the first result task is successful but a subsequent task fails, the code follows the existing path of aborting the query.
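To make steps 2–4 concrete, here is a minimal, self-contained sketch of how a stage-level discard flag with these semantics could behave. Only `markAllPartitionsMissing()` and `shouldDiscardResult(...)` come from the snippet above; the class name, its fields, and the reset-on-resubmit method are hypothetical illustrations of the sequence described, not the PR's actual implementation.

```scala
// A hedged sketch, not the PR's code: a stage-level flag that discards late
// results of the attempt whose partitions were all marked missing, and that
// is cleared once resubmission creates a new attempt.
object DiscardFlagSketch extends App {

  class StageSketch {
    private var latestAttemptNumber: Int = 0
    // Attempt id whose late results must be discarded, if any (hypothetical field).
    private var discardResultsOfAttempt: Option[Int] = None

    // Step 2: invoked before the asynchronous ResubmitFailure message is sent.
    def markAllPartitionsMissing(): Unit =
      discardResultsOfAttempt = Some(latestAttemptNumber)

    // Step 3: consulted at the start of handleTaskCompletion; a late success
    // from the flagged attempt is treated as a zombie and its output dropped.
    def shouldDiscardResult(taskStageAttemptId: Int): Boolean =
      discardResultsOfAttempt.contains(taskStageAttemptId)

    // Step 4: resubmission bumps the attempt number and resets the flag
    // (hypothetical method name).
    def newAttemptOnResubmit(): Unit = {
      latestAttemptNumber += 1
      discardResultsOfAttempt = None
    }
  }

  val stage = new StageSketch
  stage.markAllPartitionsMissing()        // step 2: first result task failed
  assert(stage.shouldDiscardResult(0))    // step 3: late success from attempt 0 is discarded
  stage.newAttemptOnResubmit()            // step 4: attempt id increases, flag resets
  assert(!stage.shouldDiscardResult(1))   // results of the new attempt are kept
}
```

The point the sketch tries to capture is the ordering guarantee: any successful result task processed between `markAllPartitionsMissing()` and the resubmit still carries the old attempt id, so the `shouldDiscardResult` branch of the zombie check catches it before anything is committed.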