ahshahid commented on code in PR #50033: URL: https://github.com/apache/spark/pull/50033#discussion_r1996488616
########## core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala:
##########
@@ -90,8 +90,11 @@ private[spark] class ShuffleMapStage(
   /** Returns the sequence of partition ids that are missing (i.e. need to be computed). */
   override def findMissingPartitions(): Seq[Int] = {
-    mapOutputTrackerMaster
-      .findMissingPartitions(shuffleDep.shuffleId)
-      .getOrElse(0 until numPartitions)
+    if (this.areAllPartitionsMissing(this.latestInfo.attemptNumber())) {

Review Comment:
   @attilapiros For ShuffleMapStage specifically, you are right that attemptIdAllPartitionsMissing is not needed, because of the following check in submitMissingTasks:
   ```
   stage match {
     case sms: ShuffleMapStage if stage.isIndeterminate && !sms.isAvailable =>
       mapOutputTracker.unregisterAllMapAndMergeOutput(sms.shuffleDep.shuffleId)
       sms.shuffleDep.newShuffleMergeState()
     case _ =>
   }
   ```
   One might think that for ResultStage, too, a similar call in the above code to simply mark all partitions as missing would suffice, which would make the newly added fields completely redundant. But the question then becomes: how does one ensure that a result-completion task from the current failed attempt cannot proceed to commit its data? This means the newly added field is needed at least for ResultStage.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: reviews-h...@spark.apache.org
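The concern raised in the comment above can be illustrated with a minimal, self-contained sketch. This is not Spark's actual implementation: `StageState`, `mayCommit`, and the recorded attempt id are hypothetical stand-ins for the newly added field, shown only to make the stale-commit argument concrete.

```scala
// Hypothetical sketch (not Spark's real API): why a ResultStage needs to
// remember the attempt id for which all partitions were marked missing.
// A completion event from that attempt (or an earlier one) is stale and
// must not be allowed to commit its output.
object ResultStageGuardSketch {
  // Stand-in for the newly added field on the stage: the latest attempt id
  // whose partitions were all invalidated.
  final case class StageState(attemptIdAllPartitionsMissing: Int)

  // A result-task completion may commit only if it belongs to a strictly
  // newer attempt than the one that was invalidated.
  def mayCommit(state: StageState, taskAttemptId: Int): Boolean =
    taskAttemptId > state.attemptIdAllPartitionsMissing

  def main(args: Array[String]): Unit = {
    val state = StageState(attemptIdAllPartitionsMissing = 1)
    println(mayCommit(state, taskAttemptId = 1)) // stale attempt: false
    println(mayCommit(state, taskAttemptId = 2)) // newer attempt: true
  }
}
```

Without the recorded attempt id there is nothing for the guard to compare against, which is why merely re-marking partitions as missing (as the ShuffleMapStage path does) is not sufficient for a ResultStage.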