attilapiros commented on code in PR #50033:
URL: https://github.com/apache/spark/pull/50033#discussion_r2023833320
##########
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##########
@@ -3185,16 +3202,164 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
       "Spark can only do this while using the new shuffle block fetching protocol"))
   }
+
+
+  test("SPARK-51272: retry all the partitions of result stage, if the first result task" +
+    " has failed and failing ShuffleMap stage is inDeterminate") {
+    this.dagSchedulerInterceptor = createDagInterceptorForSpark51272(
+      () => taskSets(2).tasks(1), "RELEASE_LATCH")
+
+    val numPartitions = 2
+    // The first shuffle stage is completed by the below function itself which creates two
+    // indeterminate stages.
+    val (shuffleId1, shuffleId2) = constructTwoStages(
+      stage1InDeterminate = false, stage2InDeterminate = true)
+    completeShuffleMapStageSuccessfully(shuffleId2, 0, numPartitions)
+    val resultStage = scheduler.stageIdToStage(2).asInstanceOf[ResultStage]
+    val activeJob = resultStage.activeJob
+    assert(activeJob.isDefined)
+    // The result stage is still waiting for its 2 tasks to complete
+    assert(resultStage.findMissingPartitions() == Seq.tabulate(numPartitions)(i => i))
+
+    // The below event is going to initiate the retry of previous indeterminate stages, and also
+    // the retry of all result tasks. But before the "ResubmitFailedStages" event is added to the
+    // queue of Scheduler, a successful completion of the result partition task is added to the
+    // event queue. Due to scenario, the bug surfaces where instead of retry of all partitions
+    // of result tasks (2 tasks in total), only some (1 task) get retried
+    runEvent(
+      makeCompletionEvent(
+        taskSets(2).tasks(0),
+        FetchFailed(makeBlockManagerId("hostA"), shuffleId1, 0L, 0, 0, "ignored"),

Review Comment:
   I do not think this is a valid test case. The `shuffleId1` belongs to the deterministic Stage.
   See the log:
   ```
   25/04/01 16:43:47.983 pool-1-thread-1-ScalaTest-running-DAGSchedulerSuite INFO DAGSchedulerSuite$MyDAGScheduler: Marking ResultStage 2 () as failed due to a fetch failure from ShuffleMapStage 0 (RDD at DAGSchedulerSuite.scala:127)
   ```

   ResultStage 2 depends on ShuffleMapStage 1, and ShuffleMapStage 1 is not recomputed merely because of a fetch failure from ShuffleMapStage 0.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org