mridulm commented on PR #50033: URL: https://github.com/apache/spark/pull/50033#issuecomment-2800300563
> Even if the map stage determinate a fetch failure will lead to executor lost which can remove map output of the indeterminate stage and when the result stage is resubmitted the indeterminate parent will be detected as missing and will be resubmitted.

The indeterminate stage here is a `ResultStage` - so the scenario described does not apply (no shuffle output). We do need to ensure committed task output is handled properly, which is similar to the case of speculative execution / node loss / task re-execution due to non-shuffle fetch failures / ...

For the case of a shuffle map stage, `submitMissingTasks` does clear the shuffle output of the stage when a new attempt is resubmitted (the first match statement).

With that out of the way, let us look at the scenario described:

> So when we have 3 stages:
> - `ShuffleMapStage1` (`hostA_exec`, `hostB_exec`), determinate
> - `ShuffleMapStage2` (`hostA_exec`, `hostB_exec`), indeterminate
> - `ResultStage` depending on `ShuffleMapStage1` and `ShuffleMapStage2`
>
> A `FetchFailure` when `ResultStage` is fetching from `ShuffleMapStage1` will lead to failing both `ShuffleMapStage1` and `ResultStage` and even removing the executor so removing the map output as well. So when `ResultStage` is resubmitted its parent `ShuffleMapStage2` will be missing too.

There are a bunch of cases here, and we will need to analyze their impact. I am focusing on two main cases:

For the common case, when `ResultStage` is re-executed, it will hit a `FetchFailure` when fetching the output of `ShuffleMapStage2` - which then results in the job being aborted by the existing fetch failure handling code (this boils down to the simple case of an existing indeterminate shuffle-map-stage -> result-stage, with the result-stage having completed tasks).

Having said that, I will need to recheck whether `handleExecutorLost` handles the impact on an indeterminate stage properly; from a cursory read, it seems likely that the scenario you described, @attilapiros, is not handled there.
That is, if `fileLost == true`, we might want to do something similar to what we are doing in `handleTaskCompletion` when there is a fetch failure?
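
To make the question about the `fileLost == true` path concrete, here is a minimal toy model of the three-stage scenario. It is written in Python purely for illustration and is not Spark's Scala code: the `Stage` class, the `on_executor_lost` function and the action strings are all hypothetical names. It only encodes the rule as described in this thread, namely that losing the output of an indeterminate shuffle map stage should abort the job when a dependent result stage already has completed tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stand-in for a scheduler stage; not a Spark class."""
    name: str
    indeterminate: bool = False      # only consulted for shuffle map stages here
    is_result_stage: bool = False
    parents: tuple = ()              # names of parent stages
    output_executors: set = field(default_factory=set)  # executors holding map output
    completed_tasks: int = 0         # tasks whose output is already committed


def on_executor_lost(stages, lost_executor, file_lost):
    """Return {stage name: action} for every stage affected by the loss."""
    actions = {}
    if not file_lost:
        # Assumption: shuffle files survived the executor, so nothing is redone.
        return actions
    for stage in stages:
        if stage.is_result_stage or lost_executor not in stage.output_executors:
            continue
        if not stage.indeterminate:
            # Determinate output can be regenerated partition by partition.
            actions[stage.name] = "recompute lost map output"
            continue
        # Indeterminate output may differ on rerun, so the whole stage is redone
        # and every stage that consumed it has to be reconsidered.
        actions[stage.name] = "rerun all tasks"
        for child in stages:
            if stage.name not in child.parents:
                continue
            if child.is_result_stage and child.completed_tasks > 0:
                # Committed result output cannot be rolled back.
                actions[child.name] = "abort job"
            else:
                actions[child.name] = "rerun all tasks"
    return actions


stages = [
    Stage("ShuffleMapStage1", output_executors={"hostA_exec", "hostB_exec"}),
    Stage("ShuffleMapStage2", indeterminate=True,
          output_executors={"hostA_exec", "hostB_exec"}),
    Stage("ResultStage", is_result_stage=True,
          parents=("ShuffleMapStage1", "ShuffleMapStage2"), completed_tasks=1),
]
actions = on_executor_lost(stages, "hostA_exec", file_lost=True)
for name, action in actions.items():
    print(f"{name}: {action}")
```

In this sketch, losing `hostA_exec` with `file_lost=True` marks `ShuffleMapStage1` for partial recomputation, `ShuffleMapStage2` for a full rerun, and `ResultStage` for a job abort because one of its tasks has already completed. Whether `handleExecutorLost` should actually behave this way is exactly the open question above.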