mridulm commented on PR #50033:
URL: https://github.com/apache/spark/pull/50033#issuecomment-2800300563

   > Even if the map stage is determinate, a fetch failure will lead to executor 
loss, which can remove the map output of the indeterminate stage, and when the 
result stage is resubmitted the indeterminate parent will be detected as missing 
and will be resubmitted.
   
   The indeterminate stage here is a `ResultStage` - so the scenario described 
does not apply (no shuffle output).
   We do need to ensure committed task output is handled properly, which is 
similar to the cases of speculative execution, node loss, task re-execution due 
to non-shuffle fetch failures, ...
   
   For the case of a shuffle map stage, `submitMissingTasks` does clear the 
shuffle output of the stage when a new attempt is resubmitted (the first 
match statement).
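
   To make the clearing step concrete, here is a minimal toy model in Python. It 
is not Spark's Scala code: `ToyShuffleMapStage` and this `submit_missing_tasks` 
are hypothetical names, and the condition (indeterminate stage whose output is 
only partially available) is an assumption about what that first match 
statement checks.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShuffleMapStage:
    """Hypothetical stand-in for a shuffle map stage; not Spark's class."""
    shuffle_id: int
    num_partitions: int
    is_indeterminate: bool
    # partition index -> executor currently holding that map output
    outputs: dict = field(default_factory=dict)

    def is_available(self) -> bool:
        return len(self.outputs) == self.num_partitions


def submit_missing_tasks(stage: ToyShuffleMapStage) -> list:
    """Return the partitions a new attempt of `stage` has to compute."""
    # Assumed clearing rule: a partially available indeterminate stage drops
    # every existing output, so the whole stage is recomputed.
    if stage.is_indeterminate and not stage.is_available():
        stage.outputs.clear()
    return [p for p in range(stage.num_partitions) if p not in stage.outputs]


# Indeterminate stage that lost one of two outputs: everything is recomputed.
indeterminate = ToyShuffleMapStage(2, 2, True, {0: "hostB_exec"})
print(submit_missing_tasks(indeterminate))  # [0, 1]

# Determinate stage in the same situation: only the lost partition is rerun.
determinate = ToyShuffleMapStage(1, 2, False, {0: "hostB_exec"})
print(submit_missing_tasks(determinate))  # [1]
```

   The point of the model is only the asymmetry: a determinate stage keeps its 
surviving outputs, an indeterminate one does not.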
   
   With that out of the way, let us look at the scenario described:
   
   > So when we have 3 stages:
   > - `ShuffleMapStage1` (`hostA_exec`, `hostB_exec`), determinate
   > - `ShuffleMapStage2` (`hostA_exec`, `hostB_exec`), indeterminate
   > - `ResultStage` depending on `ShuffleMapStage1` and `ShuffleMapStage2`
   > 
   > A `FetchFailure` when `ResultStage` is fetching from `ShuffleMapStage1` 
will lead to failing both `ShuffleMapStage1` and `ResultStage` and even 
removing the executor so removing the map output as well. So when `ResultStage` 
is resubmitted its parent `ShuffleMapStage2` will be missing too.
   
   There are a bunch of cases here, and we will need to analyze their impact. I 
am focusing on two main cases:
   
   For the common case, when `ResultStage` is re-executed, it will hit a 
`FetchFailure` when fetching the output of `ShuffleMapStage2`, which then 
results in the job being aborted by the existing fetch failure handling code 
(this boils down to the simple, existing case of indeterminate 
shuffle-map-stage -> result-stage, with the result-stage having completed tasks).
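
   As a rough illustration of that decision, here is a toy model in Python 
rather than Spark's actual Scala code. `ToyResultStage` and `on_fetch_failure` 
are hypothetical names, and the rule encoded here (abort when the failed parent 
is indeterminate and the result stage already has finished tasks) is my reading 
of the existing handling, stated as an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyResultStage:
    """Hypothetical stand-in for a result stage; not Spark's class."""
    num_tasks: int
    num_missing_partitions: int  # partitions that have not finished yet

    def has_completed_tasks(self) -> bool:
        return self.num_missing_partitions < self.num_tasks


def on_fetch_failure(map_stage_indeterminate: bool,
                     result_stage: ToyResultStage) -> str:
    """Decide what a fetch failure means for the job in this toy model."""
    if map_stage_indeterminate and result_stage.has_completed_tasks():
        # Recomputing an indeterminate parent may change its output, and
        # finished result tasks cannot be rolled back, so the job is aborted.
        return "abort job"
    # Otherwise the failed stages can simply be resubmitted.
    return "resubmit stages"


partly_done = ToyResultStage(num_tasks=4, num_missing_partitions=2)

# Fetch failure against the determinate ShuffleMapStage1.
print(on_fetch_failure(False, partly_done))  # resubmit stages

# Fetch failure against the indeterminate ShuffleMapStage2.
print(on_fetch_failure(True, partly_done))  # abort job
```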
   
   
   Having said that:
   
   I will need to recheck whether `handleExecutorLost` handles the impact on 
indeterminate stages properly, but from a cursory read, it is likely that the 
scenario you described, @attilapiros, is not handled there?
   
   That is, if `fileLost == true`, we might want to do something similar to 
what we are doing in `handleTaskCompletion` when there is a fetch failure?
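
   One possible shape of that check, purely as a sketch: the toy Python function 
below removes the outputs held by a lost executor and reports which 
indeterminate stages were affected. None of this is Spark's API; the function 
name, the `stages` layout, and the idea of feeding the result into the same 
rollback or abort logic are all assumptions for illustration.

```python
def indeterminate_stages_losing_output(stages, lost_executor, file_lost):
    """Return shuffle ids of indeterminate stages that lose map output.

    `stages` maps shuffle id -> (is_indeterminate, {partition: executor}).
    All names here are hypothetical; this is not Spark's API.
    """
    if not file_lost:
        # In this model, file_lost=False means the shuffle files are still
        # readable, so no map output has to be unregistered.
        return []
    hit = []
    for shuffle_id, (is_indeterminate, outputs) in sorted(stages.items()):
        lost = [p for p, ex in outputs.items() if ex == lost_executor]
        for p in lost:
            del outputs[p]  # unregister the lost map output
        if is_indeterminate and lost:
            hit.append(shuffle_id)
    return hit


stages = {
    1: (False, {0: "hostA_exec", 1: "hostB_exec"}),  # ShuffleMapStage1
    2: (True, {0: "hostA_exec", 1: "hostB_exec"}),   # ShuffleMapStage2
}
print(indeterminate_stages_losing_output(stages, "hostA_exec", file_lost=True))  # [2]
```

   The stages returned here would be the candidates for the same treatment that 
a fetch failure against an indeterminate stage gets today.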


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

