attilapiros commented on PR #50033: URL: https://github.com/apache/spark/pull/50033#issuecomment-2800042901
> For the https://github.com/apache/spark/pull/50033#discussion_r2040777376, map stage is determinate - so reexecution will not change input data for 'reducer' (though can change order) - same as spec execution for a partition of this result stage.

Even if the map stage is determinate, a fetch failure will lead to an executor loss, which can remove the map output of the indeterminate stage. When the result stage is resubmitted, the indeterminate parent will then be detected as missing and will be resubmitted as well.

So when we have 3 stages:
- `ShuffleMapStage1` (`hostA_exec`, `hostB_exec`), determinate
- `ShuffleMapStage2` (`hostA_exec`, `hostB_exec`), indeterminate
- `ResultStage` depending on `ShuffleMapStage1` and `ShuffleMapStage2`

A `FetchFailure` while `ResultStage` is fetching from `ShuffleMapStage1` will fail both `ShuffleMapStage1` and `ResultStage`. And when `ResultStage` is resubmitted, its parent `ShuffleMapStage2` will be missing.

I believe this is tested in https://github.com/apache/spark/blob/2d3cb7870fc597ad23067ddd21c67d4c3aa2e80b/core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala#L3222-L3271

Although the test name says " failing ShuffleMap stage is inDeterminate", the `shuffleId1` used for the fetch failure belongs to the determinate stage.

> If you https://github.com/apache/spark/pull/50033#discussion_r2040777376 or https://github.com/apache/spark/pull/50033#discussion_r2040764174, then that is no different from how speculative tasks for indeterminate stage(s) behave : if all partitions had completed and committed, they dont need to be recomputed

Could it be that in this case even speculative tasks lead to errors when writing to JDBC? Repeated `insert into`s duplicate the data, and that is the best case, when the data is the same. If this is documented, it's fine, since with primary keys in the target schema this can be detected.
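To make the three-stage scenario above easier to follow, here is a minimal toy model in Python. It is not Spark code: the `ShuffleMapStage` class and the helpers `remove_outputs_on_executor` and `missing_parents` are hypothetical names made up for illustration, and the model assumes that losing an executor drops every map output stored on it, regardless of which shuffle it belongs to.

```python
from dataclasses import dataclass, field


@dataclass
class ShuffleMapStage:
    """Toy stand-in for a shuffle map stage (hypothetical, not Spark's class)."""
    name: str
    determinate: bool
    # partition index -> executor currently holding that partition's map output
    outputs: dict = field(default_factory=dict)
    num_partitions: int = 2

    def is_available(self) -> bool:
        # A stage is available only if every partition still has a map output.
        return len(self.outputs) == self.num_partitions


def remove_outputs_on_executor(stages, exec_id):
    """Assumption: an executor loss drops its map outputs for all shuffles."""
    for stage in stages:
        stage.outputs = {p: e for p, e in stage.outputs.items() if e != exec_id}


def missing_parents(parents):
    """Names of parent stages that must be recomputed before the child runs."""
    return [s.name for s in parents if not s.is_available()]


stage1 = ShuffleMapStage("ShuffleMapStage1", True,
                         {0: "hostA_exec", 1: "hostB_exec"})
stage2 = ShuffleMapStage("ShuffleMapStage2", False,
                         {0: "hostA_exec", 1: "hostB_exec"})
parents = [stage1, stage2]

# FetchFailure while ResultStage reads ShuffleMapStage1's output on hostA_exec:
# the executor is treated as lost, so its outputs for both shuffles disappear.
remove_outputs_on_executor(parents, "hostA_exec")

# On resubmission of ResultStage, both parents are found missing,
# including the indeterminate one that was not the source of the failure.
missing = missing_parents(parents)
indeterminate_missing = [s.name for s in parents
                         if not s.is_available() and not s.determinate]
print(missing)
print(indeterminate_missing)
```

In this sketch the failure is reported against the determinate `ShuffleMapStage1`, yet the indeterminate `ShuffleMapStage2` also ends up in the missing list, which is the situation the comment is concerned about.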