attilapiros commented on PR #50630: URL: https://github.com/apache/spark/pull/50630#issuecomment-2814650652
@mridulm
> Only unsuccessful (and so uncommitted) tasks are candidates for (re)execution (and so commit) - not completed tasks.
> So if a partition has completed task commit, it wont be reexecuted - spark ensures this w.r.t use of FileOutputCommitter

But that is also bad for an indeterminate stage, as the data becomes inconsistent: the committed partitions come from a previous, old computation, while the resubmitted ones come from the new one. To illustrate:

```scala
scala> import org.apache.spark.sql.functions.udf

scala> val myudf = udf(() => { val rnd = new java.util.Random(); rnd.nextInt(10) }).asNondeterministic()

scala> spark.udf.register("myudf", myudf)

scala> val df = sql("SELECT rand, count(rand) as cnt from (SELECT myudf() as rand from explode(sequence(1, 1000))) GROUP BY rand")

scala> df.show
+----+---+
|rand|cnt|
+----+---+
|   1|122|
|   6|110|
|   3|111|
|   5| 85|
|   9| 99|
|   4| 94|
|   8| 93|
|   7| 88|
|   2| 98|
|   0|100|
+----+---+

scala> df.selectExpr("sum(cnt)").show
+--------+
|sum(cnt)|
+--------+
|    1000|
+--------+
```

So if we write `df` to a table, some but not all result tasks were successful, and a resubmit happened, we might get an inconsistent result where `sum(cnt)` is no longer 1000 when we load the data back: the resubmit reruns the shuffle map stage, which regenerates the random values with a different distribution of the values from 0 to 9. The complete shuffle map stage is re-executed, but the result stage is not.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
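The failure mode described above can be simulated without Spark. The following is a minimal, self-contained Scala sketch (the seed values, the 0–4 / 5–9 committed-key split, and the `countsFor` helper are all hypothetical, chosen only for illustration): each call to `countsFor` stands in for one execution of the nondeterministic shuffle map stage, and mixing committed counts from the first attempt with resubmitted counts from the second attempt generally breaks the invariant that the counts sum to 1000.

```scala
import scala.util.Random

object PartialCommitDemo {
  // Count occurrences of each value 0..9 in 1000 draws with a given seed.
  // Stands in for one execution of the nondeterministic map stage:
  // a different seed models the recomputation producing different values.
  def countsFor(seed: Long): Map[Int, Int] = {
    val rnd = new Random(seed)
    Seq.fill(1000)(rnd.nextInt(10)).groupBy(identity).map { case (k, v) => k -> v.size }
  }

  def main(args: Array[String]): Unit = {
    val firstRun  = countsFor(seed = 1L) // attempt 1: some result tasks commit
    val secondRun = countsFor(seed = 2L) // attempt 2: map stage recomputed

    // Suppose the result tasks for keys 0..4 committed in the first attempt;
    // the remaining tasks are re-executed and read the recomputed shuffle output.
    val committed   = firstRun.filter { case (k, _) => k < 5 }
    val resubmitted = secondRun.filter { case (k, _) => k >= 5 }

    val mixedTotal = (committed.values ++ resubmitted.values).sum
    println(s"total from one consistent run: ${firstRun.values.sum}") // always 1000
    println(s"total from mixed attempts:     $mixedTotal")            // usually != 1000
  }
}
```

Any single consistent run sums to exactly 1000, but the mixed table, whose partitions come from two different executions of a nondeterministic stage, has no such guarantee.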