attilapiros commented on PR #50630:
URL: https://github.com/apache/spark/pull/50630#issuecomment-2814650652

   @mridulm 
   
   > Only unsuccessful (and so uncommitted) tasks are candidates for (re)execution (and so commit) - not completed tasks.
   > So if a partition has completed task commit, it wont be reexecuted - spark ensures this w.r.t use of FileOutputCommitter
   
   But that is still a problem for an indeterminate stage, because the written data becomes inconsistent: the already committed partitions come from the earlier computation, while the resubmitted partitions come from the new one.
   
   To illustrate it:
   ```
   scala> import org.apache.spark.sql.functions.udf
   scala> val myudf = udf(() => { val rnd = new java.util.Random(); rnd.nextInt(10) }).asNondeterministic()
   scala> spark.udf.register("myudf", myudf)
   scala> val df = sql("SELECT rand, count(rand) as cnt from (SELECT myudf() as rand from explode(sequence(1, 1000))) GROUP BY rand")
   scala> df.show
   +----+---+
   |rand|cnt|
   +----+---+
   |   1|122|
   |   6|110|
   |   3|111|
   |   5| 85|
   |   9| 99|
   |   4| 94|
   |   8| 93|
   |   7| 88|
   |   2| 98|
   |   0|100|
   +----+---+
   scala> df.selectExpr("sum(cnt)").show
   +--------+
   |sum(cnt)|
   +--------+
   |    1000|
   +--------+
   ``` 
    
   So if we write `df` to a table and some, but not all, of the tasks succeed before a resubmit happens, we might get an inconsistent result: `sum(cnt)` may not be 1000 when we load the data back. The resubmit can rerun the shuffle map stage, which regenerates the random values with a different distribution over the values 0 to 9. The shuffle map stage is re-executed completely, but the result stage is not, so the table ends up mixing counts from two different runs.
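
A minimal plain-Scala sketch of this effect, with no Spark involved. It only simulates the scenario: `MixedCommitSketch`, `shuffleMapAttempt`, `mixedOutput`, the seeds, and the choice of which partitions were "committed" are all hypothetical names and values picked for illustration, not anything from the Spark scheduler. It also assumes one result partition per key to keep things simple.

```scala
import scala.util.Random

object MixedCommitSketch {
  // One attempt of the (simulated) shuffle map stage:
  // draw 1000 values in [0, 10) and count them per key.
  def shuffleMapAttempt(seed: Long): Map[Int, Int] = {
    val rnd = new Random(seed)
    Seq.fill(1000)(rnd.nextInt(10))
      .groupBy(identity)
      .map { case (k, v) => k -> v.size }
  }

  // Simulated table content after a partial failure: partitions that were
  // already committed keep the counts from the old attempt, the others are
  // written from the re-executed attempt.
  def mixedOutput(
      oldAttempt: Map[Int, Int],
      newAttempt: Map[Int, Int],
      committed: Set[Int]): Map[Int, Int] =
    (0 until 10).map { k =>
      val source = if (committed(k)) oldAttempt else newAttempt
      k -> source.getOrElse(k, 0)
    }.toMap

  def main(args: Array[String]): Unit = {
    val first = shuffleMapAttempt(seed = 1L)   // arbitrary seed
    val second = shuffleMapAttempt(seed = 2L)  // arbitrary, different seed
    val table = mixedOutput(first, second, committed = Set(0, 1, 2, 3, 4))

    println(s"first attempt  sum(cnt) = ${first.values.sum}")
    println(s"second attempt sum(cnt) = ${second.values.sum}")
    println(s"mixed table    sum(cnt) = ${table.values.sum}")
  }
}
```

Each attempt on its own always sums to 1000, but nothing forces the mixed table to: its total is the old counts for keys 0 to 4 plus the new counts for keys 5 to 9, and those two halves come from different random draws.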


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

