attilapiros commented on PR #50033:
URL: https://github.com/apache/spark/pull/50033#issuecomment-2799692246

   @mridulm IMHO, for an indeterminate result stage we should abort the 
stage more aggressively, because none of its tasks can safely be executed a 
second time: on the executor side, repeating the operation with different 
data can lead to corrupted results.
   
   One example is using 
[FileOutputCommitter](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.html#commitTask-org.apache.hadoop.mapreduce.TaskAttemptContext-).
  Here I cannot see any guarantee that re-executing a Hadoop task commit 
with different data is safe.
   
   The other, better example is writing to an external DB via JDBC:
   
   Here you can see that it iterates over the partitions and issues INSERT INTO statements:
   
   
https://github.com/apache/spark/blob/1fa05b8cb755bbf2432a37a96bcaf329982b7684/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L990-L992
   
   where each INSERT INTO statement is SQL dialect specific:
   
   
https://github.com/apache/spark/blob/1fa05b8cb755bbf2432a37a96bcaf329982b7684/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L983
   
   So any re-execution will duplicate the data.
   
   I think this is why https://issues.apache.org/jira/browse/SPARK-25342 was 
opened.
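   
   To make the JDBC case concrete, here is a minimal sketch of the failure mode. It is not Spark code: it assumes an in-memory SQLite table as a stand-in for the external DB, and the names `write_partition`, `indeterminate_rows` and `run_demo` are hypothetical helpers invented for this illustration. The same partition is written by two attempts that see different input, the way a re-executed task of an indeterminate stage would:

```python
import sqlite3


def write_partition(conn, partition_id, rows):
    """Append one partition's rows with plain INSERTs, then commit.

    Hypothetical stand-in for a per-partition JDBC write: there is no
    upsert and no cleanup of rows left behind by an earlier attempt.
    """
    conn.executemany(
        "INSERT INTO target (partition_id, value) VALUES (?, ?)",
        [(partition_id, value) for value in rows],
    )
    conn.commit()


def indeterminate_rows(attempt, n=3):
    """Stand-in for indeterminate upstream output.

    Assumption for the demo: each attempt simply sees different values.
    """
    return [attempt * 100 + i for i in range(n)]


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (partition_id INTEGER, value INTEGER)")

    # First attempt of the task for partition 0 succeeds and commits.
    first = indeterminate_rows(attempt=1)
    write_partition(conn, 0, first)

    # The stage is re-executed and the same partition is written again,
    # this time with different input data.
    second = indeterminate_rows(attempt=2)
    write_partition(conn, 0, second)

    stored = [
        value
        for (value,) in conn.execute(
            "SELECT value FROM target WHERE partition_id = 0 ORDER BY rowid"
        )
    ]
    conn.close()
    return first, second, stored


if __name__ == "__main__":
    first, second, stored = run_demo()
    print("attempt 1 wrote:", first)
    print("attempt 2 wrote:", second)
    print("table now holds:", stored)
```

   After both attempts the table holds the rows of the first attempt plus the rows of the second one, so partition 0 ends up with twice as many rows as either attempt produced, and neither attempt's output is what a reader of the table sees.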
   
   @mridulm WDYT?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
