attilapiros commented on PR #50033: URL: https://github.com/apache/spark/pull/50033#issuecomment-2799692246
@mridulm IMHO, for an indeterminate result stage we should abort more aggressively: we cannot safely re-execute any of its tasks, because on the executor side repeating the operation with different data can lead to corrupted results.

One example is using [FileOutputCommitter](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputCommitter.html#commitTask-org.apache.hadoop.mapreduce.TaskAttemptContext-): I cannot see any guarantee that re-executing a Hadoop task commit with different data is safe.

An even better example is writing to an external DB via JDBC. Here you can see it iterates over the partitions and issues INSERT INTO statements: https://github.com/apache/spark/blob/1fa05b8cb755bbf2432a37a96bcaf329982b7684/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L990-L992 where a single INSERT INTO is SQL-dialect specific: https://github.com/apache/spark/blob/1fa05b8cb755bbf2432a37a96bcaf329982b7684/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L983 So any re-execution will duplicate the data. I think this is why https://issues.apache.org/jira/browse/SPARK-25342 was opened (see also the sketch below).

@mridulm WDYT?
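To illustrate the JDBC case, here is a minimal sketch of a per-partition write. This is not the actual `JdbcUtils.savePartition` code; the JDBC URL, table name and row type are made-up placeholders. The point is only that a plain INSERT is not idempotent:

```scala
import java.sql.DriverManager

// Minimal sketch of a per-partition JDBC write (NOT the real
// JdbcUtils.savePartition implementation; URL, table and schema are
// placeholders).
def writePartition(rows: Iterator[(Int, String)]): Unit = {
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/example")
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO example_table (id, name) VALUES (?, ?)")
    rows.foreach { case (id, name) =>
      stmt.setInt(1, id)
      stmt.setString(2, name)
      // A plain INSERT appends a new row every time: no upsert, no dedup
      // key, and no rollback of rows already written by an earlier,
      // partially completed task attempt.
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}
```

If such a task is re-executed in an indeterminate stage, the retry may see a different set of rows than the first attempt, so whatever the first attempt already inserted stays in the table and the retry adds its own rows on top of it, i.e. the data is duplicated.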