attilapiros commented on PR #50033: URL: https://github.com/apache/spark/pull/50033#issuecomment-2801616710
> If yes, we could be more aggressive when handling this case -
> Invalidate all downstream shuffle output
> Any result stage which has/had started, and not completed - fail that job.
> Does this align with your observations/analysis @attilapiros ?

We are getting closer.

> Specifically about JDBC - assuming it is not due to the case we discussed above - I am not entirely sure :-)
> If the commit protocol has been correctly implemented, we will need to understand that better ...

There is transaction management for writing the rows of a partition (that is, per task):
https://github.com/apache/spark/blob/1fa05b8cb755bbf2432a37a96bcaf329982b7684/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L804-L807

But as far as I can see, re-executing a task will still produce duplicate rows.
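
To illustrate the point about per-partition transactions, here is a minimal sketch. It is not Spark's code: it uses Python's built-in `sqlite3` as a stand-in for a JDBC target, and the helper `write_partition`, the `fail_after` parameter, and the table `target` are hypothetical names chosen for the example. It assumes the target table has no unique key and that the writer does plain inserts.

```python
import sqlite3


def write_partition(conn, rows, fail_after=None):
    """Write one partition's rows in a single transaction (hypothetical helper)."""
    try:
        for i, row in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated task failure")
            conn.execute("INSERT INTO target (id, value) VALUES (?, ?)", row)
        conn.commit()    # all rows of the partition become visible together
    except Exception:
        conn.rollback()  # a failed attempt leaves no partial rows behind
        raise


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, value TEXT)")  # no unique key (assumption)
conn.commit()
partition = [(1, "a"), (2, "b"), (3, "c")]

# Attempt 1 fails midway: the rollback discards the rows written so far.
try:
    write_partition(conn, partition, fail_after=2)
except RuntimeError:
    pass
count_after_failed_attempt = row_count(conn)

# Attempt 2 succeeds and commits the partition.
write_partition(conn, partition)
count_after_success = row_count(conn)

# Re-executing the already committed task inserts the same rows again.
write_partition(conn, partition)
count_after_rerun = row_count(conn)

print(count_after_failed_attempt, count_after_success, count_after_rerun)  # prints: 0 3 6
```

In this sketch the transaction protects against a task that fails partway through, but it does nothing for a task that already committed and is then run again, which is the duplication scenario described above.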