When describing the resilience of the magic committer to failures during task commit, the docs state:
"If the .pendingset file has been saved to the job attempt directory, the task has effectively committed, it has just failed to report to the controller. This will cause complications during job commit, as there may be two task PendingSet committing the same files, or committing files with *Proposed*: track task ID in pendingsets, recognise duplicates on load and then respond by cancelling one set and committing the other. (or fail?)" As far as I can tell from reading over the code, the proposal was not implemented. Is this still considered a viable solution? If so, I'd be happy to take a crack at it For context: We are running spark jobs using the magic committer in a k8s environment where executor loss is somewhat common (maybe about 2% of executors are terminated by the cluster). By default, spark's OutputCommitCoordinator fails the whole write stage (and parent job) if an executor fails while it an "authorized committer" (see https://issues.apache.org/jira/browse/SPARK-39195). This results in expensive job failures at the final write stage. We can avoid those failures by disabling the outputCommitCoordinator entirely, but that leaves us open to the failure mode described above in the s3a docs. It seems to me that if the magic committer implemented the above proposal to track task ids in the pendingset, the OutputCommitCoordinator would no longer be needed at all. I think this ticket ( https://issues.apache.org/jira/browse/SPARK-19790) discusses similar ideas about how to make spark's commit coordiantor work with the s3a committers