Hello Devs,

We are currently working on building a high write-throughput pipeline with Iceberg, where hundreds or thousands of writers (and thousands of readers) could be accessing a table at any given moment. We are facing the issue called out in [1].

According to Iceberg's spec on write reliability [2], writers depend on an atomic swap which, if it fails, should be retried. While this is true, there can be instances where the current writer has the latest table state but still fails to perform the swap, or worse, a reader sees an inconsistent state while the write is in progress. To my understanding, this stems from the fact that the current code [3] that performs the swap assumes the underlying filesystem provides an atomic rename API (as HDFS does) for swapping in the new table version and updating the version hint file that tracks the current version. If the filesystem does not provide this, the commit fails with a fatal error.

I think Iceberg should provide some resiliency here: commit the version once it knows the latest table state is still valid and, more importantly, ensure that readers never fail while a commit is in progress. If we agree, I can work on adding this to Iceberg.
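To make the reader-side part of the idea concrete, here is a rough sketch of the kind of fallback I'm picturing. This is not actual Iceberg code and the class/method names are just for illustration; it assumes the HadoopTableOperations layout where the current version is tracked in version-hint.text next to v<N>.metadata.json files. If the hint file is missing or caught mid-overwrite, the reader lists the versioned metadata files and takes the highest version instead of failing:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper, not part of Iceberg today.
    class ResilientVersionResolver {
      // Resolve the current table version under <metadataDir>.
      static int currentVersion(FileSystem fs, Path metadataDir) throws Exception {
        Path hint = new Path(metadataDir, "version-hint.text");
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(hint), StandardCharsets.UTF_8))) {
          // Normal path: trust the version hint file.
          return Integer.parseInt(in.readLine().trim());
        } catch (Exception e) {
          // Hint file missing or being rewritten non-atomically:
          // scan v<N>.metadata.json files and take the highest version.
          int max = -1;
          for (FileStatus status : fs.listStatus(metadataDir)) {
            String name = status.getPath().getName(); // e.g. "v12.metadata.json"
            if (name.startsWith("v") && name.endsWith(".metadata.json")) {
              try {
                int version = Integer.parseInt(
                    name.substring(1, name.length() - ".metadata.json".length()));
                max = Math.max(max, version);
              } catch (NumberFormatException ignored) {
                // not a versioned metadata file, skip it
              }
            }
          }
          if (max < 0) {
            throw new IllegalStateException("No table metadata found in " + metadataDir);
          }
          return max;
        }
      }
    }

The writer side would be similar in spirit: on a failed swap, re-check whether the table state it committed against is still the latest before giving up, rather than treating the rename failure as fatal.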
How are folks handling write/read consistency in cases where the underlying FS doesn't provide atomic APIs for file overwrite/rename? We've outlined the details in the attached issue #758 [1]. What do folks think?

Cheers,
-Gautam.

[1] - https://github.com/apache/incubator-iceberg/issues/758
[2] - https://iceberg.incubator.apache.org/reliability/
[3] - https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220