We followed https://iceberg.incubator.apache.org/custom-catalog/ and implemented the atomic commit operation using DynamoDB optimistic locking. The Iceberg codebase has an excellent test case for validating custom implementations: https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java
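For anyone curious, the optimistic-locking pattern is roughly the following. This is a minimal in-memory sketch, not our actual code: the class and method names are illustrative, and the stand-in's `put_if_version` only mimics the semantics of a DynamoDB conditional write (a ConditionExpression on a stored version attribute), which is what makes the swap atomic.

```python
import threading


class VersionConflict(Exception):
    """Raised when the conditional write fails because another writer won."""


class DynamoStandIn:
    """In-memory stand-in for a DynamoDB table keyed by table name.

    Hypothetical schema: table name -> (current version, metadata location).
    put_if_version mimics a conditional write: it succeeds only if the
    stored version still matches the caller's expected version.
    """

    def __init__(self):
        self._rows = {}  # table -> (version, metadata_location)
        self._mutex = threading.Lock()

    def current(self, table):
        with self._mutex:
            return self._rows.get(table, (0, None))

    def put_if_version(self, table, expected_version, new_location):
        with self._mutex:
            version, _ = self._rows.get(table, (0, None))
            if version != expected_version:
                raise VersionConflict(
                    "expected v%d, found v%d" % (expected_version, version))
            self._rows[table] = (version + 1, new_location)
            return version + 1


def commit(store, table, base_version, new_location, retries=3):
    """Commit loop: on conflict, refresh the base version and retry.

    A real implementation would also rewrite the metadata file against the
    refreshed table state before retrying, mirroring Iceberg's
    re-apply-and-retry commit semantics; that step is elided here.
    """
    for _ in range(retries):
        try:
            return store.put_if_version(table, base_version, new_location)
        except VersionConflict:
            base_version, _ = store.current(table)
    raise VersionConflict("gave up after %d retries" % retries)
```

Because the swap is a single conditional write, a writer holding a stale version fails cleanly and can refresh and retry, rather than silently clobbering another writer's commit.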
On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Gautam,
>
> Hadoop tables are not intended to be used when the file system doesn't
> support atomic rename, because of the problems you describe. Atomic rename
> is a requirement for correctness in Hadoop tables.
>
> That is why we also have metastore tables, where some other atomic swap is
> used. I strongly recommend using a metastore-based solution when your
> underlying file system doesn't support atomic rename, like the Hive
> catalog. We've also made it easy to plug in your own metastore using the
> `BaseMetastore` classes.
>
> That said, if you have an idea to make Hadoop tables better, I'm all for
> getting it in. But version hint file aside, without atomic rename, two
> committers could still conflict and cause one of the commits to be
> dropped, because the second one to create any particular version's
> metadata file may succeed. I don't see a way around this.
>
> If you don't want to use a metastore, then you could rely on a write lock
> provided by ZooKeeper or something similar.
>
> On Tue, Jan 28, 2020 at 12:22 PM Gautam <gautamkows...@gmail.com> wrote:
>
>> Hello Devs,
>> We are currently working on building out a high-write-throughput
>> pipeline with Iceberg where hundreds or thousands of writers (and
>> thousands of readers) could be accessing a table at any given moment.
>> We are facing the issue called out in [1]. According to Iceberg's spec
>> on write reliability [2], writers depend on an atomic swap and should
>> retry if it fails. While this may be true, there can be instances where
>> the current write has the latest table state but still fails to perform
>> the swap, or worse, a reader sees an inconsistency while the write is
>> being made.
>> To my understanding, this stems from the fact that the current code [3]
>> that performs the swap assumes the underlying filesystem provides an
>> atomic rename API (like HDFS et al.) for the version hint file, which
>> keeps track of the current version. If the filesystem does not provide
>> this, the commit fails with a fatal error. I think Iceberg should
>> provide some resiliency here by committing the version once it knows
>> that the latest table state is still valid, and more importantly by
>> ensuring that readers never fail during a commit. If we agree, I can
>> work on adding this to Iceberg.
>>
>> How are folks handling write/read consistency where the underlying fs
>> doesn't provide atomic APIs for file overwrite/rename? We've outlined
>> the details in the attached issue #758 [1]. What do folks think?
>>
>> Cheers,
>> -Gautam.
>>
>> [1] - https://github.com/apache/incubator-iceberg/issues/758
>> [2] - https://iceberg.incubator.apache.org/reliability/
>> [3] - https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
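For the archive: the write-lock alternative Ryan mentions (ZooKeeper or similar) amounts to serializing the check-and-swap of the version pointer. A minimal sketch, with `threading.Lock` standing in for a distributed lock and all names illustrative rather than Iceberg's actual API:

```python
import threading


class LockedTableOperations:
    """Serializes the check-and-swap of the current-version pointer behind
    a mutual-exclusion lock, so no atomic rename is required from the
    filesystem. threading.Lock stands in for a ZooKeeper write lock here.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.metadata_location = None

    def commit(self, base_version, new_location):
        with self._lock:
            if self.version != base_version:
                # A concurrent writer committed first; the caller must
                # refresh the table state and retry its changes.
                return False
            self.version += 1
            self.metadata_location = new_location
            return True
```

With two writers starting from the same base version, exactly one commit succeeds and the other detects the conflict cleanly and can retry, instead of one commit being silently dropped as can happen with a non-atomic rename.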