Thanks Ryan and Suds for the suggestions, we are looking into these options.
We currently don't have any external catalog or locking service and depend purely on commit retries. Additionally, we don't have any of our meta data in Hive Metastore, and, we want to leverage the underlying filesystem to read the table metadata, using the splitable nature of Iceberg's metadata. I think to be able to keep split planning the way it's done today and achieve consistency we need to be able to swap metadata consistently we would need to be able to acquire / release lock (using ZK or otherwise) in our CustomTableOperations's *doCommit* implementation. Thanks for the guidance, -Gautam. On Tue, Jan 28, 2020 at 2:55 PM Ryan Blue <rb...@netflix.com> wrote: > Thanks for pointing out those references, suds! > > And thanks to Mouli (for writing the doc) and Anton (for writing the test)! > > On Tue, Jan 28, 2020 at 2:05 PM suds <sudssf2...@gmail.com> wrote: > >> We have referred https://iceberg.incubator.apache.org/custom-catalog/ and >> implemented atomic operation using dynamo optimistic locking. Iceberg >> codebase has has excellent test case to validate custom implementation. >> >> https://github.com/apache/incubator-iceberg/blob/master/hive/src/test/java/org/apache/iceberg/hive/TestHiveTableConcurrency.java >> >> >> On Tue, Jan 28, 2020 at 1:35 PM Ryan Blue <rb...@netflix.com.invalid> >> wrote: >> >>> Hi Gautam, >>> >>> Hadoop tables are not intended to be used when the file system doesn't >>> support atomic rename because of the problems you describe. Atomic rename >>> is a requirement for correctness in Hadoop tables. >>> >>> That is why we also have metastore tables, where some other atomic swap >>> is used. I strongly recommend using a metastore-based solution when your >>> underlying file system doesn't support atomic rename, like the Hive >>> catalog. We've also made it easy to plug in your own metastore using the >>> `BaseMetastore` classes. >>> >>> That said, if you have an idea to make Hadoop tables better, I'm all for >>> getting it in. But version hint file aside, without atomic rename, two >>> committers could still conflict and cause one of the commits to be dropped >>> because the second one to create any particular version's metadata file may >>> succeed. I don't see a way around this. >>> >>> If you don't want to use a metastore, then you could rely on a write >>> lock provided by ZooKeeper or something similar. >>> >>> On Tue, Jan 28, 2020 at 12:22 PM Gautam <gautamkows...@gmail.com> wrote: >>> >>>> Hello Devs, >>>> We are currently working on building out a high >>>> write throughput pipeline with Iceberg where hundreds or thousands of >>>> writers (and thousands of readers) could be accessing a table at any given >>>> moment. We are facing the issue called out by [1]. According to Iceberg's >>>> spec on write reliability [2], the writers depend on an atomic swap, which >>>> if fails should retry. While this may be true there can be instances where >>>> the current write has the latest table state but still fails to perform the >>>> swap or even worse the Reader sees an inconsistency while the write is >>>> being made. To my understanding, this stems from the fact that the current >>>> code [3] that does the swap assumes that the underlying filesystem provides >>>> an atomic rename api ( like hdfs et al) to the version hint file which >>>> keeps track of the current version. If the filesystem does not provide this >>>> then it fails with a fatal error. I think Iceberg should provide some >>>> resiliency here in committing the version once it knows that the latest >>>> table state is still valid and more importantly ensure the readers never >>>> fail during commit. If we agree I can work on adding this into Iceberg. >>>> >>>> How are folks handling write/read consistency cases where the >>>> underlying fs doesn't provide atomic apis for file overwrite/rename? We'v >>>> outlined the details in the attached issue#758 [1] .. What do folks think? >>>> >>>> Cheers, >>>> -Gautam. >>>> >>>> [1] - https://github.com/apache/incubator-iceberg/issues/758 >>>> [2] - https://iceberg.incubator.apache.org/reliability/ >>>> [3] - >>>> https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L220 >>>> >>> >>> >>> -- >>> Ryan Blue >>> Software Engineer >>> Netflix >>> >> > > -- > Ryan Blue > Software Engineer > Netflix >