Hi John,

Thanks for the experience report and pointers to resources. :) If we do end up going down that road it'll be super helpful.
Marko

On Thu, Nov 19, 2020 at 12:29 PM John Clara <john.anthony.cl...@gmail.com> wrote:

> Hi,
>
> My team has been using the custom catalog along with atomic metadata
> updates, but we never migrated existing iceberg tables onto it. We also
> haven't turned on integration with the hive catalog, so I'm not sure how
> easy it is to plug in there (I think there was some recent work on
> that?). Dynamo provides a local mock which you could combine with s3mock
> (check the iceberg tests) to test it out (see the client sketch appended
> below):
>
> https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
>
> The only weird things we've run into with dynamo are:
> 1. It seems like we get rate limited by dynamo pretty hard when first
> writing to a new table, until rate limits are adjusted (potentially by
> aws dynamically adjusting dynamo's internal partitions?).
> 2. Make sure to page your scans if you have a lot of values when doing
> lists (we haven't enabled catalog listing yet, but we've run into this
> before; see the paging sketch appended below).
>
> We chose dynamo because we were already using it for other use cases.
> I'm not sure if it's the best aws-provided option for atomic changes.
>
> John
>
> On 11/19/20 10:07 AM, Marko Babic wrote:
> > Hi everyone,
> >
> > At my org we’ve spun up a few Iceberg tables on top of S3 without a
> > metastore (conscious of the consequences) and we’ve arrived at the
> > point that we need to support concurrent writes. :) I was hoping to
> > get some advice as to what the best way to integrate an existing
> > Iceberg table into a Hive Metastore or an alternative might be. We’re
> > still relatively early in our adoption of Iceberg and have no real
> > prior experience with Hive, so I don’t know what I don’t know.
> >
> > Some options we’re weighing:
> >
> > - Existing tables aren’t so big that the moral equivalent of "CREATE
> > TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the
> > question, but we’d prefer not to have to read + rewrite everything. We
> > also have stateful readers (tracking which snapshots they have
> > previously read), and preserving table history would make life easier.
> >
> > - Doing something along the lines of the following and importing the
> > tables into Hive as external tables looks like it should work given my
> > understanding of how Iceberg is using HMS, but I don’t know if it’s
> > encouraged and I haven’t done the diligence to understand potential
> > consequences:
> >
> > ```
> > hive> CREATE EXTERNAL TABLE `existing_table` (...)
> >     LOCATION 's3://existing-table/'
> >     -- serde, input/output formats omitted
> >     TBLPROPERTIES (
> >       -- Assuming the latest metadata file for the Hadoop table is
> >       -- v99.metadata.json, rename it to 00099-uuid.metadata.json so that
> >       -- BaseMetastoreTableOperations can correctly parse the version number.
> >       'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
> >       'table_type'='ICEBERG'
> >     );
> > ```
> >
> > - Others seem to have had success implementing + maintaining a custom
> > catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g.
> > DynamoDB for atomic metadata updates, which could appeal to us. It
> > seems like migration in this case consists of implementing the catalog
> > and plopping the latest metadata into the backing store (see the
> > conditional-write sketch appended below). Are custom catalogs more of
> > an escape hatch when HMS can’t be used, or would that maybe be a
> > reasonable way forward if we find we don’t want to maintain + operate
> > on top of HMS?
> >
> > Apologies if this was discussed or documented somewhere else and I’ve
> > missed it.
> >
> > Thanks!
> >
> > Marko
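
---

A few sketches referenced above, appended for anyone following along. None of this is code from the thread.

First, John's suggestion of testing against DynamoDB Local: a minimal sketch of pointing a client at the local endpoint, assuming the AWS SDK for Java v1 and DynamoDB Local running on its default port 8000. The class name and credentials are made up; the local mock does not validate credentials.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class LocalDynamoClient {
  public static AmazonDynamoDB build() {
    // DynamoDB Local listens on port 8000 by default and accepts any credentials,
    // so dummy values are fine; the region is only used for request signing.
    return AmazonDynamoDBClientBuilder.standard()
        .withEndpointConfiguration(
            new AwsClientBuilder.EndpointConfiguration("http://localhost:8000", "us-east-1"))
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials("dummy-access-key", "dummy-secret-key")))
        .build();
  }
}
```

A client built this way can then be handed to whatever catalog code is under test, alongside an S3Mock-backed FileIO as John suggests.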
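Second, John's point about paging scans: DynamoDB returns at most 1 MB of items per Scan call, so listing has to loop on LastEvaluatedKey. A minimal sketch, again assuming the AWS SDK for Java v1; the table name `iceberg_catalog` and attribute `table_identifier` are hypothetical, not the schema John's team uses.

```java
import java.util.List;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class PagedCatalogScan {
  public static void main(String[] args) {
    AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
    Map<String, AttributeValue> lastKey = null;

    // Keep issuing Scan requests until no LastEvaluatedKey is returned,
    // i.e. the whole table has been paged through.
    do {
      ScanRequest request = new ScanRequest()
          .withTableName("iceberg_catalog")        // hypothetical catalog table name
          .withExclusiveStartKey(lastKey);
      ScanResult result = dynamo.scan(request);

      List<Map<String, AttributeValue>> items = result.getItems();
      items.forEach(item -> System.out.println(item.get("table_identifier")));

      lastKey = result.getLastEvaluatedKey();
    } while (lastKey != null && !lastKey.isEmpty());
  }
}
```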
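Third, on the custom-catalog option: the atomic metadata update is typically a compare-and-swap on the metadata pointer, e.g. a DynamoDB conditional update that only succeeds if the pointer still holds the metadata location the writer based its commit on. A hedged sketch of that idea only; table and attribute names are hypothetical, and a real catalog would wire this into a TableOperations implementation as described on the custom-catalog page rather than use a standalone class like this.

```java
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

public class MetadataPointerSwap {

  private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

  /**
   * Atomically swings the table's metadata pointer from expectedLocation to
   * newLocation. Returns false if another writer committed first.
   */
  public boolean commit(String tableIdentifier, String expectedLocation, String newLocation) {
    UpdateItemRequest request = new UpdateItemRequest()
        .withTableName("iceberg_catalog")                 // hypothetical catalog table name
        .addKeyEntry("table_identifier", new AttributeValue(tableIdentifier))
        .withUpdateExpression("SET metadata_location = :new")
        // The condition turns the update into a compare-and-swap: it only succeeds
        // if the pointer still holds the location this writer read before committing.
        .withConditionExpression("metadata_location = :expected")
        .withExpressionAttributeValues(Map.of(
            ":new", new AttributeValue(newLocation),
            ":expected", new AttributeValue(expectedLocation)));
    try {
      dynamo.updateItem(request);
      return true;
    } catch (ConditionalCheckFailedException e) {
      // Lost the race: a concurrent commit changed metadata_location first.
      return false;
    }
  }
}
```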