Hi John,

Thanks for the experience report and pointers to resources. :) If we do end up going down that road it'll be super helpful.
Marko

On Thu, Nov 19, 2020 at 12:29 PM John Clara <john.anthony.cl...@gmail.com> wrote:

> Hi,
>
> My team has been using the custom catalog along with atomic metadata
> updates, but we never migrated existing iceberg tables onto it. We also
> haven't turned on integration with the hive catalog, so I'm not sure how
> easy it is to plug in there (I think there was some recent work on
> that?). Dynamo provides a local mock which you could combine with s3mock
> (check the iceberg tests) to test it out (see the client sketch appended
> below):
>
> https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
>
> The only weird things we've run into with dynamo are:
> 1. It seems like we get rate limited by dynamo pretty hard when first
> writing to a new table, until rate limits are adjusted (potentially by
> aws dynamically adjusting dynamo's internal partitions?).
> 2. Make sure to page your scans if you have a lot of values when doing
> lists (we haven't enabled catalog listing yet, but we've run into this
> before; see the paging sketch appended below).
>
> We chose dynamo because we were already using it for other use cases.
> I'm not sure if it's the best aws-provided option for atomic changes.
>
> John
>
> On 11/19/20 10:07 AM, Marko Babic wrote:
> > Hi everyone,
> >
> > At my org we’ve spun up a few Iceberg tables on top of S3 without a
> > metastore (conscious of the consequences) and we’ve arrived at the
> > point that we need to support concurrent writes. :) I was hoping to
> > get some advice as to what the best way to integrate an existing
> > Iceberg table into a Hive Metastore or an alternative might be. We’re
> > still relatively early in our adoption of Iceberg and have no real
> > prior experience with Hive, so I don’t know what I don’t know.
> >
> > Some options we’re weighing:
> >
> > - Existing tables aren’t so big that the moral equivalent of "CREATE
> > TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the
> > question, but we’d prefer not to have to read + rewrite everything. We
> > also have stateful readers (tracking which snapshots they have
> > previously read), and preserving table history would make life easier.
> >
> > - Doing something along the lines of the following and importing the
> > tables into Hive as external tables looks like it should work given my
> > understanding of how Iceberg is using HMS, but I don’t know if it’s
> > encouraged and I haven’t done the diligence to understand potential
> > consequences:
> >
> > ```
> > hive> CREATE EXTERNAL TABLE `existing_table` (...)
> >     LOCATION 's3://existing-table/'
> >     -- serde, input/output formats omitted
> >     TBLPROPERTIES (
> >       -- Assuming the latest metadata file for the Hadoop table is
> >       -- v99.metadata.json, rename it to 00099-uuid.metadata.json so that
> >       -- BaseMetastoreTableOperations can correctly parse the version number.
> >       'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
> >       'table_type'='ICEBERG'
> >     );
> > ```
> >
> > - Others seem to have had success implementing + maintaining a custom
> > catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g.
> > DynamoDB for atomic metadata updates, which could appeal to us. It
> > seems like migration in this case consists of implementing the catalog
> > and plopping the latest metadata into the backing store (see the
> > conditional-write sketch appended below). Are custom catalogs more of
> > an escape hatch when HMS can’t be used, or would that maybe be a
> > reasonable way forward if we find we don’t want to maintain + operate
> > on top of HMS?
> >
> > Apologies if this was discussed or documented somewhere else and I’ve
> > missed it.
> >
> > Thanks!
> >
> > Marko
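
---

A few sketches referenced above, appended for anyone following along. None of this is code from the thread.

First, John's suggestion of testing against DynamoDB Local: a minimal sketch of pointing a client at the local endpoint, assuming the AWS SDK for Java v1 and DynamoDB Local running on its default port 8000. The class name and credentials are made up; the local mock does not validate credentials.

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class LocalDynamoClient {
  public static AmazonDynamoDB build() {
    // DynamoDB Local listens on port 8000 by default and accepts any credentials,
    // so dummy values are fine; the region is only used for request signing.
    return AmazonDynamoDBClientBuilder.standard()
        .withEndpointConfiguration(
            new AwsClientBuilder.EndpointConfiguration("http://localhost:8000", "us-east-1"))
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials("dummy-access-key", "dummy-secret-key")))
        .build();
  }
}
```

A client built this way can then be handed to whatever catalog code is under test, alongside an S3Mock-backed FileIO as John suggests.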
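Second, John's point about paging scans: DynamoDB returns at most 1 MB of items per Scan call, so listing has to loop on LastEvaluatedKey. A minimal sketch, again assuming the AWS SDK for Java v1; the table name `iceberg_catalog` and attribute `table_identifier` are hypothetical, not the schema John's team uses.

```java
import java.util.List;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class PagedCatalogScan {
  public static void main(String[] args) {
    AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
    Map<String, AttributeValue> lastKey = null;

    // Keep issuing Scan requests until no LastEvaluatedKey is returned,
    // i.e. the whole table has been paged through.
    do {
      ScanRequest request = new ScanRequest()
          .withTableName("iceberg_catalog")        // hypothetical catalog table name
          .withExclusiveStartKey(lastKey);
      ScanResult result = dynamo.scan(request);

      List<Map<String, AttributeValue>> items = result.getItems();
      items.forEach(item -> System.out.println(item.get("table_identifier")));

      lastKey = result.getLastEvaluatedKey();
    } while (lastKey != null && !lastKey.isEmpty());
  }
}
```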
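Third, on the custom-catalog option: the atomic metadata update is typically a compare-and-swap on the metadata pointer, e.g. a DynamoDB conditional update that only succeeds if the pointer still holds the metadata location the writer based its commit on. A hedged sketch of that idea only; table and attribute names are hypothetical, and a real catalog would wire this into a TableOperations implementation as described on the custom-catalog page rather than use a standalone class like this.

```java
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

public class MetadataPointerSwap {

  private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

  /**
   * Atomically swings the table's metadata pointer from expectedLocation to
   * newLocation. Returns false if another writer committed first.
   */
  public boolean commit(String tableIdentifier, String expectedLocation, String newLocation) {
    UpdateItemRequest request = new UpdateItemRequest()
        .withTableName("iceberg_catalog")                 // hypothetical catalog table name
        .addKeyEntry("table_identifier", new AttributeValue(tableIdentifier))
        .withUpdateExpression("SET metadata_location = :new")
        // The condition turns the update into a compare-and-swap: it only succeeds
        // if the pointer still holds the location this writer read before committing.
        .withConditionExpression("metadata_location = :expected")
        .withExpressionAttributeValues(Map.of(
            ":new", new AttributeValue(newLocation),
            ":expected", new AttributeValue(expectedLocation)));
    try {
      dynamo.updateItem(request);
      return true;
    } catch (ConditionalCheckFailedException e) {
      // Lost the race: a concurrent commit changed metadata_location first.
      return false;
    }
  }
}
```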