Hi Peter. Thanks for responding.

> The command you mention below: `CREATE EXTERNAL TABLE` above an existing
> Iceberg table will not transfer the "responsibility" of tracking the
> snapshot to HMS. It only creates a HMS external table ...

So my understanding is that the HiveCatalog is basically just using HMS as an
atomically updatable pointer to a metadata file (excepting recent work to make
Iceberg tables queryable _from_ Hive, which we won't be doing). So what I'm
doing with that command is mimicking the DDL for a HiveCatalog-created table,
which sets up Iceberg tables as external tables in HMS
<https://github.com/apache/iceberg/blob/f8c68ebcb4e35db5d7f5ccb8e20d53df3abdf8b1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L230-L292>,
and manually updating the `metadata_location` table property to point to the
latest metadata file of the existing table that I want to integrate with HMS.
Updating the metadata pointer, along with obviously updating all
readers/writers to load the table via the HiveCatalog, seems to be all I need
to do to make that work, but I'm just naively dipping my toes in here and
could absolutely be missing something. E.g. I figured out I'd have to rename
the latest metadata file from the existing table I want to integrate with HMS
so that BaseMetastoreTableOperations could parse the version number, but only
realized later that I'd have to rename _all_ the old metadata files + twiddle
the metadata log entries to use the updated names.
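To make that concrete, the rename + metadata-log rewrite I'm picturing is
something like the following (untested sketch, editing the metadata JSON
directly with Jackson against a local copy of s3://old_table/metadata/; the
class name and local-copy workflow are just for illustration):

```
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.File;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetadataRenamer {
  private static final ObjectMapper MAPPER = new ObjectMapper();
  // Hadoop tables name their metadata files v<version>.metadata.json
  private static final Pattern HADOOP_STYLE = Pattern.compile("v(\\d+)\\.metadata\\.json");

  public static void main(String[] args) throws Exception {
    // Working on a local copy of s3://old_table/metadata/
    File metadataDir = new File(args[0]);

    // 1. Map each Hadoop-style name to a HiveCatalog-style name:
    //    v99.metadata.json -> 00099-<uuid>.metadata.json
    Map<String, String> renames = new TreeMap<>();
    for (File f : metadataDir.listFiles((dir, name) -> HADOOP_STYLE.matcher(name).matches())) {
      Matcher m = HADOOP_STYLE.matcher(f.getName());
      m.matches(); // required before group()
      int version = Integer.parseInt(m.group(1));
      renames.put(f.getName(), String.format("%05d-%s.metadata.json", version, UUID.randomUUID()));
    }

    // 2. Rewrite each file's metadata-log entries to use the new names, then
    //    write the file back out under its own new name. Manifest/data paths
    //    are absolute and unchanged, so nothing else should need to move.
    for (Map.Entry<String, String> e : renames.entrySet()) {
      ObjectNode root = (ObjectNode) MAPPER.readTree(new File(metadataDir, e.getKey()));
      JsonNode log = root.get("metadata-log");
      if (log != null && log.isArray()) {
        for (JsonNode entry : log) {
          String oldPath = entry.get("metadata-file").asText();
          for (Map.Entry<String, String> r : renames.entrySet()) {
            if (oldPath.endsWith("/" + r.getKey())) {
              // Keep the directory prefix (and its trailing slash), swap the file name
              ((ObjectNode) entry).put("metadata-file",
                  oldPath.substring(0, oldPath.length() - r.getKey().length()) + r.getValue());
            }
          }
        }
      }
      MAPPER.writerWithDefaultPrettyPrinter().writeValue(new File(metadataDir, e.getValue()), root);
    }
  }
}
```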
> What I would do is this: ...

Makes sense; the external table creation + metadata pointer mangling is my
attempt to do basically this, but I'm not confident I know everything that
needs to go into making step 2 happen. :) The following is what I'm thinking:

- Given an existing Hadoop table on top of S3 at s3://old_table/, create a new
table with the same schema + partition spec via HiveCatalog.
- Parse the metadata files from the old table and update them to be
HiveCatalog-compatible: all I'd be updating is the metadata file names + the
metadata log, as described above.
- Write the updated metadata files to s3://old_table/metadata/. Update the new
table in HMS to point to the latest updated metadata file, and update the
table location to point to s3://old_table/ (see the sketch after this list).
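For that last step I was picturing a one-off pointer update straight through
the metastore client, roughly like this (db/table names are placeholders; note
it's a bare alter_table with no Iceberg commit lock, which I'm assuming is
fine for a one-time cutover while all writers are paused):

```
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class PointerUpdate {
  public static void main(String[] args) throws Exception {
    // Picks up hive-site.xml / hive.metastore.uris from the classpath
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    try {
      Table table = client.getTable("db", "new_table"); // placeholders
      // BaseMetastoreTableOperations resolves the current metadata via this property
      table.getParameters().put("metadata_location",
          "s3://old_table/metadata/00099-uuid.metadata.json");
      table.getSd().setLocation("s3://old_table/");
      client.alter_table("db", "new_table", table);
    } finally {
      client.close();
    }
  }
}
```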
I could alternatively `aws s3 sync` the data files from the old table to the
new one, rewrite all the old metadata + snapshot manifest lists + manifest
files to point to the new data directory, and leave s3://old_table/ untouched,
but I guess that's a decision I'd make once I'm into things and have a better
sense of what'd be less error-prone.

Thanks again!

Marko

On Fri, Nov 20, 2020 at 12:39 AM Peter Vary <pv...@cloudera.com.invalid> wrote:

> Hi Marko,
>
> The command you mention below: `CREATE EXTERNAL TABLE` above an existing
> Iceberg table will not transfer the "responsibility" of tracking the
> snapshot to HMS. It only creates a HMS external table which will allow Hive
> queries to read the given table. If you want to track the snapshot in the
> HMS then you have to originally create the table in HMS using HiveCatalog.
>
> What I would do is this:
>
> 1. Create a new Iceberg table in a catalog which supports concurrent
> writes (Hive/Hadoop/Custom)
> 2. Migrate the tables to the new catalog. Maybe there are some already
> existing tools there, or with some java/spark code the snapshot files can
> be read and rewritten. By my understanding you definitely do not have to
> rewrite the data files, just the snapshot files (and maybe the manifest
> files)
>
> Hope this helps,
> Peter
>
>
> On Nov 19, 2020, at 21:29, John Clara <john.anthony.cl...@gmail.com> wrote:
>
> Hi,
>
> My team has been using the custom catalog along with atomic metadata
> updates, but we never migrated existing iceberg tables onto it. We also
> haven't turned on integration with the hive catalog, so I'm not sure how
> easy it is to plug in there (I think there was some recent work on that?).
> Dynamo provides a local mock which you could combine with s3mock (check
> iceberg tests) to test it out:
> https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
>
> The only weird things we've run into with dynamo are:
> 1. it seems like we get rate limited by dynamo pretty hard when first
> writing to a new table until rate limits are adjusted (potentially by aws
> dynamically adjusting dynamo's internal partitions?)
> 2. make sure to page scans if you have a lot of values when doing lists
> (we haven't enabled catalog listing yet, but we've run into this before)
>
> We chose dynamo because we were using it for other use cases. I'm not sure
> if it's the best aws-provided option for atomic changes.
>
> John
>
> On 11/19/20 10:07 AM, Marko Babic wrote:
>
> Hi everyone,
>
> At my org we’ve spun up a few Iceberg tables on top of S3 without a
> metastore (conscious of the consequences) and we’ve arrived at the point
> that we need to support concurrent writes. :) I was hoping to get some
> advice as to what the best way might be to integrate an existing Iceberg
> table into a Hive Metastore, or what an alternative might look like. We’re
> still relatively early in our adoption of Iceberg and have no real prior
> experience with Hive, so I don’t know what I don’t know.
>
> Some options we’re weighing:
>
> - Existing tables aren’t so big that the moral equivalent of "CREATE
> TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the
> question, but we’d prefer not to have to read + rewrite everything. We also
> have stateful readers (tracking which snapshots they have previously read),
> and preserving table history would make life easier.
>
> - Doing something along the lines of the following and importing the
> tables into Hive as external tables looks like it should work given my
> understanding of how Iceberg is using HMS, but I don’t know if it’s
> encouraged and I haven’t done the diligence to understand potential
> consequences:
>
> ```
> hive> CREATE EXTERNAL TABLE `existing_table` (...)
> LOCATION 's3://existing-table/'
> -- serde, input/output formats omitted
> TBLPROPERTIES (
> -- Assuming the latest metadata file for the Hadoop table is v99.metadata.json,
> -- rename it to 00099-uuid.metadata.json so that BaseMetastoreTableOperations
> -- can correctly parse the version number.
> 'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
> 'table_type'='ICEBERG'
> )
> ```
>
> - Others seem to have had success implementing + maintaining a custom
> catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g.
> DynamoDB for atomic metadata updates, which could appeal to us. Seems like
> migration in this case consists of implementing the catalog and plopping
> the latest metadata into the backing store. Are custom catalogs more of an
> escape hatch for when HMS can’t be used, or would that maybe be a
> reasonable way forward if we find we don’t want to maintain + operate on
> top of HMS?
>
> Apologies if this was discussed or documented somewhere else and I’ve
> missed it.
>
> Thanks!
>
> Marko