Hi Peter. Thanks for responding.

> The command you mention below: `CREATE EXTERNAL TABLE` above an existing
Iceberg table will not transfer the "responsibility" of tracking the
snapshot to HMS. It only creates a HMS external table ...

So my understanding is that the HiveCatalog basically just uses HMS as an
atomically updatable pointer to a metadata file (setting aside the recent
work to make Iceberg tables queryable _from_ Hive, which we won't be
doing). What I'm doing with that command is mimicking the DDL for a
HiveCatalog-created table, which sets up Iceberg tables as external tables
in HMS
<https://github.com/apache/iceberg/blob/f8c68ebcb4e35db5d7f5ccb8e20d53df3abdf8b1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L230-L292>,
and then manually updating the `metadata_location` table property to point
at the latest metadata file of the existing table I want to integrate with
HMS. Updating the metadata pointer, along with (obviously) updating all
readers/writers to load the table via the HiveCatalog, seems to be all I
need to do to make that work, but I'm just naively dipping my toes in here
and could absolutely be missing something. E.g. I figured out I'd have to
rename the latest metadata file from the existing table so that
BaseMetastoreTableOperations could parse the version number, but only
realized later that I'd have to rename _all_ the old metadata files and
rewrite the metadata log entries to use the updated names.
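
To make the renaming concrete, here's a rough sketch of the name mapping I
mean (plain Java, nothing Iceberg-specific; the exact uuid suffix is my
assumption and shouldn't matter as long as BaseMetastoreTableOperations can
parse the leading version digits):

```
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetadataFileNames {
  // HadoopTableOperations writes vN.metadata.json; metastore-style names
  // look like <zero-padded-version>-<uuid>.metadata.json
  private static final Pattern HADOOP_STYLE =
      Pattern.compile("v(\\d+)\\.metadata\\.json");

  // e.g. v99.metadata.json -> 00099-<random-uuid>.metadata.json
  static String toMetastoreStyle(String hadoopName) {
    Matcher m = HADOOP_STYLE.matcher(hadoopName);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a Hadoop-style name: " + hadoopName);
    }
    int version = Integer.parseInt(m.group(1));
    return String.format("%05d-%s.metadata.json", version, UUID.randomUUID());
  }
}
```

The metadata-log twiddling would then just be rewriting each metadata-log
entry in the JSON to use the mapped file name.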

> What I would do is this: ...

Makes sense. The external table creation + metadata-pointer mangling is my
attempt to do basically this, but I'm not confident I know everything that
needs to go into making step 2 happen. :)

The following is what I'm thinking:

- Given an existing Hadoop table on top of S3 at s3://old_table/, create a
new table with the same schema + partition spec via HiveCatalog.
- Parse the metadata files from the old table and update them to be
HiveCatalog-compatible: all I'd be changing is the metadata file names +
the metadata log, as described above.
- Write the updated metadata files to s3://old_table/metadata/. Update the
new table in HMS to point to the latest updated metadata file, and update
the table location to point to s3://old_table/ (rough sketch of this step
below).
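
For that last step I'm imagining something like the following (hedged
sketch going straight at the metastore with the Hive client; the db/table
names are placeholders and I haven't verified Iceberg is happy with a
pointer swapped in this way):

```
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class RepointTable {
  public static void main(String[] args) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    try {
      Table table = client.getTable("db", "new_table");
      // point the HMS entry at the rewritten, latest metadata file ...
      table.getParameters().put("metadata_location",
          "s3://old_table/metadata/00099-uuid.metadata.json");
      // ... and at the old table's root so no data has to move
      table.getSd().setLocation("s3://old_table/");
      client.alter_table("db", "new_table", table);
    } finally {
      client.close();
    }
  }
}
```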

I could alternatively `aws s3 sync` data files from the old table to the
new one, rewrite all the old metadata + snapshot manifest lists + manifest
files to point to the new data directory, and leave s3://old_table/
untouched, but I guess that's a decision I'd make once I'm into things and
have a better sense of what'd be less error-prone.
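
If I go that route, a first pass that just lists every data file path the
rewritten metadata would have to repoint seems like a sane sanity check.
A rough, read-only sketch (using the Iceberg Java API as I understand it;
it only covers the current snapshot and doesn't touch the manifests
themselves):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class ListDataFilePaths {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration()).load("s3://old_table/");
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        // data file paths in the current snapshot; older snapshots
        // would need the same treatment
        System.out.println(task.file().path());
      }
    }
  }
}
```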

Thanks again!

Marko


On Fri, Nov 20, 2020 at 12:39 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Marko,
>
> The command you mention below: `CREATE EXTERNAL TABLE` above an existing
> Iceberg table will not transfer the "responsibility" of tracking the
> snapshot to HMS. It only creates a HMS external table which will allow Hive
> queries to read the given table. If you want to track the snapshot in the
> HMS then you have to originally create a table in HMS using HiveCatalog.
>
> What I would do is this:
>
>    1. Create a new Iceberg table in a catalog which supports concurrent
>    writes (Hive/Hadoop/Custom)
>    2. Migrate the tables to the new catalog. Maybe there are some already
>    existing tools there, or with some java/spark code the snapshot files can
>    be read and rewritten. By my understanding you definitely do not have to
>    rewrite the data files, just the snapshot files (and maybe the manifest
>    files)
>
> Hope this helps,
> Peter
>
>
> On Nov 19, 2020, at 21:29, John Clara <john.anthony.cl...@gmail.com>
> wrote:
>
> Hi,
>
> My team has been using the custom catalog along with atomic metadata
> updates but we never migrated existing iceberg tables onto it. We also
> haven't turned on integration with the hive catalog, so I'm not sure how
> easy it is to plug in there (I think there was some recent work on that?).
> Dynamo provides a local mock which you could combine with s3mock (check
> iceberg tests) to test it out:
> https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
>
>
> The only weird things we've run into with dynamo are:
> 1. it seems like we get rate limited by dynamo pretty hard when first
> writing to a new table, until the rate limits are adjusted (potentially by
> aws dynamically adjusting dynamo's internal partitions?)
> 2. make sure to page scans if you have a lot of values when doing lists
> (we haven't enabled catalog listing yet, but we've run into this before)
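>
> For reference, paging a scan looks roughly like this (AWS SDK v1 shape;
> the catalog table name is a placeholder):
>
> ```
> import java.util.Map;
> import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
> import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
> import com.amazonaws.services.dynamodbv2.model.AttributeValue;
> import com.amazonaws.services.dynamodbv2.model.ScanRequest;
> import com.amazonaws.services.dynamodbv2.model.ScanResult;
>
> public class PagedScan {
>   public static void main(String[] args) {
>     AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
>     ScanRequest request = new ScanRequest().withTableName("iceberg_catalog");
>     ScanResult result;
>     do {
>       result = client.scan(request);
>       for (Map<String, AttributeValue> item : result.getItems()) {
>         System.out.println(item);
>       }
>       // a null last-evaluated key means the scan is complete
>       request.setExclusiveStartKey(result.getLastEvaluatedKey());
>     } while (result.getLastEvaluatedKey() != null);
>   }
> }
> ```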
>
> We chose dynamo because we were using it for other usecases. I'm not sure
> if it's the best aws provided option for atomic changes.
>
> John
>
> On 11/19/20 10:07 AM, Marko Babic wrote:
>
> Hi everyone,
>
> At my org we’ve spun up a few Iceberg tables on top of S3 without a
> metastore (conscious of the consequences) and we’ve arrived at the point
> that we need to support concurrent writes. :) I was hoping to get some
> advice as to what the best way to integrate an existing Iceberg table into
> a Hive Metastore or an alternative might be. We’re still relatively early
> in our adoption of Iceberg and have no real prior experience with Hive so I
> don’t know what I don’t know.
>
> Some options we’re weighing:
>
>   - Existing tables aren’t so big that the moral equivalent of "CREATE
> TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the
> question, but we’d prefer to not have to read + rewrite everything. We also
> have stateful readers (tracking which snapshots they have previously read)
> and preserving table history would make life easier.
>
>   - Doing something along the lines of the following and importing the
> tables into Hive as external tables looks like it should work given my
> understanding of how Iceberg is using HMS, but I don’t know if it’s
> encouraged and I haven’t done diligence to understand potential
> consequences:
>
> ```
> hive> CREATE EXTERNAL TABLE `existing_table` (...)
> LOCATION
>   's3://existing-table/'
> -- serde, input/output formats omitted
> TBLPROPERTIES (
>   -- Assuming the latest metadata file for the Hadoop table is
>   -- v99.metadata.json, rename it to 00099-uuid.metadata.json so that
>   -- BaseMetastoreTableOperations can correctly parse the version number.
>   'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
>   'table_type'='ICEBERG'
> )
> ```
>
>   - Others seem to have had success implementing + maintaining a custom
> catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g. DynamoDB for
> atomic metadata updates, which could appeal to us. Seems like migration in
> this case consists of implementing the catalog and plopping the latest
> metadata into the backing store. Are custom catalogs more of an escape
> hatch when HMS can’t be used, or would that maybe be a reasonable way
> forward if we find we don’t want to maintain + operate on top of HMS?
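>
> To make "atomic metadata updates" concrete, I'm picturing something like
> a conditional put as the commit primitive (AWS SDK v1; the iceberg_tables
> table and attribute names are made up, and in a real custom catalog this
> would presumably live in the TableOperations commit path):
>
> ```
> import java.util.HashMap;
> import java.util.Map;
> import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
> import com.amazonaws.services.dynamodbv2.model.AttributeValue;
> import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
> import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
>
> public class CasPointer {
>   // returns true if we won the race, false if a concurrent writer
>   // committed first
>   static boolean swapMetadataLocation(AmazonDynamoDB dynamo, String table,
>       String expectedLocation, String newLocation) {
>     Map<String, AttributeValue> item = new HashMap<>();
>     item.put("table_name", new AttributeValue(table));
>     item.put("metadata_location", new AttributeValue(newLocation));
>
>     Map<String, AttributeValue> values = new HashMap<>();
>     values.put(":expected", new AttributeValue(expectedLocation));
>     try {
>       // succeeds only if the stored pointer still matches what this
>       // commit was based on
>       dynamo.putItem(new PutItemRequest()
>           .withTableName("iceberg_tables")
>           .withItem(item)
>           .withConditionExpression("metadata_location = :expected")
>           .withExpressionAttributeValues(values));
>       return true;
>     } catch (ConditionalCheckFailedException e) {
>       return false;
>     }
>   }
> }
> ```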
>
> Apologies if this was discussed or documented somewhere else and I’ve
> missed it.
>
> Thanks!
>
> Marko
>
>
>
