FYI, I would avoid adopting HMS just because you need a catalog. While
the HMS Iceberg catalog is mature, you'd be adopting something (HMS) that
carries a lot of baggage. I'd look at the other catalogs that are up and
coming if you can.

For example, Nessie (projectnessie.org) was built to provide a cloud-native
approach to Iceberg transaction arbitration (along with some other nifty
features around cross-table transactions and git semantics) so that people
who work in the cloud but don't use the Hive metastore don't have to start.
Nessie's HA complexity, scaling dynamics, and overall operational load are
targeted to be a fraction of what HMS's are.

Full disclosure, I work on Nessie.

Food for thought, anyway.

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Fri, Nov 20, 2020 at 10:58 AM Marko Babic <ma...@narrative.io.invalid>
wrote:

> Hi Peter. Thanks for responding.
>
> > The command you mention below: `CREATE EXTERNAL TABLE` above an existing
> Iceberg table will not transfer the "responsibility" of tracking the
> snapshot to HMS. It only creates a HMS external table ...
>
> So my understanding is that the HiveCatalog is basically just using HMS as
> an atomically updateable pointer to a metadata file (excepting recent work
> to make Iceberg tables queryable _from_ Hive, which we won't be doing). So
> what I'm doing with that command is mimicking the DDL for a
> HiveCatalog-created table, which sets up Iceberg tables as external tables
> in HMS
> <https://github.com/apache/iceberg/blob/f8c68ebcb4e35db5d7f5ccb8e20d53df3abdf8b1/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L230-L292>,
> and manually updating the `metadata_location` table property to point to
> the latest metadata file for the existing table that I want to integrate
> with HMS. Updating the metadata pointer, along with obviously updating all
> readers/writers to load the table via the HiveCatalog, seems to be all I
> need to do to make that work, but I'm just naively dipping my toes in here
> and could absolutely be missing something. E.g., I figured out that I'd have
> to rename the latest metadata file from the existing table so that
> BaseMetastoreTableOperations could parse the version number, but only
> realized later that I'd have to rename _all_ the old metadata files and
> twiddle the metadata log entries to use the updated names.
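>
> To make the reader/writer-side change concrete, this is roughly all I
> expect our jobs to have to do differently -- a sketch assuming the
> HiveCatalog(Configuration) constructor from Iceberg 0.10 and a placeholder
> metastore URI:
>
> ```
> import org.apache.hadoop.conf.Configuration;
> import org.apache.iceberg.Table;
> import org.apache.iceberg.catalog.TableIdentifier;
> import org.apache.iceberg.hive.HiveCatalog;
>
> public class LoadViaHiveCatalog {
>   public static void main(String[] args) {
>     // Instead of HadoopTables.load("s3://old_table/"), resolve the table
>     // through HMS so commits go through the atomic metadata_location swap.
>     Configuration conf = new Configuration();
>     conf.set("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder
>     HiveCatalog catalog = new HiveCatalog(conf);
>     Table table = catalog.loadTable(TableIdentifier.of("db", "existing_table"));
>     System.out.println(table.currentSnapshot());
>   }
> }
> ```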
>
> > What I would do is this: ...
>
> Makes sense, the external table creation + metadata pointer mangling is my
> attempt to do basically this, but I'm not confident I know everything that
> needs to go into making step 2 happen. :)
>
> The following is what I'm thinking:
>
> - Given an existing Hadoop table on top of S3 at s3://old_table/, create a
> new table with the same schema + partition spec via the HiveCatalog.
> - Parse the metadata files from the old table and update them to be
> HiveCatalog-compatible: all I'd be updating is the metadata file names + the
> metadata log, as described above (rough sketch below).
> - Write the updated metadata files to s3://old_table/metadata/. Update the
> new table in HMS to point to the latest updated metadata file, and update
> the table location to point to s3://old_table/.
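>
> As a rough, untested sketch of that second step (file renames plus the
> metadata log rewrite), something like the following is what I have in mind.
> The "metadata-log"/"metadata-file" keys are the ones I see in the metadata
> JSON; everything else here is made up for illustration:
>
> ```
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.fasterxml.jackson.databind.node.ObjectNode;
>
> import java.util.Map;
> import java.util.UUID;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class MetadataRenamer {
>   // HadoopCatalog names metadata files v<version>.metadata.json.
>   private static final Pattern HADOOP_NAME =
>       Pattern.compile("v(\\d+)\\.metadata\\.json");
>
>   // v99.metadata.json -> 00099-<random uuid>.metadata.json, the shape
>   // BaseMetastoreTableOperations parses the version number from.
>   static String hiveStyleName(String hadoopName) {
>     Matcher m = HADOOP_NAME.matcher(hadoopName);
>     if (!m.matches()) {
>       throw new IllegalArgumentException("Unexpected name: " + hadoopName);
>     }
>     return String.format("%05d-%s.metadata.json",
>         Integer.parseInt(m.group(1)), UUID.randomUUID());
>   }
>
>   // Rewrite the metadata-log entries of one metadata file so the pointers
>   // to previous files use the new names. `renames` maps old name -> new.
>   static String rewriteMetadataLog(String metadataJson,
>       Map<String, String> renames) throws Exception {
>     ObjectMapper mapper = new ObjectMapper();
>     ObjectNode root = (ObjectNode) mapper.readTree(metadataJson);
>     JsonNode log = root.get("metadata-log");
>     if (log != null) {
>       for (JsonNode entry : log) {
>         String old = entry.get("metadata-file").asText();
>         int slash = old.lastIndexOf('/') + 1;
>         String newName = renames.get(old.substring(slash));
>         if (newName != null) {
>           ((ObjectNode) entry).put("metadata-file",
>               old.substring(0, slash) + newName);
>         }
>       }
>     }
>     return mapper.writerWithDefaultPrettyPrinter().writeValueAsString(root);
>   }
> }
> ```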
>
> I could alternatively `aws s3 sync` data files from the old table to the
> new one, rewrite all the old metadata + snapshot manifest lists + manifest
> files to point to the new data directory, and leave s3://old_table/
> untouched, but I guess that's a decision I'd make once I'm into things and
> have a better sense of what'd be less error-prone.
>
> Thanks again!
>
> Marko
>
>
> On Fri, Nov 20, 2020 at 12:39 AM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Marko,
>>
>> The command you mention below, `CREATE EXTERNAL TABLE` above an existing
>> Iceberg table, will not transfer the "responsibility" of tracking the
>> snapshot to HMS. It only creates an HMS external table, which will allow Hive
>> queries to read the given table. If you want to track the snapshot in the
>> HMS, then you have to create the table in HMS using the HiveCatalog from the
>> start.
>>
>> What I would do is this:
>>
>>    1. Create a new Iceberg table in a catalog which supports concurrent
>>    writes (Hive/Hadoop/Custom)
>>    2. Migrate the tables to the new catalog. Maybe there are some
>>    existing tools for this, or the snapshot files can be read and rewritten
>>    with some Java/Spark code. To my understanding, you definitely do not
>>    have to rewrite the data files, just the snapshot files (and maybe the
>>    manifest files).
>>
>> Hope this helps,
>> Peter
>>
>>
>> On Nov 19, 2020, at 21:29, John Clara <john.anthony.cl...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> My team has been using the custom catalog along with atomic metadata
>> updates, but we never migrated existing Iceberg tables onto it. We also
>> haven't turned on integration with the Hive catalog, so I'm not sure how
>> easy it is to plug in there (I think there was some recent work on
>> that?). Dynamo provides a local mock which you could combine with s3mock
>> (check iceberg tests) to test it out:
>> https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
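>>
>> If it helps, pointing a client at the local mock in tests looks roughly
>> like this -- a sketch assuming AWS SDK v2 and DynamoDB Local's default
>> port, with dummy credentials:
>>
>> ```
>> import java.net.URI;
>> import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
>> import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
>> import software.amazon.awssdk.regions.Region;
>> import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
>>
>> public class LocalDynamoClient {
>>   // DynamoDB Local listens on localhost:8000 by default; credentials and
>>   // region are required by the SDK but ignored by the mock.
>>   static DynamoDbClient create() {
>>     return DynamoDbClient.builder()
>>         .endpointOverride(URI.create("http://localhost:8000"))
>>         .region(Region.US_EAST_1)
>>         .credentialsProvider(StaticCredentialsProvider.create(
>>             AwsBasicCredentials.create("dummy", "dummy")))
>>         .build();
>>   }
>> }
>> ```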
>>
>>
>> The only weird things we've run into with dynamo are:
>> 1. It seems like we get rate limited by dynamo pretty hard when first
>> writing to a new table, until the rate limits are adjusted (potentially by
>> aws dynamically adjusting dynamo's internal partitions?).
>> 2. Make sure to page your scans if you have a lot of values when doing lists
>> (we haven't enabled catalog listing yet, but we've run into this before) --
>> sketch below.
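>>
>> For (2), the paging loop we mean is just the standard
>> lastEvaluatedKey/exclusiveStartKey dance -- a sketch with AWS SDK v2 and a
>> made-up table name:
>>
>> ```
>> import java.util.Map;
>> import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
>> import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
>> import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
>> import software.amazon.awssdk.services.dynamodb.model.ScanResponse;
>>
>> public class PagedScan {
>>   // Scan returns at most 1 MB per page; keep following lastEvaluatedKey
>>   // until it comes back empty or you'll silently miss items.
>>   static void scanAll(DynamoDbClient dynamo, String table) {
>>     Map<String, AttributeValue> startKey = null;
>>     do {
>>       ScanRequest.Builder req = ScanRequest.builder().tableName(table);
>>       if (startKey != null) {
>>         req.exclusiveStartKey(startKey);
>>       }
>>       ScanResponse page = dynamo.scan(req.build());
>>       page.items().forEach(System.out::println);
>>       startKey = page.hasLastEvaluatedKey() ? page.lastEvaluatedKey() : null;
>>     } while (startKey != null);
>>   }
>> }
>> ```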
>>
>> We chose dynamo because we were already using it for other use cases. I'm
>> not sure if it's the best AWS-provided option for atomic changes.
>>
>> John
>>
>> On 11/19/20 10:07 AM, Marko Babic wrote:
>>
>> Hi everyone,
>>
>> At my org we’ve spun up a few Iceberg tables on top of S3 without a
>> metastore (conscious of the consequences) and we’ve arrived at the point
>> where we need to support concurrent writes. :) I was hoping to get some
>> advice on the best way to integrate an existing Iceberg table into a Hive
>> Metastore, or what an alternative might be. We’re still relatively early in
>> our adoption of Iceberg and have no real prior experience with Hive, so I
>> don’t know what I don’t know.
>>
>> Some options we’re weighing:
>>
>>   - Existing tables aren’t so big that the moral equivalent of "CREATE
>> TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the
>> question, but we’d prefer not to have to read + rewrite everything. We also
>> have stateful readers (which track which snapshots they have previously
>> read), and preserving table history would make life easier.
>>
>>   - Doing something along the lines of the following and importing the
>> tables into Hive as external tables looks like it should work given my
>> understanding of how Iceberg uses HMS, but I don’t know if it’s
>> encouraged and I haven’t done the due diligence to understand the potential
>> consequences:
>>
>> ```
>> hive> CREATE EXTERNAL TABLE `existing_table` (...)
>> LOCATION
>>   's3://existing-table/'
>> -- serde, input/output formats omitted
>> TBLPROPERTIES (
>>   -- Assuming the latest metadata file for the Hadoop table is
>>   -- v99.metadata.json, rename it to 00099-uuid.metadata.json
>>   -- so that BaseMetastoreTableOperations can correctly parse the
>>   -- version number.
>>   'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
>>   'table_type'='ICEBERG'
>> )
>> ```
>>
>>   - Others seem to have had success implementing + maintaining a custom
>> catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g. DynamoDB for
>> atomic metadata updates, which could appeal to us. Seems like migration in
>> this case consists of implementing the catalog and plopping the latest
>> metadata into the backing store. Are custom catalogs more of an escape
>> hatch when HMS can’t be used, or would that maybe be a reasonable way
>> forward if we find we don’t want to maintain + operate on top of HMS?
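>>
>> For what it’s worth, the commit path of such a catalog looks manageable
>> from my reading of the custom-catalog doc -- a hypothetical sketch
>> extending BaseMetastoreTableOperations, with a made-up Dynamo table
>> ("iceberg_catalog", key "table_id", attribute "metadata_location") and
>> AWS SDK v2:
>>
>> ```
>> import java.util.Map;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.iceberg.BaseMetastoreTableOperations;
>> import org.apache.iceberg.TableMetadata;
>> import org.apache.iceberg.exceptions.CommitFailedException;
>> import org.apache.iceberg.hadoop.HadoopFileIO;
>> import org.apache.iceberg.io.FileIO;
>> import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
>> import software.amazon.awssdk.services.dynamodb.model.*;
>>
>> class DynamoTableOperations extends BaseMetastoreTableOperations {
>>   private static final String CATALOG_TABLE = "iceberg_catalog"; // made up
>>   private final DynamoDbClient dynamo;
>>   private final String tableId; // e.g. "db.existing_table"
>>
>>   DynamoTableOperations(DynamoDbClient dynamo, String tableId) {
>>     this.dynamo = dynamo;
>>     this.tableId = tableId;
>>   }
>>
>>   @Override
>>   protected String tableName() {
>>     return tableId;
>>   }
>>
>>   @Override
>>   public FileIO io() {
>>     // As in the docs example; S3 via HadoopFileIO is one option.
>>     return new HadoopFileIO(new Configuration());
>>   }
>>
>>   @Override
>>   protected void doRefresh() {
>>     GetItemResponse resp = dynamo.getItem(GetItemRequest.builder()
>>         .tableName(CATALOG_TABLE)
>>         .key(Map.of("table_id", AttributeValue.builder().s(tableId).build()))
>>         .consistentRead(true)
>>         .build());
>>     String location =
>>         resp.hasItem() ? resp.item().get("metadata_location").s() : null;
>>     refreshFromMetadataLocation(location);
>>   }
>>
>>   @Override
>>   protected void doCommit(TableMetadata base, TableMetadata metadata) {
>>     String newLocation = writeNewMetadata(metadata, currentVersion() + 1);
>>     try {
>>       UpdateItemRequest.Builder update = UpdateItemRequest.builder()
>>           .tableName(CATALOG_TABLE)
>>           .key(Map.of("table_id", AttributeValue.builder().s(tableId).build()))
>>           .updateExpression("SET metadata_location = :new");
>>       if (base == null) {
>>         // First commit: only succeed if nobody has created the table yet.
>>         update.conditionExpression("attribute_not_exists(table_id)")
>>             .expressionAttributeValues(Map.of(
>>                 ":new", AttributeValue.builder().s(newLocation).build()));
>>       } else {
>>         // Atomic pointer swap: only succeed if the stored location still
>>         // matches the metadata this commit was based on.
>>         update.conditionExpression("metadata_location = :expected")
>>             .expressionAttributeValues(Map.of(
>>                 ":new", AttributeValue.builder().s(newLocation).build(),
>>                 ":expected",
>>                 AttributeValue.builder().s(base.metadataFileLocation()).build()));
>>       }
>>       dynamo.updateItem(update.build());
>>     } catch (ConditionalCheckFailedException e) {
>>       throw new CommitFailedException(e, "Concurrent commit to %s", tableId);
>>     }
>>   }
>> }
>> ```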
>>
>> Apologies if this was discussed or documented somewhere else and I’ve
>> missed it.
>>
>> Thanks!
>>
>> Marko
>>
>>
>>
