Hi Marko,

The `CREATE EXTERNAL TABLE` command you mention below, run on top of an existing 
Iceberg table, will not transfer the "responsibility" of tracking the snapshot 
to HMS. It only creates an HMS external table which allows Hive queries to 
read the given table. If you want HMS to track the snapshot, then the table 
has to be created in HMS through the HiveCatalog in the first place.

What I would do is this:
1. Create a new Iceberg table in a catalog which supports concurrent writes 
(Hive/Hadoop/Custom).
2. Migrate the tables to the new catalog. Maybe there are already existing 
tools for this, or the snapshot files can be read and rewritten with some 
Java/Spark code (a rough sketch follows below). By my understanding you 
definitely do not have to rewrite the data files, just the snapshot files 
(and maybe the manifest files).
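
A minimal, untested Java sketch of the second step, assuming an Iceberg 
version where Catalog.registerTable is available (the metastore URI, table 
location and names below are placeholders):

```
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.hive.HiveCatalog;

public class MigrateToHiveCatalog {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Load the existing path-based (HadoopTables) table and find its latest
    // metadata file.
    Table existing = new HadoopTables(conf).load("s3://existing-table/");
    String latestMetadata =
        ((BaseTable) existing).operations().current().metadataFileLocation();

    // Register that metadata file under a Hive-backed catalog. Data files,
    // snapshots and table history are reused as-is; only the current-metadata
    // pointer starts being tracked by HMS.
    HiveCatalog hive = new HiveCatalog();
    hive.setConf(conf);
    hive.initialize("hive", Map.of("uri", "thrift://metastore-host:9083"));
    hive.registerTable(TableIdentifier.of("db", "existing_table"), latestMetadata);
  }
}
```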
Hope this helps,
Peter


> On Nov 19, 2020, at 21:29, John Clara <john.anthony.cl...@gmail.com> wrote:
> 
> Hi,
> 
> My team has been using the custom catalog along with atomic metadata updates 
> but we never migrated existing iceberg tables onto it. We also haven't turned 
> on integration with the hive catalog, so I'm not sure how easy it is to plug 
> in there (I think there was some recent work on that?). Dynamo provides a 
> local mock which you could combine with s3mock (check iceberg tests) to test 
> it out: 
> https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
> 
> The only weird things we've run into with dynamo are:
> 1. It seems like we get rate limited by dynamo pretty hard when first writing 
> to a new table, until the limits are adjusted (potentially by aws dynamically 
> adjusting dynamo's internal partitions?).
> 2. Make sure to page scans if you have a lot of values when doing lists (we 
> haven't enabled catalog listing yet, but we've run into this before); a rough 
> paging loop is sketched below.
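> 
> A minimal paging loop (AWS SDK for Java v1, untested sketch; the dynamo 
> table name is a placeholder) looks roughly like this:
> 
> ```
> import java.util.Map;
> import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
> import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
> import com.amazonaws.services.dynamodbv2.model.AttributeValue;
> import com.amazonaws.services.dynamodbv2.model.ScanRequest;
> import com.amazonaws.services.dynamodbv2.model.ScanResult;
> 
> public class PagedCatalogScan {
>   public static void main(String[] args) {
>     AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
>     Map<String, AttributeValue> startKey = null;
>     do {
>       // Each Scan call returns at most 1 MB of data; keep requesting pages
>       // until LastEvaluatedKey is absent, or entries will be silently missed.
>       ScanResult page = dynamo.scan(new ScanRequest()
>           .withTableName("iceberg_catalog")        // placeholder table name
>           .withExclusiveStartKey(startKey));
>       page.getItems().forEach(System.out::println);
>       startKey = page.getLastEvaluatedKey();
>     } while (startKey != null);
>   }
> }
> ```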
> 
> We chose dynamo because we were using it for other use cases. I'm not sure if 
> it's the best aws-provided option for atomic changes.
> 
> John
> 
> On 11/19/20 10:07 AM, Marko Babic wrote:
>> Hi everyone,
>> 
>> At my org we’ve spun up a few Iceberg tables on top of S3 without a 
>> metastore (conscious of the consequences) and we’ve arrived at the point 
>> that we need to support concurrent writes. :) I was hoping to get some 
>> advice as to what the best way to integrate an existing Iceberg table into a 
>> Hive Metastore or an alternative might be. We’re still relatively early in 
>> our adoption of Iceberg and have no real prior experience with Hive so I 
>> don’t know what I don’t know.
>> 
>> Some options we’re weighing:
>> 
>>   - Existing tables aren’t so big that the moral equivalent of "CREATE TABLE 
>> hive.db.table … AS SELECT * FROM hadoop.table" is out of the question, but 
>> we’d prefer to not have to read + rewrite everything. We also have stateful 
>> readers (tracking which snapshots they have previously read) and preserving 
>> table history would make life easier.
>> 
>>   - Doing something along the lines of the following and importing the 
>> tables into Hive as external tables looks like it should work given my 
>> understanding of how Iceberg uses HMS, but I don’t know if it’s encouraged 
>> and I haven’t done the due diligence to understand the potential consequences:
>> 
>> ```
>> hive> CREATE EXTERNAL TABLE `existing_table` (...)
>> LOCATION
>>   's3://existing-table/'
>> -- serde, input/output formats omitted
>> TBLPROPERTIES (
>>   -- Assuming latest metadata file for Hadoop table is v99.metadata.json, rename
>>   -- it to 00099-uuid.metadata.json so that BaseMetastoreTableOperations can
>>   -- correctly parse the version number.
>>   'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
>>   'table_type'='ICEBERG'
>> )
>> ```
>> 
>>   - Others seem to have had success implementing + maintaining a custom 
>> catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g. DynamoDB 
>> for atomic metadata updates, which could appeal to us. Seems like migration 
>> in this case consists of implementing the catalog and plopping the latest 
>> metadata into the backing store (the conditional write is sketched below). 
>> Are custom catalogs more of an escape hatch when HMS can’t be used, or would 
>> that maybe be a reasonable way forward if we find we don’t want to maintain 
>> + operate on top of HMS?
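>> 
>> (An untested sketch, with the AWS SDK for Java v1, of what that conditional 
>> write could look like; the dynamo table and attribute names are placeholders, 
>> not anything from the Iceberg docs. In a custom catalog this would roughly be 
>> the body of the TableOperations commit.)
>> 
>> ```
>> import java.util.HashMap;
>> import java.util.Map;
>> import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
>> import com.amazonaws.services.dynamodbv2.model.AttributeValue;
>> import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
>> import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
>> 
>> public class DynamoMetadataSwap {
>>   /** Atomically swap the metadata pointer; fails if someone else committed first. */
>>   static void swapMetadataLocation(AmazonDynamoDB dynamo, String tableName,
>>       String expectedLocation, String newLocation) {
>>     Map<String, AttributeValue> key = new HashMap<>();
>>     key.put("table_name", new AttributeValue(tableName));
>> 
>>     Map<String, AttributeValue> values = new HashMap<>();
>>     values.put(":new", new AttributeValue(newLocation));
>>     values.put(":expected", new AttributeValue(expectedLocation));
>> 
>>     try {
>>       // The condition turns the update into a compare-and-swap: it only
>>       // succeeds if the stored pointer still matches the metadata this
>>       // commit was based on.
>>       dynamo.updateItem(new UpdateItemRequest()
>>           .withTableName("iceberg_catalog")          // placeholder dynamo table
>>           .withKey(key)
>>           .withUpdateExpression("SET metadata_location = :new")
>>           .withConditionExpression("metadata_location = :expected")
>>           .withExpressionAttributeValues(values));
>>     } catch (ConditionalCheckFailedException e) {
>>       throw new RuntimeException("Concurrent update to " + tableName, e);
>>     }
>>   }
>> }
>> ```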
>> 
>> Apologies if this was discussed or documented somewhere else and I’ve missed 
>> it.
>> 
>> Thanks!
>> 
>> Marko
