Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

Xabriel Collazo Mojica Mon, 18 Mar 2019 15:20:32 -0700

+1 for having a tool/API to migrate tables from HMS into Iceberg.

We do not use HMS in my current project, but since HMS is the de facto catalog 
in most companies doing Hadoop, I think such a tool would be vital for 
incentivizing Iceberg adoption and/or PoCs.

Xabriel J Collazo Mojica  |  Senior Software Engineer  |  Adobe  |  
xcoll...@adobe.com

From: <aokolnyc...@apple.com> on behalf of Anton Okolnychyi 
<aokolnyc...@apple.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>
Date: Monday, March 18, 2019 at 2:22 PM
To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, Ryan Blue 
<rb...@netflix.com>
Subject: Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive 
Metastore

I definitely support this idea. Having a clean and reliable API to migrate 
existing Spark tables to Iceberg will be helpful.
I propose to collect all requirements for the new API in this thread. Then I 
can come up with a doc that we will discuss within the community.

From a feature perspective, I think it would be important to support tables 
that persist partition information in HMS as well as tables that derive 
partition information from the folder structure. Migrating just a single 
partition of a table would also be useful.
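
To make the folder-structure case concrete, here is a minimal sketch of reading partition values from Hive-style key=value directory names. It is not SparkTableUtil or any Spark/Iceberg API: the function names, the example paths, and the percent-decoding of values are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote


def partition_values(file_path, table_root):
    """Return the key=value directory segments between table_root and the file.

    Hypothetical helper: assumes Hive-style layout such as
    <root>/date=2019-03-18/hour=15/part-00000.parquet
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    values = {}
    # Every component except the final file name should be a partition directory.
    for segment in relative.parts[:-1]:
        key, sep, value = segment.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value partition directory: {segment!r}")
        # Assumption: special characters in values are percent-encoded on disk.
        values[key] = unquote(value)
    return values


def group_files_by_partition(file_paths, table_root):
    """Map each distinct partition (as a tuple of key/value pairs) to its files."""
    groups = {}
    for path in file_paths:
        key = tuple(partition_values(path, table_root).items())
        groups.setdefault(key, []).append(path)
    return groups
```

With a grouping like this, migrating a single partition would amount to selecting one key from the result and registering only its files.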

On 18 Mar 2019, at 18:28, Ryan Blue 
<rb...@netflix.com.INVALID> wrote:

I think that would be fine, but I want to throw out a quick warning: 
SparkTableUtil was initially written as a few handy helpers, so it wasn't well 
designed as an API. It's really useful, so I can understand wanting to extend 
it. But should we come up with a real API for these conversion tasks instead of 
updating the hacks?

On Mon, Mar 18, 2019 at 11:11 AM Anton Okolnychyi 
<aokolnyc...@apple.com.invalid> wrote:
Hi,

SparkTableUtil can be helpful for migrating existing Spark tables into Iceberg. 
Right now, SparkTableUtil assumes that the partition information is always 
tracked in the Hive metastore.

What about extending SparkTableUtil to handle Spark tables that don’t rely on 
the Hive metastore? I have a local prototype that uses Spark's 
InMemoryFileIndex to infer the partitioning info.

Thanks,
Anton

--
Ryan Blue
Software Engineer
Netflix
