I definitely support this idea. Having a clean and reliable API to migrate existing Spark tables to Iceberg will be helpful. I propose to collect all requirements for the new API in this thread. Then I can come up with a doc that we will discuss within the community.

From the feature perspective, I think it is important to support both tables that persist their partition information in HMS and tables that derive it from the folder structure. Migrating just a single partition of a table would also be useful.
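
To make the folder-based case concrete, below is a rough sketch of what my local prototype does. This is simplified for illustration, and the inferPartitions helper is a made-up name rather than the actual code:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

object PartitionInference {

  // Hypothetical helper: derive the partitions of a path-based table from
  // its folder structure instead of asking the Hive metastore.
  def inferPartitions(spark: SparkSession, tableLocation: String): Unit = {
    val rootPath = new Path(tableLocation)

    // InMemoryFileIndex recursively lists the table location and infers the
    // partition columns and values from directory names such as
    // date=2019-03-18/hour=10.
    val fileIndex = new InMemoryFileIndex(spark, Seq(rootPath), Map.empty[String, String], None)
    val spec = fileIndex.partitionSpec()

    println(s"partition columns: ${spec.partitionColumns.fieldNames.mkString(", ")}")
    spec.partitions.foreach { partition =>
      // partition.path points at a partition directory whose data files would
      // be appended to the Iceberg table; partition.values holds one value
      // per partition column.
      println(s"path: ${partition.path}, values: ${partition.values}")
    }
  }
}

One thing to keep in mind: InMemoryFileIndex lives in org.apache.spark.sql.execution.datasources, which is technically Spark-internal, so a proper API design should take that dependency into account.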

> On 18 Mar 2019, at 18:28, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> I think that would be fine, but I want to throw out a quick warning:
> SparkTableUtil was initially written as a few handy helpers, so it wasn't
> well designed as an API. It's really useful, so I can understand wanting to
> extend it. But should we come up with a real API for these conversion tasks
> instead of updating the hacks?
>
> On Mon, Mar 18, 2019 at 11:11 AM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
> Hi,
>
> SparkTableUtil can be helpful for migrating existing Spark tables into
> Iceberg. Right now, SparkTableUtil assumes that the partition information is
> always tracked in Hive metastore.
>
> What about extending SparkTableUtil to handle Spark tables that don’t rely on
> Hive metastore? I have a local prototype that makes use of Spark
> InMemoryFileIndex to infer the partitioning info.
>
> Thanks,
> Anton
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix