I definitely support this idea. Having a clean and reliable API to migrate existing Spark tables to Iceberg will be helpful. I propose to collect all requirements for the new API in this thread. Then I can come up with a doc that we will discuss within the community.

From the feature perspective, I think it is important to support both tables that persist their partition information in HMS and tables that derive it from the folder structure. Migrating just a single partition of a table would also be useful.
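
To make the folder-based case concrete, below is a rough sketch of what my local prototype does. This is simplified for illustration, and the inferPartitions helper is a made-up name rather than the actual code:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

object PartitionInference {

  // Hypothetical helper: derive the partitions of a path-based table from
  // its folder structure instead of asking the Hive metastore.
  def inferPartitions(spark: SparkSession, tableLocation: String): Unit = {
    val rootPath = new Path(tableLocation)

    // InMemoryFileIndex recursively lists the table location and infers the
    // partition columns and values from directory names such as
    // date=2019-03-18/hour=10.
    val fileIndex = new InMemoryFileIndex(spark, Seq(rootPath), Map.empty[String, String], None)
    val spec = fileIndex.partitionSpec()

    println(s"partition columns: ${spec.partitionColumns.fieldNames.mkString(", ")}")
    spec.partitions.foreach { partition =>
      // partition.path points at a partition directory whose data files would
      // be appended to the Iceberg table; partition.values holds one value
      // per partition column.
      println(s"path: ${partition.path}, values: ${partition.values}")
    }
  }
}

One thing to keep in mind: InMemoryFileIndex lives in org.apache.spark.sql.execution.datasources, which is technically Spark-internal, so a proper API design should take that dependency into account.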

> On 18 Mar 2019, at 18:28, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> I think that would be fine, but I want to throw out a quick warning:
> SparkTableUtil was initially written as a few handy helpers, so it wasn't
> well designed as an API. It's really useful, so I can understand wanting to
> extend it. But should we come up with a real API for these conversion tasks
> instead of updating the hacks?
>
> On Mon, Mar 18, 2019 at 11:11 AM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
> Hi,
>
> SparkTableUtil can be helpful for migrating existing Spark tables into
> Iceberg. Right now, SparkTableUtil assumes that the partition information is
> always tracked in Hive metastore.
>
> What about extending SparkTableUtil to handle Spark tables that don’t rely on
> Hive metastore? I have a local prototype that makes use of Spark
> InMemoryFileIndex to infer the partitioning info.
>
> Thanks,
> Anton
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix