Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-06-03 Thread Ryan Blue
I opened a PR for appending manifests: https://github.com/apache/incubator-iceberg/pull/201 …
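
A minimal sketch of how committing a prebuilt manifest could look with Iceberg's AppendFiles API; the method names below exist in the Iceberg Java API, but the exact surface added in the linked PR may differ.

    // Sketch: committing a prebuilt manifest through AppendFiles.
    // Assumes an existing Table handle and a ManifestFile produced elsewhere.
    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.ManifestFile;
    import org.apache.iceberg.Table;

    public class AppendManifestExample {
      static void appendManifest(Table table, ManifestFile manifest) {
        AppendFiles append = table.newAppend();  // start an append-only operation
        append.appendManifest(manifest);         // add the prebuilt manifest as-is
        append.commit();                         // produces a new snapshot
      }
    }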

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-06-03 Thread Ryan Blue
Yes, we will need to expose ManifestWriter, but only the methods that work with DataFile, because we only need to support append. Unfortunately, these manifests will need to be rewritten because they don't have the correct snapshot ID in the file metadata; that ID is only set in the final commit. …
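
A minimal sketch of writing a manifest of DataFiles, assuming the ManifestFiles/ManifestWriter surface of the current Iceberg Java API; at the time of this message the writer was not yet public, so the API that was eventually exposed may differ.

    // Sketch: write DataFile entries into a manifest (append-only use case).
    import java.io.IOException;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.ManifestFile;
    import org.apache.iceberg.ManifestFiles;
    import org.apache.iceberg.ManifestWriter;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.io.OutputFile;

    public class ManifestWriteExample {
      static ManifestFile writeManifest(PartitionSpec spec, OutputFile out,
                                        Iterable<DataFile> files) throws IOException {
        ManifestWriter<DataFile> writer = ManifestFiles.write(spec, out);
        try {
          for (DataFile file : files) {
            writer.add(file);  // only DataFile methods are needed for append
          }
        } finally {
          writer.close();      // finalizes the manifest file
        }
        // The snapshot ID in the manifest metadata is only known at commit time,
        // which is why the manifest has to be rewritten then (see above).
        return writer.toManifestFile();
      }
    }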

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-06-03 Thread Anton Okolnychyi
If we are to support appending manifest files, do we expect to expose ManifestWriter? Also, one more question about migrating bucketed Spark tables. Am I correct that it won’t work because of [1]? The bucketing field won’t be present in the partition values map, as bucket ids are encoded in file names …
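
For context, a small sketch of why this matters: in Iceberg, bucketing is an ordinary partition transform, so every data file has to carry its bucket id as a partition value in metadata, whereas Spark/Hive bucketing only encodes the bucket id in the file name. The schema and field names below are made up for illustration.

    // Sketch: an Iceberg partition spec with a bucket transform.
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.types.Types;

    public class BucketSpecExample {
      static PartitionSpec bucketedSpec() {
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "data", Types.StringType.get()));
        // 16-way bucketing on "id": each DataFile must report its "id_bucket" value,
        // which a bucket id encoded only in a Spark file name cannot provide.
        return PartitionSpec.builderFor(schema).bucket("id", 16).build();
      }
    }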

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-05-20 Thread Anton Okolnychyi
A few comments from me inline: > I think it is reasonable to make this a Spark job. The number of files in tables we convert typically requires it. This would only be too much for the driver if all of the files are collected at one time. We commit 500,000 files per batch, which seems to work …
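
A minimal sketch of the batched-commit pattern quoted above, where files are committed in fixed-size batches (e.g. 500,000) so the driver never holds the full listing at once. How the DataFile iterator is produced (e.g. by a Spark job in SparkTableUtil) is left abstract here.

    // Sketch: commit imported DataFiles in batches, one append snapshot per batch.
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.iceberg.AppendFiles;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    public class BatchedImportExample {
      static void importInBatches(Table table, Iterator<DataFile> files, int batchSize) {
        List<DataFile> batch = new ArrayList<>(batchSize);
        while (files.hasNext()) {
          batch.add(files.next());
          if (batch.size() == batchSize) {
            commitBatch(table, batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          commitBatch(table, batch);
        }
      }

      private static void commitBatch(Table table, List<DataFile> batch) {
        AppendFiles append = table.newAppend();
        for (DataFile file : batch) {
          append.appendFile(file);  // register each data file in this snapshot
        }
        append.commit();
      }
    }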

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-05-15 Thread Ryan Blue
Replies inline: On Tue, May 14, 2019 at 3:21 AM Anton Okolnychyi wrote: > I would like to resume this topic. How do we see the proper API for migration? I have a couple of questions in mind: - Now, it is based on a Spark job. Do we want to keep it that way because the number of files may be large? …

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-05-14 Thread Anton Okolnychyi

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-20 Thread Ryan Blue

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-19 Thread Sandeep Nayak

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-19 Thread Ryan Blue

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-18 Thread Sandeep Nayak

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-18 Thread Xabriel Collazo Mojica

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-18 Thread Anton Okolnychyi
I definitely support this idea. Having a clean and reliable API to migrate existing Spark tables to Iceberg will be helpful. I propose to collect all requirements for the new API in this thread. Then I can come up with a doc that we will discuss within the community. From the feature perspective, …

Re: Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-18 Thread Ryan Blue
I think that would be fine, but I want to throw out a quick warning: SparkTableUtil was initially written as a few handy helpers, so it wasn't well designed as an API. It's really useful, so I can understand wanting to extend it. But should we come up with a real API for these conversion tasks instead …
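
To make the question concrete, an entirely hypothetical sketch of what a dedicated conversion API could look like; neither this interface nor its method names exist in Iceberg.

    // Hypothetical only: a purpose-built migration API instead of ad hoc helpers.
    import java.util.Iterator;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Table;

    interface TableMigration {
      // Discover data files of the source (non-Iceberg) table, e.g. via a Spark job.
      Iterator<DataFile> discoverDataFiles();

      // Commit the discovered files into the target Iceberg table, possibly in batches.
      void commitTo(Table target);
    }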

Extend SparkTableUtil to Handle Tables Not Tracked in Hive Metastore

2019-03-18 Thread Anton Okolnychyi
Hi, SparkTableUtil can be helpful for migrating existing Spark tables into Iceberg. Right now, SparkTableUtil assumes that the partition information is always tracked in the Hive metastore. What about extending SparkTableUtil to handle Spark tables that don’t rely on the Hive metastore? I have a local …
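
A minimal sketch of the filesystem-based partition discovery this would require, using plain Hadoop FileSystem listing instead of a metastore call; it only illustrates the idea in the proposal and is not necessarily how SparkTableUtil would do it.

    // Sketch: treat every leaf directory under the table root as one Hive-style
    // partition (e.g. .../date=2019-03-18/hour=02), without consulting the metastore.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListPartitionsExample {
      static List<Path> listPartitionDirs(Path tableRoot, Configuration conf) throws IOException {
        FileSystem fs = tableRoot.getFileSystem(conf);
        List<Path> leaves = new ArrayList<>();
        collectLeaves(fs, tableRoot, leaves);
        return leaves;
      }

      private static void collectLeaves(FileSystem fs, Path dir, List<Path> leaves) throws IOException {
        boolean hasSubDirs = false;
        for (FileStatus status : fs.listStatus(dir)) {
          if (status.isDirectory()) {
            hasSubDirs = true;
            collectLeaves(fs, status.getPath(), leaves);
          }
        }
        if (!hasSubDirs) {
          leaves.add(dir);  // a directory containing only files is a leaf partition
        }
      }
    }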