I opened a PR for appending manifests: https://github.com/apache/incubator-iceberg/pull/201
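
For illustration, building a manifest from already-collected DataFile entries might look roughly like the sketch below. This is a sketch only, assuming a ManifestWriter-style API along the lines of what the thread below discusses; ManifestFiles.write, toManifestFile, and the writeManifest helper are illustrative names, not necessarily the exact surface the PR exposes.

    import org.apache.iceberg.{DataFile, ManifestFile, ManifestFiles, Table}

    // Sketch only: write collected DataFile entries into a single new manifest.
    // Only DataFile appends are needed for migration, matching the note below
    // about exposing just the append-related parts of ManifestWriter.
    def writeManifest(table: Table, files: Seq[DataFile], location: String): ManifestFile = {
      val writer = ManifestFiles.write(table.spec(), table.io().newOutputFile(location))
      try {
        files.foreach(f => writer.add(f))
      } finally {
        writer.close()            // the writer must be closed before reading its result
      }
      writer.toManifestFile()     // handle that a later commit can append
    }
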
On Mon, Jun 3, 2019 at 12:32 PM Ryan Blue <rb...@netflix.com> wrote:

> Yes, we will need to expose ManifestWriter, but only the methods that
> work with DataFile, because we only need to support append.
>
> Unfortunately, these manifests will need to be rewritten because they
> don't have the correct snapshot ID in the file metadata; that is set in
> the final commit.
>
> Iceberg doesn't support Spark bucketing because they use different
> bucket functions. We will need to extend Spark to allow using a custom
> partition function to get bucketed joins working.
>
> On Mon, Jun 3, 2019 at 12:18 PM Anton Okolnychyi <aokolnyc...@apple.com>
> wrote:
>
>> If we are to support appending manifest files, do we expect to expose
>> ManifestWriter?
>>
>> Also, one more question about migrating bucketed Spark tables. Am I
>> correct it won’t work because of [1]? The bucketing field won’t be
>> present in the partition values map, as bucket ids are encoded in file
>> names, right?
>>
>> [1] - https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala#L107
>>
>> On 20 May 2019, at 11:17, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>
>> A few comments from me inline:
>>
>> I think it is reasonable to make this a Spark job. The number of files
>> in tables we convert typically requires it. This would only be too much
>> for the driver if all of the files are collected at one time. We commit
>> 500,000 files per batch, which seems to work well. That trades
>> atomicity, though.
>>
>> Makes sense to me, just wanted to double-check that the “proper” API
>> should still be based on a Spark job.
>>
>> - What should be the scope? Do we want the migration to modify the
>> catalog?
>>
>> What do you mean modify the catalog?
>>
>> We've built two SQL commands using these helper classes and methods:
>> SNAPSHOT TABLE that creates a new table copy, and MIGRATE TABLE that
>> basically builds a snapshot, but also renames the new table into place.
>>
>> By modifying the catalog I mean creating/dropping tables in it. My
>> question is whether this should be part of the migration tool/API or
>> not. Right now, it isn’t. Instead, the catalog is modified by the
>> SNAPSHOT TABLE and MIGRATE TABLE commands that use SparkTableUtil. I
>> think it is reasonable to keep it this way, meaning that the scope of
>> the migration tool is to find files and append them to an existing
>> Iceberg table.
>>
>> - If we stick to having Dataset[DataFiles], we will have one append per
>> dataset partition. How do we want to guarantee that either all files
>> are appended or none? One way is to rely on temporary tables, but this
>> requires looking into the catalog. Another approach is to say that if
>> the migration isn’t successful, users will need to remove the metadata
>> and retry.
>>
>> I don't recommend committing from tasks. If Spark runs speculative
>> attempts, you can get duplicate data.
>>
>> That’s exactly my point. In addition, if one task fails, then we end up
>> with partial results. However, it seems to be the only possible way
>> right now. We can mitigate the effects by having temporary tables, but
>> this doesn’t solve the problem completely.
>>
>> Another option is to add a way to append manifest files, not just data
>> files. That way, each task produces a manifest and we can create a
>> single commit to add all of the manifests. This requires some
>> rewriting, but I think it would be a good way to distribute the
>> majority of the work and still have an atomic commit.
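
As a rough sketch of that last idea, once the manifests produced by the tasks have been collected back to the driver, the final step could be a single append that adds them all at once. appendManifest follows the direction of the PR linked at the top of the thread; the commitManifests helper is an illustrative name, not an existing API.

    import org.apache.iceberg.{ManifestFile, Table}

    // Sketch only: one atomic commit that adds every per-task manifest.
    def commitManifests(table: Table, manifests: Seq[ManifestFile]): Unit = {
      val append = table.newAppend()
      manifests.foreach(m => append.appendManifest(m))
      // As noted above, the manifests are rewritten as part of the commit so
      // their metadata carries the snapshot ID that is only known at commit time.
      append.commit()
    }
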
>> This sounds promising but requires quite some changes to the public
>> API. I am wondering if we can group all files per Dataset partition and
>> leverage the ReplacePartitions/OverwriteFiles APIs in Iceberg. That
>> way, the migration won’t be atomic but idempotent.
>>
>> Thanks,
>> Anton
>>
>> On 15 May 2019, at 19:12, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Replies inline:
>>
>> On Tue, May 14, 2019 at 3:21 AM Anton Okolnychyi <aokolnyc...@apple.com>
>> wrote:
>>
>> I would like to resume this topic. How do we see the proper API for
>> migration?
>>
>> I have a couple of questions in mind:
>>
>> - Now, it is based on a Spark job. Do we want to keep it that way
>> because the number of files might be huge? Will it be too much for the
>> driver?
>>
>> I think it is reasonable to make this a Spark job. The number of files
>> in tables we convert typically requires it. This would only be too much
>> for the driver if all of the files are collected at one time. We commit
>> 500,000 files per batch, which seems to work well. That trades
>> atomicity, though.
>>
>> - What should be the scope? Do we want the migration to modify the
>> catalog?
>>
>> What do you mean modify the catalog?
>>
>> We've built two SQL commands using these helper classes and methods:
>> SNAPSHOT TABLE that creates a new table copy, and MIGRATE TABLE that
>> basically builds a snapshot, but also renames the new table into place.
>>
>> - If we stick to having Dataset[DataFiles], we will have one append per
>> dataset partition. How do we want to guarantee that either all files
>> are appended or none? One way is to rely on temporary tables but this
>> requires looking into the catalog. Another approach is to say that if
>> the migration isn’t successful, users will need to remove the metadata
>> and retry.
>>
>> I don't recommend committing from tasks. If Spark runs speculative
>> attempts, you can get duplicate data.
>>
>> Another option is to add a way to append manifest files, not just data
>> files. That way, each task produces a manifest and we can create a
>> single commit to add all of the manifests. This requires some
>> rewriting, but I think it would be a good way to distribute the
>> majority of the work and still have an atomic commit.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
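
For comparison, the idempotent (but not atomic) alternative Anton raises earlier in the thread, grouping files per table partition and committing each group with ReplacePartitions, might look roughly like the sketch below. The grouping and retry logic around it is left out, and the commitPartitionGroup helper is an illustrative name rather than an existing API.

    import org.apache.iceberg.{DataFile, Table}

    // Sketch only: commit one partition group with a dynamic-overwrite style
    // operation. A retry after a failure replaces the same partitions instead
    // of duplicating files, which is what makes each per-group commit idempotent.
    def commitPartitionGroup(table: Table, filesInGroup: Seq[DataFile]): Unit = {
      val replace = table.newReplacePartitions()
      filesInGroup.foreach(f => replace.addFile(f))
      replace.commit()
    }
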