If we are to support appending manifest files, do we expect to expose ManifestWriter?
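To make that concrete, here is a rough sketch of the flow I have in mind, assuming a ManifestWriter-style API and a way to append manifests were exposed. The calls below (ManifestFiles.write, ManifestWriter.toManifestFile, AppendFiles.appendManifest) are my assumptions about what such an API could look like, not something that is exposed publicly today:

// Sketch only: assumes a per-task manifest writer and an appendManifest
// operation are publicly exposed; treat these calls as illustrative.
import org.apache.iceberg.{AppendFiles, DataFile, ManifestFile, ManifestFiles, Table}

// Each task writes the data files of its Spark partition into one manifest
// in the table's file system and returns only the small manifest metadata.
def writeTaskManifest(table: Table, files: Seq[DataFile], manifestLocation: String): ManifestFile = {
  val writer = ManifestFiles.write(table.spec(), table.io().newOutputFile(manifestLocation))
  try {
    files.foreach(f => writer.add(f))
  } finally {
    writer.close()
  }
  writer.toManifestFile()
}

// The driver collects the ManifestFile descriptions from all tasks and makes
// a single atomic commit that adds every manifest at once.
def commitManifests(table: Table, manifests: Seq[ManifestFile]): Unit = {
  val append: AppendFiles = table.newAppend()
  manifests.foreach(m => append.appendManifest(m))
  append.commit()
}

That way the heavy work (listing files, writing manifests) stays distributed across tasks, while the commit itself remains a single atomic operation on the driver.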
Also, one more question about migrating bucketed Spark tables. Am I correct it won’t work because of [1]? The bucketing field won’t be present in the partition values map, as bucket ids are encoded in file names, right?

[1] - https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala#L107
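To illustrate why I suspect this: if partition values are recovered by parsing the Hive-style partition path, then anything Spark encodes only in the file name (the bucket id ends up in names roughly like part-00000-<uuid>_00003.c000.snappy.parquet) can never appear in that map. A tiny, purely hypothetical helper (not the actual SparkTableUtil code) to show what I mean:

// Illustration only: a hypothetical parser for Hive-style partition paths.
// Anything not encoded in the directory path -- like Spark's bucket id, which
// only appears in the data file name -- can never show up in the resulting map.
def partitionPathToMap(partitionPath: String): Map[String, String] = {
  partitionPath
    .split("/")
    .filter(_.contains("="))
    .map { kv =>
      val Array(key, value) = kv.split("=", 2)
      key -> value
    }
    .toMap
}

// partitionPathToMap("warehouse/tbl/date=2019-05-20/hour=11")
//   == Map("date" -> "2019-05-20", "hour" -> "11"); no entry for the bucket column.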
> On 20 May 2019, at 11:17, Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>
> A few comments from me inline:
>
>> I think it is reasonable to make this a Spark job. The number of files in tables we convert typically requires it. This would only be too much for the driver if all of the files are collected at one time. We commit 500,000 files per batch, which seems to work well. That trades atomicity, though.
>
> Makes sense to me, just wanted to double-check that the “proper” API should still be based on a Spark job.
>
>> - What should be the scope? Do we want the migration to modify the catalog?
>>
>> What do you mean modify the catalog?
>>
>> We've built two SQL commands using these helper classes and methods: SNAPSHOT TABLE that creates a new table copy, and MIGRATE TABLE that basically builds a snapshot, but also renames the new table into place.
>
> By modifying catalog I mean creating/dropping tables in it. My question is whether this should be part of the migration tool/API or not. Right now, it isn’t. Instead, the catalog is modified by SNAPSHOT TABLE and MIGRATE TABLE commands that are using SparkTableUtil. I think it is reasonable to keep it this way, meaning that the scope of the migration tool is to find files and append them to an existing Iceberg table.
>
>> - If we stick to having Dataset[DataFiles], we will have one append per dataset partition. How do we want to guarantee that either all files are appended or none? One way is to rely on temporary tables but this requires looking into the catalog. Another approach is to say that if the migration isn’t successful, users will need to remove the metadata and retry.
>>
>> I don't recommend committing from tasks. If Spark runs speculative attempts, you can get duplicate data.
>
> That’s exactly my point. In addition, if one task fails, then we end up with partial results. However, it seems to be the only possible way right now. We can mitigate the effects by having temporary tables but this doesn’t solve the problem completely.
>
>> Another option is to add a way to append manifest files, not just data files. That way, each task produces a manifest and we can create a single commit to add all of the manifests. This requires some rewriting, but I think it would be a good way to distribute the majority of the work and still have an atomic commit.
>
> This sounds promising but requires quite some changes to the public API. I am wondering if we can group all files per Dataset partition and leverage ReplacePartitions/OverwriteFiles APIs in Iceberg. That way, the migration won’t be atomic but idempotent.
>
> Thanks,
> Anton
>
>> On 15 May 2019, at 19:12, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Replies inline:
>>
>> On Tue, May 14, 2019 at 3:21 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>> I would like to resume this topic. How do we see the proper API for migration?
>>
>> I have a couple of questions in mind:
>> - Now, it is based on a Spark job. Do we want to keep it that way because the number of files might be huge? Will it be too much for the driver?
>>
>> I think it is reasonable to make this a Spark job. The number of files in tables we convert typically requires it. This would only be too much for the driver if all of the files are collected at one time. We commit 500,000 files per batch, which seems to work well. That trades atomicity, though.
>>
>> - What should be the scope? Do we want the migration to modify the catalog?
>>
>> What do you mean modify the catalog?
>>
>> We've built two SQL commands using these helper classes and methods: SNAPSHOT TABLE that creates a new table copy, and MIGRATE TABLE that basically builds a snapshot, but also renames the new table into place.
>>
>> - If we stick to having Dataset[DataFiles], we will have one append per dataset partition. How do we want to guarantee that either all files are appended or none? One way is to rely on temporary tables but this requires looking into the catalog. Another approach is to say that if the migration isn’t successful, users will need to remove the metadata and retry.
>>
>> I don't recommend committing from tasks. If Spark runs speculative attempts, you can get duplicate data.
>>
>> Another option is to add a way to append manifest files, not just data files. That way, each task produces a manifest and we can create a single commit to add all of the manifests. This requires some rewriting, but I think it would be a good way to distribute the majority of the work and still have an atomic commit.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
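Coming back to the ReplacePartitions/OverwriteFiles idea above, a minimal sketch of the non-atomic but idempotent variant I meant: one commit per group of table partitions, so retrying a failed group replaces whatever an earlier attempt wrote instead of appending duplicates. The grouping and the driver/executor wiring are left out, and the helper name is just illustrative:

// Sketch only: commits one ReplacePartitions (dynamic overwrite) operation for
// a group of table partitions whose DataFiles have already been collected.
// Re-running the same group replaces the earlier files, so retries are idempotent.
import org.apache.iceberg.{DataFile, Table}

def commitPartitionGroup(table: Table, filesInGroup: Seq[DataFile]): Unit = {
  val replace = table.newReplacePartitions()
  filesInGroup.foreach(f => replace.addFile(f))
  replace.commit()
}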