Hi all, thank you very much for the progress so far. I believe this is very helpful for DR and other use cases that require copying a table from place to place. I can see we now support rewriting table paths in the metadata files; do you have a plan for the next step toward a fuller integration (such as copying a table end to end)? Thanks
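[Editorial note: to make the metadata path rewriting mentioned above concrete, here is a minimal sketch of the prefix replacement at the heart of the "rebuild the metadata" approach discussed in this thread. The names here (`PathRewriter`, `rewrite`) are hypothetical, not the API of the linked PRs, and a real implementation must apply this to every path in metadata.json, manifest lists, and manifests, and rewrite those files themselves.]

```java
// Sketch of the core path-rewriting step behind a copy-table action:
// every file path under the source table location gets its prefix
// swapped for the target location. Illustrative names only.
public class PathRewriter {

    /**
     * Replace the source location prefix with the target prefix.
     * Paths outside the source location are returned unchanged.
     */
    public static String rewrite(String path, String sourcePrefix, String targetPrefix) {
        if (path.startsWith(sourcePrefix)) {
            return targetPrefix + path.substring(sourcePrefix.length());
        }
        return path;
    }
}
```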
On Thu, Aug 15, 2024 at 2:13 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Sorry for the late reply.
>
> > I was wondering if we also want to support the use case of moving tables in this proposal?
>
> Pucheng, yes, we could use the action to move tables.
>
> Hi Sumedh, here are my answers to your questions:
>
> > Should the copied table be registered in the same catalog as the source table, or copied into a different catalog for the destination table?
>
> It is fine to register the table within the same catalog with different table identifiers, as well as different table UUIDs, if your tools count on it.
>
> > Are we shooting for perfect query reproducibility for time-travel queries across the source and copied table? I.e., is the snapshot chain on the source table maintained on the copied table?
>
> The action will support this. That said, it is also acceptable to copy from the middle of the snapshot history, as it is common that users don't care about certain table history. Overall, users need to make that decision themselves.
>
> > Is this a one-time copy action, or is this something we can run on a schedule, i.e., as new data is written to the source table, incremental deltas (appends, updates, deletes) will be copied?
>
> It will support incremental copy so that you don't have to copy the whole table every time; copying the whole table every time isn't practical due to the large volume.
>
> These answers are also covered in the goals section of this design doc:
> https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit#heading=h.97m5uqimprde
>
> > Has the community considered an approach where the scheme and cluster are minted by the catalog, to be used in the respective FileIO implementation for the blob stores.
> > For example, if we had a bucket foo in us-east and a bucket bar in us-west, the catalog running in us-east would mint s3://foo, the catalog running in us-west would mint s3://bar, and the S3FileIO would join that with the rest of the relative path to the object. This would allow us to capture the path relative to s3://<bucket-name> in the Iceberg metadata?
>
> This is similar to an S3 access point, https://aws.amazon.com/s3/features/access-points/. You can use it as an alternative if all your table storage locations are in S3.
>
> Yufei
>
> On Fri, Jul 12, 2024 at 10:09 AM Sumedh Sakdeo <ssak...@linkedin.com.invalid> wrote:
>
>> This is a useful addition. I believe it is important to list the requirements for such an action in greater detail, especially what is in scope and what is not. Some open questions that could be added to the requirements / non-requirements section are:
>>
>> 1. Should the copied table be registered in the same catalog as the source table, or copied into a different catalog for the destination table?
>>    1. This has implications on the table identifier and how the metadata is copied.
>> 2. Are we shooting for perfect query reproducibility for time-travel queries across the source and copied table? I.e., is the snapshot chain on the source table maintained on the copied table?
>>    1. The spec talks about rebuilding metadata, but it would be clearer if it stated whether the entire snapshot chain is maintained, or whether we are rebuilding metadata such that only the data in the current snapshot matches between source and destination.
>> 3. Is this a one-time copy action, or is this something we can run on a schedule, i.e., as new data is written to the source table, incremental deltas (appends, updates, deletes) will be copied?
>>    1. The latter has implications to consider, as various maintenance jobs running on the source and destination can cause the snapshot chains to diverge.
>> At LinkedIn, we ran into the absolute vs. relative path issue when designing snapshot replication for Iceberg tables. The way we approached it is to use the absolute path of the file in the metadata, without the scheme and cluster. We use HadoopFileIO, and the scheme and cluster are derived from the Hadoop conf. For example, if the file path is hdfs://<cluster>/data/openhouse/db/tb_uuid, what is stored in the Iceberg metadata is /data/openhouse/db/tb_uuid, and hdfs://<cluster> comes from the Hadoop conf.
>>
>> Has the community considered an approach where the scheme and cluster are minted by the catalog, to be used in the respective FileIO implementation for the blob stores? For example, if we had a bucket foo in us-east and a bucket bar in us-west, the catalog running in us-east would mint s3://foo, the catalog running in us-west would mint s3://bar, and the S3FileIO would join that with the rest of the relative path to the object. This would allow us to capture the path relative to s3://<bucket-name> in the Iceberg metadata?
>>
>> Thanks,
>> -sumedh
>>
>> From: Pucheng Yang <pucheng.yo...@gmail.com>
>> Date: Thursday, July 11, 2024 at 8:15 AM
>> To: dev@iceberg.apache.org <dev@iceberg.apache.org>
>> Subject: Re: Spark: Copy Table Action
>>
>> Hi Yufei, I was wondering if we also want to support the use case of moving tables in this proposal? For example, users might have various reasons to change a table's location, but there is no good way to move the original data files to a new location short of rewriting the data files, and that seems like a misuse of that functionality.
>>
>> On Wed, Jul 10, 2024 at 9:37 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>
>>>> For RemoveExpiredFiles, I'm admittedly a bit skeptical if it's required, since orphan file removal should be able to clean up the files in the copied table.
>>>> Are we able to elaborate on why there's a concern with removing snapshots on the copied table and subsequently relying on orphan file removal on the copied table to remove the actual files? Is it around listing?
>>>
>>> I have the same concern as Amogh. I already mentioned the same thing in the PR yesterday
>>> <https://github.com/apache/iceberg/pull/10643#discussion_r1669739401>.
>>> I suggested renaming it to *RemoveTableCopyOrphanFiles*. Thinking more on this today, I think we should atomically (implicitly) handle cleaning up of orphan files as part of the copy table action instead of in a separate action.
>>>
>>> Also, very happy to see the progress on this one. This will help users move data from one location to another seamlessly.
>>>
>>> - Ajantha
>>>
>>> On Wed, Jul 10, 2024 at 7:35 AM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>
>>>> Thanks Yufei!
>>>>
>>>> +1 on having a copy table action; I think that's pretty valuable. I have some ideas on interfaces based on previous work I've done for region/multi-cloud replication of Iceberg tables. The absolute vs. relative path discussion is interesting; I have some questions on what relative pathing would look like, but I'll wait for Anurag's input.
>>>>
>>>> On CheckSnapshotIntegrity, I'd probably advocate for a more general "Repair Metadata" procedure. Currently, CheckSnapshotIntegrity just tells a user what files are missing in its output. I think we could go a step further and attempt to handle cases where a manifest entry refers to a file which no longer exists. We could attempt a recovery of that file if the FileIO implementation supports it, via some sort of SupportsRecovery mixin. There's also another corruption case where duplicate file entries end up in manifests; we can define an approach for reconciling that and write out new manifests.
>>>> There have actually been two attempts at this: one from Szehon quite a while back, https://github.com/apache/iceberg/pull/2608, and another more recent one from Matt, https://github.com/apache/iceberg/pull/10445. Perhaps we could review both of these and figure out a path forward?
>>>>
>>>> For just verifying the integrity of the copied table, we could have a dry-run option for the repair metadata operation, which would output any missing files or manifests with duplicates without performing any recovery/fixing up.
>>>>
>>>> For RemoveExpiredFiles, I'm admittedly a bit skeptical that it's required, since orphan file removal should be able to clean up the files in the copied table. Are we able to elaborate on why there's a concern with removing snapshots on the copied table and subsequently relying on orphan file removal on the copied table to remove the actual files? Is it around listing?
>>>>
>>>> Overall this is great to see.
>>>>
>>>> Thanks,
>>>> Amogh Jahagirdar
>>>>
>>>> On Tue, Jul 9, 2024 at 10:59 AM Anurag Mantripragada <amantriprag...@apple.com.invalid> wrote:
>>>>
>>>>> Agreed with Peter. I will bring the relative paths changes up in the next community sync. I will help drive this.
>>>>>
>>>>> ~ Anurag Mantripragada
>>>>>
>>>>> On Jul 8, 2024, at 10:50 PM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>> I think in most cases the copy table action doesn't require a query engine to read and generate the new metadata files. This means it would be nice to provide a pure Java implementation in core, which could be extended/reused by different engines, like Spark, to execute in a distributed manner when distributed execution is needed.
>>>>>
>>>>> About the copy vs. relative path debate:
>>>>> - I have seen the relative path requirement coming up multiple times in the past.
>>>>> It seems like a feature requested by multiple users, so I think it would be best to discuss it in a different thread. The Copy Table Action might be used to move absolute-path tables to relative-path tables when migration is needed.
>>>>>
>>>>> On Mon, Jul 8, 2024, 21:52 Anurag Mantripragada <amantriprag...@apple.com.invalid> wrote:
>>>>>
>>>>>> Hi Yufei,
>>>>>>
>>>>>> Thanks for the proposal. While the actions are great, they still need to do a lot of work that could be reduced if we had the relative path changes. I still support adding these actions, as moving data was out of scope for the relative path design, and we can use these actions as helpers once the spec change is done.
>>>>>>
>>>>>> Anurag Mantripragada
>>>>>>
>>>>>> On Jul 8, 2024, at 10:55 AM, Pucheng Yang <pucheng.yo...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks for picking this up; I think this is a very valuable addition.
>>>>>>
>>>>>> On Mon, Jul 8, 2024 at 10:48 AM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I'd like to share recent progress on adding actions to copy tables across different places.
>>>>>>>
>>>>>>> There is a constant need to copy tables across different places for purposes such as disaster recovery and testing. Due to the absolute file paths in Iceberg metadata, this doesn't work automatically. There are three generic solutions:
>>>>>>> 1. Rebuild the metadata: a proven approach widely used across various companies.
>>>>>>> 2. S3 access points: effective when both the source and target locations are in S3, but not applicable to other storage systems.
>>>>>>> 3. Relative paths: requires changes to the table specification.
>>>>>>>
>>>>>>> We focus on the first approach in this thread.
>>>>>>> While the code was shared two years ago here <https://github.com/apache/iceberg/pull/4705>, it was never merged. We picked it up recently. Here are the active PRs related to this action; we would really appreciate any feedback and review:
>>>>>>>
>>>>>>> - PR to add the CopyTable action: https://github.com/apache/iceberg/pull/10024
>>>>>>> - PR to add the CheckSnapshotIntegrity action: https://github.com/apache/iceberg/pull/10642
>>>>>>> - PR to add the RemoveExpiredFiles action: https://github.com/apache/iceberg/pull/10643
>>>>>>>
>>>>>>> Here is a Google doc with more details clarifying the goals and approach:
>>>>>>> https://docs.google.com/document/d/15oPj7ylgWQG8bhk_5aTjzHl7mlc-9f4OAH-oEpKavSc/edit?usp=sharing
>>>>>>>
>>>>>>> Yufei
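[Editorial note: the relative-path and catalog-minted-scheme ideas discussed in the thread above can be sketched with plain URI handling. This is purely illustrative: the class and method names (`LocationMinting`, `storedPath`, `resolve`) are hypothetical and not part of Iceberg's FileIO API or any of the linked PRs.]

```java
import java.net.URI;

// Sketch of the scheme/cluster-minting idea from the thread: metadata stores
// only the path (e.g. "/data/openhouse/db/tb_uuid"), and the catalog supplies
// the base (e.g. "s3://foo" or "hdfs://cluster") that a FileIO joins back on.
public class LocationMinting {

    /** Drop the scheme and authority, keeping only the path to store in metadata. */
    public static String storedPath(String absoluteLocation) {
        return URI.create(absoluteLocation).getPath();
    }

    /** Join a catalog-minted base with the stored path at read time. */
    public static String resolve(String mintedBase, String storedPath) {
        // Avoid a double slash if the base already ends with one.
        String base = mintedBase.endsWith("/")
                ? mintedBase.substring(0, mintedBase.length() - 1)
                : mintedBase;
        return base + storedPath;
    }
}
```

Under this scheme, a catalog in us-east minting `s3://foo` and one in us-west minting `s3://bar` would resolve the same stored path to different buckets, which is the portability property the thread is after.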