Regarding the two points Yufei brought up, here is the experience I think
the inventory-list approach would offer:

*1. Do we retry from the beginning or from the middle if the procedure
failed in the middle while expiring snapshots for one catalog? If we
started from the beginning, some tables may never get GCed.*
We need the read of the files metadata table to succeed for each table. If
any read fails, the file diff cannot be derived, so deletion will not run;
this avoids accidentally deleting data of a table whose metadata could not
be read. If the deletion process itself fails, that is fine, because the
next execution can catch up.
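The behavior above can be sketched in plain Python, with in-memory sets standing in for the files metadata table and the S3 inventory list (all names here are illustrative, not an actual Iceberg API):

```python
def plan_orphan_deletes(inventory, table_readers):
    # `inventory` is the set of all object keys in the bucket (from the S3
    # inventory list). `table_readers` maps table name -> a callable that
    # returns the set of files referenced by that table's files metadata
    # table, raising if the read fails.
    referenced, failed = set(), []
    for name, read_files in table_readers.items():
        try:
            referenced |= read_files()
        except Exception:
            # A failed read means the diff cannot be derived for this table.
            failed.append(name)
    if failed:
        # Delete nothing: files of an unreadable table would look orphaned.
        return set(), failed
    # The diff: files present in the bucket but referenced by no table.
    return inventory - referenced, failed

inventory = {"a/1", "a/2", "b/1"}
readers = {"t1": lambda: {"a/1"}, "t2": lambda: {"b/1"}}
to_delete, failed = plan_orphan_deletes(inventory, readers)
# to_delete == {"a/2"}, failed == []
```

The key point is that the whole run degrades to a no-op on any metadata read failure, so retrying later is always safe.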

*2. Users may be more interested in some tables getting GCed instead of the
whole catalog. How do we report the results back to users?*
This mode requires expiring all tables in the same bucket, unless we start
to enforce some concept of table location ownership. But it seems like we
have been thinking about that for format v3 already.

Of course, what I describe is more about removing orphan files. In this
approach, snapshot expiration does not delete any data file; it only
deletes references to the expired snapshots and manifests in the table
metadata tree. That makes snapshot expiration relatively lightweight, and a
batch execution could be less risky.
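As a toy illustration of that split (simplified structures, not the real table metadata format): expiration only drops snapshot and manifest references, and the unreferenced bytes on storage become orphans for a later inventory-diff pass to pick up.

```python
def expire_snapshots(metadata, keep_ids):
    # Metadata-only expiration: drop references to expired snapshots and
    # their manifests; data files on storage are never touched here.
    kept = [s for s in metadata["snapshots"] if s["id"] in keep_ids]
    return {"snapshots": kept}

metadata = {"snapshots": [
    {"id": 1, "manifests": ["m1.avro"]},
    {"id": 2, "manifests": ["m2.avro"]},
]}
new_metadata = expire_snapshots(metadata, keep_ids={2})
# Only snapshot 2 stays referenced; m1.avro and the data files it pointed
# to are now unreferenced, to be cleaned up by the orphan-file pass.
```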

-Jack

On Thu, Dec 7, 2023 at 10:50 AM Jack Ye <yezhao...@gmail.com> wrote:

> Running GC across the entire catalog has always been something I want to
> explore, because of one particular benefit:
>
> People typically use one S3 bucket for all tables in a catalog, and you
> can run a JOIN of the union of all files metadata table against the S3
> inventory list
> <https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html>
> to figure out file diff, and avoid time-consuming file listing during
> orphan file removal. This process can be fully distributed and avoid OOM
> errors.
>
> I had an issue tracking it in the past:
> https://github.com/apache/iceberg/issues/7111
>
> Any thoughts?
>
> -Jack
>
> On Thu, Dec 7, 2023 at 10:40 AM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Error reporting and retry are tricky.
>> 1. Do we retry from the beginning or from the middle if the procedure
>> failed in the middle while expiring snapshots for one catalog? If we
>> started from the beginning, some tables may never get GCed.
>> 2. Users may be more interested in some tables getting GCed instead of
>> the whole catalog. How do we report the results back to users?
>>
>> I feel like it eventually becomes a service instead of a single
>> procedure, considering the complexity. That's also what
>> Iceberg infrastructure teams usually do.
>>
>> Yufei
>>
>>
>> On Wed, Dec 6, 2023 at 11:21 PM Naveen Kumar <nk1...@gmail.com> wrote:
>>
>>> My concern with the per-catalog approach is that people might
>>>> accidentally run it. Do you think it's clear enough that these invocations
>>>> will drop older snapshots?
>>>>
>>> As @Andrea has mentioned, the existing implementation of expire
>>> snapshots on every table should help.
>>>
>>> I just think this is a bit more complicated than I want to take into the
>>>> main library just because we have to make decisions about
>>>>
>>>> 1. Retries
>>>> 2. Concurrency
>>>> 3. Results/Error Reporting
>>>>
>>>
>>> I haven't brainstormed much about the implementation as of now. But as
>>> you mentioned, the bigger challenge is around *concurrency*: any
>>> mishandling of it can cause OOM. It would also be difficult to *report*
>>> all the failed tables/snapshots.
>>>
>>> Also, the Iceberg catalog supports nested namespaces, so maybe we need to
>>>> consider a more general syntax beyond just the database and table levels.
>>>>
>>> Agreed.
>>>
>>> I started this discussion to gather individual opinions on the
>>> idea of GC per catalog. If this sounds like a good use case, I can start
>>> spending time on the complexity and challenges.
>>>
>>> Please advise.
>>>
>>> Regards,
>>> Naveen Kumar
>>>
>>>
>>>
>>> On Thu, Dec 7, 2023 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com>
>>> wrote:
>>>
>>>> Also, the Iceberg catalog supports nested namespaces, so maybe we need to
>>>> consider a more general syntax beyond just the database and table levels.
>>>>
>>>> On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> I just think this is a bit more complicated than I want to take into
>>>>> the main library just because we have to make decisions about
>>>>>
>>>>> 1. Retries
>>>>> 2. Concurrency
>>>>> 3. Results/Error Reporting
>>>>>
>>>>> But if we have a good proposal for how we will handle all of those, I
>>>>> think we could do it?
>>>>>
>>>>> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I think that if you call an expire snapshots function, this is exactly
>>>>> what you want.
>>>>>
>>>>> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> My concern with the per-catalog approach is that people might
>>>>>> accidentally run it. Do you think it's clear enough that these 
>>>>>> invocations
>>>>>> will drop older snapshots?
>>>>>>
>>>>>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <
>>>>>> acampolon...@gmail.com> wrote:
>>>>>>
>>>>>>> I like this approach. +1
>>>>>>>
>>>>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> Currently, Spark procedures support *expire_snapshots/remove_orphan_files*
>>>>>>> per table.
>>>>>>>
>>>>>>> Today, if someone has to run GC on an entire catalog, they have to
>>>>>>> manually run these procedures for every table.
>>>>>>>
>>>>>>> Is it a good idea to do it in bulk, per catalog or for multiple
>>>>>>> tables?
>>>>>>>
>>>>>>> Current syntax:
>>>>>>>
>>>>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>>>>>
>>>>>>> Proposed syntax, something similar to:
>>>>>>>
>>>>>>> Per Namespace/Database
>>>>>>>
>>>>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>>>>>
>>>>>>> Per Catalog
>>>>>>>
>>>>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>>>>>
>>>>>>> Multiple Tables
>>>>>>>
>>>>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
>>>>>>> 'db2.table2'), <Options>)
>>>>>>>
>>>>>>> PS: There could be exceptions for individual catalogs. For example,
>>>>>>> Nessie doesn't support GC other than through the Nessie CLI, and
>>>>>>> Hadoop can't list all the namespaces.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Naveen Kumar
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>>
