Running GC across the entire catalog is something I have always wanted to
explore, because of one particular benefit:

People typically use one S3 bucket for all tables in a catalog, so you can
run a JOIN of the union of all tables' files metadata tables against the S3
inventory list
<https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html>
to compute the file diff and avoid the time-consuming file listing during
orphan file removal. This process can be fully distributed and avoids OOM
errors. I had an issue tracking it in the past:
https://github.com/apache/iceberg/issues/7111

Any thoughts?

-Jack
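A minimal sketch of the anti-join Jack describes, assuming each table's files
metadata tables (together with manifests, manifest lists, and metadata files,
which must also be counted as referenced) have been unioned into a view named
all_referenced_files, and the S3 inventory report has been registered as
s3_inventory; both names are hypothetical:

-- Orphan candidates: objects in the bucket that no table metadata references.
-- The age filter is a grace period so files from in-flight commits are kept.
SELECT inv.key
FROM s3_inventory inv
LEFT ANTI JOIN all_referenced_files ref
  ON concat('s3://', inv.bucket, '/', inv.key) = ref.file_path
WHERE inv.last_modified_date < date_sub(current_date(), 3)

Because both sides are ordinary tables, Spark can distribute this join across
executors, which is what avoids the exhaustive file listing and the OOM risk
of the current per-table orphan file removal.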
On Thu, Dec 7, 2023 at 10:40 AM Yufei Gu <flyrain...@gmail.com> wrote:

> Error reporting and retries are tricky.
> 1. Do we retry from the beginning or from the middle if the procedure
> fails partway through expiring snapshots for a catalog? If we restart
> from the beginning, some tables may never get GCed.
> 2. Users may be more interested in specific tables getting GCed than in
> the whole catalog. How do we report the results back to users?
>
> I feel like this eventually becomes a service instead of a single
> procedure, considering the complexity. That's also what Iceberg
> infrastructure teams usually do.
>
> Yufei
>
>
> On Wed, Dec 6, 2023 at 11:21 PM Naveen Kumar <nk1...@gmail.com> wrote:
>
>> My concern with the per-catalog approach is that people might
>>> accidentally run it. Do you think it's clear enough that these
>>> invocations will drop older snapshots?
>>>
>> As @Andrea has mentioned, the existing implementation of expiring
>> snapshots on every table should help.
>>
>> I just think this is a bit more complicated than I want to take into the
>>> main library, just because we have to make decisions about
>>>
>>> 1. Retries
>>> 2. Concurrency
>>> 3. Results/Error Reporting
>>>
>>
>> I haven't brainstormed much about the implementation as of now. But as
>> you mentioned, the bigger challenge is around *concurrency*: any
>> mishandling of it can cause OOMs. It would also be difficult to *report*
>> all the failed tables/snapshots.
>>
>> Also, the Iceberg catalog supports nested namespaces, so maybe we need
>>> to consider more general syntax beyond just the database and table
>>> levels.
>>>
>> Agreed.
>>
>> I started this discussion to gather opinions on the idea of GC per
>> catalog. If this does sound like a good use case, I can start spending
>> time on the complexity and challenges.
>>
>> Please advise.
>>
>> Regards,
>> Naveen Kumar
>>
>>
>>
>> On Thu, Dec 7, 2023 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Also, the Iceberg catalog supports nested namespaces, so maybe we need
>>> to consider more general syntax beyond just the database and table
>>> levels.
>>>
>>> On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I just think this is a bit more complicated than I want to take into
>>>> the main library, just because we have to make decisions about
>>>>
>>>> 1. Retries
>>>> 2. Concurrency
>>>> 3. Results/Error Reporting
>>>>
>>>> But if we have a good proposal for how we will handle all of those, I
>>>> think we could do it.
>>>>
>>>> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com>
>>>> wrote:
>>>>
>>>> I think that if you call an expire snapshots function, this is exactly
>>>> what you want.
>>>>
>>>> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> My concern with the per-catalog approach is that people might
>>>>> accidentally run it. Do you think it's clear enough that these
>>>>> invocations will drop older snapshots?
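For context on the safety question: the per-table remove_orphan_files
procedure already has a documented dry_run option that only lists what would
be deleted, while expire_snapshots currently has no equivalent preview mode,
so an accidental catalog-wide expiration could not be previewed first. A dry
run today looks like:

CALL hive_prod.system.remove_orphan_files(table => 'db.sample', dry_run => true)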
>>>>>
>>>>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <
>>>>> acampolon...@gmail.com> wrote:
>>>>>
>>>>>> I like this approach. +1
>>>>>>
>>>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> Currently, Spark procedures support *expire_snapshots/remove_orphan_files*
>>>>>> per table.
>>>>>>
>>>>>> Today, if someone has to run GC on an entire catalog, they have to
>>>>>> manually run these procedures for every table.
>>>>>>
>>>>>> Is it a good idea to do this in bulk, per catalog or for multiple
>>>>>> tables?
>>>>>>
>>>>>> Current syntax:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>>>>
>>>>>> Proposed syntax, something similar to:
>>>>>>
>>>>>> Per namespace/database:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>>>>
>>>>>> Per catalog:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>>>>
>>>>>> Multiple tables:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
>>>>>> 'db2.table2'), <Options>)
>>>>>>
>>>>>> PS: There could be exceptions for individual catalogs. For example,
>>>>>> Nessie doesn't support GC other than through the Nessie CLI, and
>>>>>> Hadoop can't list all the namespaces.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Naveen Kumar
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
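For reference, the <Options> placeholder in the proposal maps to the existing
per-table parameters such as the documented older_than and retain_last; a bulk
variant would presumably pass the same options through to every table it
touches. A sketch of today's per-table call with those options:

CALL hive_prod.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2023-11-01 00:00:00.000',
  retain_last => 2
)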