>
> My concern with the per-catalog approach is that people might accidentally
> run it. Do you think it's clear enough that these invocations will drop
> older snapshots?
>
As @Andrea has mentioned, the existing per-table implementation of expire
snapshots should help with that.

> I just think this is a bit more complicated than I want to take into the
> main library just because we have to make decisions about
>
> 1. Retries
> 2. Concurrency
> 3. Results/Error Reporting
>

I haven't brainstormed much about the implementation yet. But as you
mentioned, the bigger challenge is around *concurrency*: any mishandling
there can cause OOMs. It would also be difficult to *report* all the
failed tables/snapshots.
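To make the concern concrete, here is a rough sketch of what bounded concurrency with per-table result/error collection might look like. All names here (`expire_for_table`, `bulk_expire`) are hypothetical stand-ins for illustration, not actual Iceberg APIs:

```python
# Sketch: run per-table expiration with a bounded thread pool so we never
# hold more than max_workers expirations in flight (limiting memory use),
# and collect per-table failures instead of aborting on the first error.
from concurrent.futures import ThreadPoolExecutor, as_completed

def expire_for_table(table: str) -> int:
    # Hypothetical stand-in for the real per-table call, e.g.
    # CALL catalog.system.expire_snapshots(table => ...).
    # Here it just pretends one snapshot was removed per healthy table.
    if table.endswith("broken"):
        raise RuntimeError(f"cannot expire snapshots for {table}")
    return 1

def bulk_expire(tables, max_workers=4):
    """Expire snapshots for many tables, returning (successes, failures)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(expire_for_table, t): t for t in tables}
        for fut in as_completed(futures):
            t = futures[fut]
            try:
                results[t] = fut.result()
            except Exception as exc:
                # Record and continue: one bad table should not stop the run.
                failures[t] = str(exc)
    return results, failures

ok, failed = bulk_expire(["db1.a", "db1.b", "db2.broken"])
```

The same skeleton is where retry policy would plug in (wrap `expire_for_table`), which is why those three decisions are coupled.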

> Also iceberg catalog supports nested namespace, so maybe we need to
> consider more general syntax for only database, table levels.
>
Agreed.
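For nested namespaces, the bulk procedure would presumably need to walk the namespace tree rather than assume a flat `db.table` layout. A minimal sketch of that expansion, with made-up in-memory stand-ins for the catalog's namespace/table listing calls:

```python
# Sketch: expand nested namespaces into a flat list of fully qualified
# table names. NAMESPACES/TABLES are fake catalog data for illustration;
# a real implementation would call the catalog's listing APIs instead.
NAMESPACES = {
    (): [("db",)],
    ("db",): [("db", "inner")],
    ("db", "inner"): [],
}
TABLES = {
    ("db",): ["t1"],
    ("db", "inner"): ["t2"],
}

def list_namespaces(parent=()):
    return NAMESPACES.get(parent, [])

def list_tables(ns):
    return TABLES.get(ns, [])

def all_tables(parent=()):
    """Walk nested namespaces depth-first, yielding dotted table names."""
    for ns in list_namespaces(parent):
        for t in list_tables(ns):
            yield ".".join(ns + (t,))
        yield from all_tables(ns)

tables = sorted(all_tables())  # ['db.inner.t2', 'db.t1']
```

A more general syntax could then accept a namespace prefix and expire everything beneath it.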

I started this discussion to gather opinions on the idea of GC per catalog.
If this sounds like a good use case, I can start spending time on the
complexity and challenges.

Please advise.

Regards,
Naveen Kumar



On Thu, Dec 7, 2023 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Also iceberg catalog supports nested namespace, so maybe we need to
> consider more general syntax for only database, table levels.
>
> On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I just think this is a bit more complicated than I want to take into the
>> main library just because we have to make decisions about
>>
>> 1. Retries
>> 2. Concurrency
>> 3. Results/Error Reporting
>>
> But if we have a good proposal for how we will handle all those I think we
> could do it?
>>
>> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com>
>> wrote:
>>
>> I think that if you call an expire snapshots function this is exactly
>> what you want
>>
>> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io> wrote:
>>
>>> My concern with the per-catalog approach is that people might
>>> accidentally run it. Do you think it's clear enough that these invocations
>>> will drop older snapshots?
>>>
>>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <
>>> acampolon...@gmail.com> wrote:
>>>
>>>> I like this approach. + 1
>>>>
>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> Currently Spark-Procedures supports *expire_snapshots/remove_orphan_files*
>>>> per table.
>>>>
>>>> Today, if someone has to run GCs on an entire catalog they will have to
>>>> manually run these procedures for every table.
>>>>
>>>> Is it a good idea to do it in bulk as per catalog or with multiple
>>>> tables ?
>>>>
>>>> Current syntax:
>>>>
>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>>
>>>> Proposed Syntax something similar:
>>>>
>>>> Per Namespace/Database
>>>>
>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>>
>>>> Per Catalog
>>>>
>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>>
>>>> Multiple Tables
>>>>
>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
>>>> 'db2.table2'), <Options>)
>>>>
>>>> PS: There could be exceptions for individual catalogs. For example,
>>>> Nessie doesn't support GC other than via the Nessie CLI, and Hadoop
>>>> catalogs can't list all the namespaces.
>>>>
>>>>
>>>> Regards,
>>>> Naveen Kumar
>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>>
