Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-07 Thread Jack Ye
Regarding the two points Yufei brought up: with the inventory-list approach, I think it would offer the following experience:
*1. Do we retry from the beginning or from the middle if the procedure failed in the middle while expiring snapshots for one catalog? If we started from the beginning, some tables m

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-07 Thread Jack Ye
Running GC across the entire catalog has always been something I have wanted to explore, because of one particular benefit: people typically use one S3 bucket for all tables in a catalog, and you can run a JOIN of the union of all files metadata table against the S3 inventory list
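
The comparison described above amounts to an anti-join: anything listed in the bucket inventory that no table references is an orphan candidate. Below is a minimal sketch of that idea using plain Python sets. The function name `find_orphan_candidates` and the example paths are hypothetical, and it assumes the per-table file lists already include every path a table still needs (data files as well as metadata files); a real run would read those lists from the tables' metadata and from the S3 inventory report.

```python
from typing import Iterable, Set


def find_orphan_candidates(
    referenced_by_table: Iterable[Iterable[str]],
    inventory: Iterable[str],
) -> Set[str]:
    """Return inventory paths that no table references (hypothetical helper)."""
    # Union of the file paths referenced by every table in the catalog.
    referenced: Set[str] = set()
    for paths in referenced_by_table:
        referenced.update(paths)
    # Anti-join: keep only inventory entries with no match in the union.
    return set(inventory) - referenced
```

The result is only a list of candidates; files written by in-flight commits would also appear in it, so a real job would likely filter by file age before deleting anything.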

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-07 Thread Yufei Gu
Error reporting and retry are tricky.
1. Do we retry from the beginning or from the middle if the procedure failed in the middle while expiring snapshots for one catalog? If we started from the beginning, some tables may never get GCed.
2. Users may be more interested in some tables getting GCed instea
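
One way to picture the "retry from the middle" option raised in point 1 is a run that records which tables have already been processed and skips them on the next attempt. The sketch below is illustrative only: `run_gc_resumable`, the `expire` callback, and the `done` set are hypothetical names, not part of any Iceberg API, and a real job would persist the set somewhere durable.

```python
from typing import Callable, Iterable, List, Set, Tuple


def run_gc_resumable(
    tables: Iterable[str],
    expire: Callable[[str], None],
    done: Set[str],
) -> List[Tuple[str, str]]:
    """Run `expire` on each table not yet in `done`; return (table, error) pairs."""
    failed: List[Tuple[str, str]] = []
    for table in tables:
        if table in done:
            continue  # already handled by an earlier run, so resume past it
        try:
            expire(table)
            done.add(table)  # checkpoint only after the table succeeds
        except Exception as exc:
            # Record the failure and keep going so later tables still get GCed.
            failed.append((table, str(exc)))
    return failed
```

Calling it again with the same `done` set retries only the tables that failed, which avoids the situation where restarting from the beginning leaves later tables never reached.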

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Naveen Kumar
> > My concern with the per-catalog approach is that people might accidentally run it. Do you think it's clear enough that these invocations will drop older snapshots?
>
As @Andrea has mentioned, the existing implementation of expire snapshots on every table should help. I just think this is a

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Renjie Liu
Also, Iceberg catalogs support nested namespaces, so maybe we need to consider a more general syntax than only database and table levels.
On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer wrote:
> I just think this is a bit more complicated than I want to take into the main library just because we have t

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Russell Spitzer
I just think this is a bit more complicated than I want to take into the main library, just because we have to make decisions about:
1. Retries
2. Concurrency
3. Results/Error Reporting
But if we have a good proposal for how we will handle all of those, I think we could do it?
> On Dec 6, 2023, at 2:05
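
To make the three decisions above concrete, here is a rough sketch of one possible shape: a bounded retry per table, a thread pool for concurrency, and one result record per table. Every name in it (`GcResult`, `gc_one`, `gc_many`, the `action` callback) is hypothetical and chosen for illustration; it is not a proposal that appeared in the thread.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class GcResult:
    table: str
    ok: bool
    attempts: int
    error: Optional[str] = None


def gc_one(table: str, action: Callable[[str], None], max_attempts: int) -> GcResult:
    """Retries: try `action` up to `max_attempts` times (assumed >= 1)."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            action(table)
            return GcResult(table, True, attempt)
        except Exception as exc:
            last_error = str(exc)
    return GcResult(table, False, max_attempts, last_error)


def gc_many(
    tables: Iterable[str],
    action: Callable[[str], None],
    max_attempts: int = 3,
    workers: int = 4,
) -> List[GcResult]:
    """Concurrency: process tables in a thread pool. Reporting: one result per table."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input tables.
        return list(pool.map(lambda t: gc_one(t, action, max_attempts), list(tables)))
```

Returning a result per table, instead of raising on the first failure, lets the caller decide which failures matter and which tables to rerun.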

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Andrea Campolonghi
I think that if you call an expire snapshots function, this is exactly what you want.
On Wed, Dec 6, 2023 at 18:47 Ryan Blue wrote:
> My concern with the per-catalog approach is that people might accidentally run it. Do you think it's clear enough that these invocations will drop older snapsho

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Ryan Blue
My concern with the per-catalog approach is that people might accidentally run it. Do you think it's clear enough that these invocations will drop older snapshots?
On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi wrote:
> I like this approach. +1
>
> On 6 Dec 2023, at 11:37, naveen wrote:
>
>

Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Andrea Campolonghi
I like this approach. +1
> On 6 Dec 2023, at 11:37, naveen wrote:
>
> Hi Everyone,
>
> Currently Spark-Procedures supports expire_snapshots/remove_orphan_files per table.
>
> Today, if someone has to run GCs on an entire catalog they will have to manually run these procedures for every

[DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread naveen
Hi Everyone,
Currently Spark-Procedures supports *expire_snapshots/remove_orphan_files* per table.
Today, if someone has to run GCs on an entire catalog, they will have to manually run these procedures for every table.
Is it a good idea to do it in bulk, per catalog or with multiple tables? C
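
For context, the manual per-table workflow described above looks roughly like the sketch below, which only builds the `CALL` statements for the two procedures named in the message. The helper `gc_statements` and the catalog and table names are hypothetical; the statement form `CALL <catalog>.system.<procedure>(table => '...')` follows the documented Iceberg Spark procedure syntax.

```python
from typing import Iterable, List, Sequence


def gc_statements(
    catalog: str,
    tables: Iterable[str],
    procedures: Sequence[str] = ("expire_snapshots", "remove_orphan_files"),
) -> List[str]:
    """Build one CALL statement per (table, procedure) pair (hypothetical helper)."""
    return [
        f"CALL {catalog}.system.{proc}(table => '{table}')"
        for table in tables
        for proc in procedures
    ]


if __name__ == "__main__":
    # Example with made-up names; print the statements instead of running them.
    for stmt in gc_statements("my_catalog", ["db.orders", "db.customers"]):
        print(stmt)
```

In a Spark session configured with the Iceberg catalog, each string could be passed to `spark.sql(...)`. Both procedures delete data with their default settings, so a real script would normally also set retention arguments explicitly.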