> My concern with the per-catalog approach is that people might accidentally
> run it. Do you think it's clear enough that these invocations will drop
> older snapshots?

As @Andrea has mentioned, the existing per-table expire_snapshots implementation should help here: users already run it knowing it drops older snapshots, and a catalog-level invocation would carry the same expectation.

> I just think this is a bit more complicated than I want to take into the
> main library just because we have to make decisions about
>
> 1. Retries
> 2. Concurrency
> 3. Results/Error Reporting

I haven't brainstormed the implementation much yet. But as you mentioned, the bigger challenge is *concurrency*: mishandling it across a large catalog can cause OOM. It would also be difficult to *report* all the failed tables/snapshots back to the caller.
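To make this a bit more concrete, below is a very rough sketch (in Java, against the existing Catalog / SupportsNamespaces / Table.expireSnapshots() APIs) of what a catalog-level loop might look like. The class name, the Result record, and the parallelism/olderThanMillis parameters are just placeholders I made up to illustrate the concurrency and error-reporting questions; this is not a proposed implementation, and a real procedure would presumably delegate to the existing Spark actions rather than the table API.

// Very rough sketch only -- CatalogExpireSketch, Result, expireAll, etc. are
// placeholder names, not a proposal for the final implementation.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.SupportsNamespaces;
import org.apache.iceberg.catalog.TableIdentifier;

public class CatalogExpireSketch {

  // Per-table outcome so the procedure can report failures instead of aborting.
  public record Result(TableIdentifier table, boolean success, String error) {}

  public static List<Result> expireAll(Catalog catalog, long olderThanMillis, int parallelism)
      throws Exception {
    // Bounded pool: parallelism is what keeps memory and connection usage in check.
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<Result>> futures = new ArrayList<>();
      for (TableIdentifier ident : listAllTables(catalog)) {
        futures.add(pool.submit(expireOne(catalog, ident, olderThanMillis)));
      }
      List<Result> results = new ArrayList<>();
      for (Future<Result> future : futures) {
        results.add(future.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  private static Callable<Result> expireOne(
      Catalog catalog, TableIdentifier ident, long olderThanMillis) {
    return () -> {
      try {
        Table table = catalog.loadTable(ident);
        table.expireSnapshots().expireOlderThan(olderThanMillis).commit();
        return new Result(ident, true, null);
      } catch (Exception e) {
        // Collect the failure and keep going; one bad table should not fail the whole run.
        return new Result(ident, false, e.getMessage());
      }
    };
  }

  // Recursively walk nested namespaces (e.g. a.b.c) and collect every table.
  private static List<TableIdentifier> listAllTables(Catalog catalog) {
    SupportsNamespaces nsCatalog = (SupportsNamespaces) catalog;
    List<TableIdentifier> tables = new ArrayList<>();
    collect(catalog, nsCatalog, Namespace.empty(), tables);
    return tables;
  }

  private static void collect(
      Catalog catalog, SupportsNamespaces nsCatalog, Namespace parent, List<TableIdentifier> tables) {
    for (Namespace child : nsCatalog.listNamespaces(parent)) {
      tables.addAll(catalog.listTables(child));
      collect(catalog, nsCatalog, child, tables);
    }
  }
}

The bounded pool is what would keep memory under control on large catalogs, the per-table Result list is one possible answer to error reporting, and retries could be layered on top of expireOne(). The recursive namespace walk in listAllTables() is also roughly how nested namespaces (see Renjie's point below) could be handled on the implementation side, though the question of SQL syntax for namespace levels still stands.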
> Also iceberg catalog supports nested namespace, so maybe we need to
> consider more general syntax for only database, table levels.

Agreed. I started this discussion to get people's opinions on the idea of GC per catalog. If it does sound like a good use case, I can start spending time on the complexity and challenges above. Please advise.

Regards,
Naveen Kumar


On Thu, Dec 7, 2023 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com> wrote:

> Also iceberg catalog supports nested namespace, so maybe we need to
> consider more general syntax for only database, table levels.
>
> On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I just think this is a bit more complicated than I want to take into the
>> main library just because we have to make decisions about
>>
>> 1. Retries
>> 2. Concurrency
>> 3. Results/Error Reporting
>>
>> But if we have a good proposal for how we will handle all those, I think
>> we could do it?
>>
>> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com>
>> wrote:
>>
>> I think that if you call an expire snapshots function this is exactly
>> what you want.
>>
>> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io> wrote:
>>
>>> My concern with the per-catalog approach is that people might
>>> accidentally run it. Do you think it's clear enough that these invocations
>>> will drop older snapshots?
>>>
>>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <
>>> acampolon...@gmail.com> wrote:
>>>
>>>> I like this approach. +1
>>>>
>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> Currently, Spark procedures support *expire_snapshots/remove_orphan_files*
>>>> per table.
>>>>
>>>> Today, if someone has to run GC on an entire catalog, they have to
>>>> run these procedures manually for every table.
>>>>
>>>> Is it a good idea to do it in bulk, per catalog or for multiple
>>>> tables?
>>>>
>>>> Current syntax:
>>>>
>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>>
>>>> Proposed syntax, something similar to:
>>>>
>>>> Per Namespace/Database
>>>>
>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>>
>>>> Per Catalog
>>>>
>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>>
>>>> Multiple Tables
>>>>
>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
>>>> 'db2.table2'), <Options>)
>>>>
>>>> PS: There could be exceptions for individual catalogs. For example, Nessie
>>>> doesn't support GC other than via the Nessie CLI, and Hadoop can't list
>>>> all the namespaces.
>>>>
>>>> Regards,
>>>> Naveen Kumar
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular