Running GC across the entire catalog is something I have always wanted to
explore, because of one particular benefit:

People typically use one S3 bucket for all tables in a catalog, so you can
run a JOIN of the union of all tables' files metadata tables against the S3
inventory list
<https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html>
to compute the file diff and avoid the time-consuming file listing during
orphan file removal. This process can be fully distributed and avoids OOM
errors. I had an issue tracking it in the past:
https://github.com/apache/iceberg/issues/7111

Any thoughts?

-Jack
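A minimal sketch of the anti-join Jack describes, assuming each table's files
metadata tables (together with manifests, manifest lists, and metadata files,
which must also be counted as referenced) have been unioned into a view named
all_referenced_files, and the S3 inventory report has been registered as
s3_inventory; both names are hypothetical:

-- Orphan candidates: objects in the bucket that no table metadata references.
-- The age filter is a grace period so files from in-flight commits are kept.
SELECT inv.key
FROM s3_inventory inv
LEFT ANTI JOIN all_referenced_files ref
  ON concat('s3://', inv.bucket, '/', inv.key) = ref.file_path
WHERE inv.last_modified_date < date_sub(current_date(), 3)

Because both sides are ordinary tables, Spark can distribute this join across
executors, which is what avoids the exhaustive file listing and the OOM risk
of the current per-table orphan file removal.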
On Thu, Dec 7, 2023 at 10:40 AM Yufei Gu <flyrain...@gmail.com> wrote:

> Error reporting and retries are tricky.
> 1. Do we retry from the beginning or from the middle if the procedure
> fails partway through expiring snapshots for a catalog? If we restart
> from the beginning, some tables may never get GCed.
> 2. Users may be more interested in specific tables getting GCed than in
> the whole catalog. How do we report the results back to users?
>
> I feel like this eventually becomes a service instead of a single
> procedure, considering the complexity. That's also what Iceberg
> infrastructure teams usually do.
>
> Yufei
>
>
> On Wed, Dec 6, 2023 at 11:21 PM Naveen Kumar <nk1...@gmail.com> wrote:
>
>> My concern with the per-catalog approach is that people might
>>> accidentally run it. Do you think it's clear enough that these
>>> invocations will drop older snapshots?
>>>
>> As @Andrea has mentioned, the existing implementation of expiring
>> snapshots on every table should help.
>>
>> I just think this is a bit more complicated than I want to take into the
>>> main library, just because we have to make decisions about
>>>
>>> 1. Retries
>>> 2. Concurrency
>>> 3. Results/Error Reporting
>>>
>>
>> I haven't brainstormed much about the implementation as of now. But as
>> you mentioned, the bigger challenge is around *concurrency*: any
>> mishandling of it can cause OOMs. It would also be difficult to *report*
>> all the failed tables/snapshots.
>>
>> Also, the Iceberg catalog supports nested namespaces, so maybe we need
>>> to consider more general syntax beyond just the database and table
>>> levels.
>>>
>> Agreed.
>>
>> I started this discussion to gather opinions on the idea of GC per
>> catalog. If this does sound like a good use case, I can start spending
>> time on the complexity and challenges.
>>
>> Please advise.
>>
>> Regards,
>> Naveen Kumar
>>
>>
>>
>> On Thu, Dec 7, 2023 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Also, the Iceberg catalog supports nested namespaces, so maybe we need
>>> to consider more general syntax beyond just the database and table
>>> levels.
>>>
>>> On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I just think this is a bit more complicated than I want to take into
>>>> the main library, just because we have to make decisions about
>>>>
>>>> 1. Retries
>>>> 2. Concurrency
>>>> 3. Results/Error Reporting
>>>>
>>>> But if we have a good proposal for how we will handle all of those, I
>>>> think we could do it.
>>>>
>>>> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com>
>>>> wrote:
>>>>
>>>> I think that if you call an expire snapshots function, this is exactly
>>>> what you want.
>>>>
>>>> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> My concern with the per-catalog approach is that people might
>>>>> accidentally run it. Do you think it's clear enough that these
>>>>> invocations will drop older snapshots?
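For context on the safety question: the per-table remove_orphan_files
procedure already has a documented dry_run option that only lists what would
be deleted, while expire_snapshots currently has no equivalent preview mode,
so an accidental catalog-wide expiration could not be previewed first. A dry
run today looks like:

CALL hive_prod.system.remove_orphan_files(table => 'db.sample', dry_run => true)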
>>>>>
>>>>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <
>>>>> acampolon...@gmail.com> wrote:
>>>>>
>>>>>> I like this approach. +1
>>>>>>
>>>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> Currently, Spark procedures support *expire_snapshots/remove_orphan_files*
>>>>>> per table.
>>>>>>
>>>>>> Today, if someone has to run GC on an entire catalog, they have to
>>>>>> manually run these procedures for every table.
>>>>>>
>>>>>> Is it a good idea to do this in bulk, per catalog or for multiple
>>>>>> tables?
>>>>>>
>>>>>> Current syntax:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>>>>
>>>>>> Proposed syntax, something similar to:
>>>>>>
>>>>>> Per namespace/database:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>>>>
>>>>>> Per catalog:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>>>>
>>>>>> Multiple tables:
>>>>>>
>>>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
>>>>>> 'db2.table2'), <Options>)
>>>>>>
>>>>>> PS: There could be exceptions for individual catalogs. For example,
>>>>>> Nessie doesn't support GC other than through the Nessie CLI, and
>>>>>> Hadoop can't list all the namespaces.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Naveen Kumar
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
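For reference, the <Options> placeholder in the proposal maps to the existing
per-table parameters such as the documented older_than and retain_last; a bulk
variant would presumably pass the same options through to every table it
touches. A sketch of today's per-table call with those options:

CALL hive_prod.system.expire_snapshots(
  table => 'db.sample',
  older_than => TIMESTAMP '2023-11-01 00:00:00.000',
  retain_last => 2
)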