Error reporting and retries are tricky.

1. If the procedure fails partway through expiring snapshots for a catalog, do we retry from the beginning or from where it stopped? If we always restart from the beginning, some tables may never get GCed.
2. Users may care more about specific tables getting GCed than about the whole catalog. How do we report the results back to them?
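For illustration only (neither this output nor the database/tables arguments exist today; they are only proposed, and every table name except db.sample is made up), a database-level call could return one row per table so that partial progress is visible and only the failures need to be retried:

CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)

Possible result, one row per table:

db.sample  | SUCCESS | deleted_data_files_count=12
db.events  | FAILED  | <error message>

Then retry only the failed tables with the proposed multi-table form:

CALL hive_prod.system.expire_snapshots(tables => Array('db.events'), <Options>)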
I feel like it becomes a service eventually instead of a single procedure, considering the complexity. That's also what the Iceberg infrastructure teams usually do.

Yufei

On Wed, Dec 6, 2023 at 11:21 PM Naveen Kumar <nk1...@gmail.com> wrote:

> My concern with the per-catalog approach is that people might accidentally
>> run it. Do you think it's clear enough that these invocations will drop
>> older snapshots?
>>
> As @Andrea has mentioned, the existing implementation of expire snapshots
> on every table should help.
>
> I just think this is a bit more complicated than I want to take into the
>> main library just because we have to make decisions about
>>
>> 1. Retries
>> 2. Concurrency
>> 3. Results/Error Reporting
>>
>
> I haven't brainstormed much about the implementation as of now. But as you
> mentioned, the bigger challenge is around *Concurrency*; any mishandling
> of it can cause OOM. It would also be difficult to *report* all the
> failed tables/snapshots.
>
> Also, the Iceberg catalog supports nested namespaces, so maybe we need to
>> consider a more general syntax than only database and table levels.
>>
> Agreed.
>
> I started this discussion to take the opinions of individuals on the idea
> of GC per catalog. If this does sound like a good use case, I can start
> spending time on the complexity and challenges.
>
> Please advise.
>
> Regards,
> Naveen Kumar
>
>
> On Thu, Dec 7, 2023 at 9:24 AM Renjie Liu <liurenjie2...@gmail.com> wrote:
>
>> Also, the Iceberg catalog supports nested namespaces, so maybe we need to
>> consider a more general syntax than only database and table levels.
>>
>> On Thu, Dec 7, 2023 at 5:17 AM Russell Spitzer <russell.spit...@gmail.com>
>> wrote:
>>
>>> I just think this is a bit more complicated than I want to take into the
>>> main library just because we have to make decisions about
>>>
>>> 1. Retries
>>> 2. Concurrency
>>> 3. Results/Error Reporting
>>>
>>> But if we have a good proposal for how we will handle all those, I think
>>> we could do it.
>>>
>>> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi <acampolon...@gmail.com>
>>> wrote:
>>>
>>> I think that if you call an expire snapshots function, this is exactly
>>> what you want.
>>>
>>> On Wed, Dec 6, 2023 at 18:47 Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> My concern with the per-catalog approach is that people might
>>>> accidentally run it. Do you think it's clear enough that these
>>>> invocations will drop older snapshots?
>>>>
>>>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi <
>>>> acampolon...@gmail.com> wrote:
>>>>
>>>>> I like this approach. +1
>>>>>
>>>>> On 6 Dec 2023, at 11:37, naveen <nk1...@gmail.com> wrote:
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> Currently, Spark procedures support *expire_snapshots/remove_orphan_files*
>>>>> per table.
>>>>>
>>>>> Today, if someone has to run GC on an entire catalog, they have to
>>>>> run these procedures manually for every table.
>>>>>
>>>>> Is it a good idea to do it in bulk, per catalog or with multiple
>>>>> tables?
>>>>>
>>>>> Current syntax:
>>>>>
>>>>> CALL hive_prod.system.expire_snapshots(table => 'db.sample', <Options>)
>>>>>
>>>>> Proposed syntax, something similar to:
>>>>>
>>>>> Per Namespace/Database
>>>>>
>>>>> CALL hive_prod.system.expire_snapshots(database => 'db', <Options>)
>>>>>
>>>>> Per Catalog
>>>>>
>>>>> CALL hive_prod.system.expire_snapshots(<Options>)
>>>>>
>>>>> Multiple Tables
>>>>>
>>>>> CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1',
>>>>> 'db2.table2'), <Options>)
>>>>>
>>>>> PS: There could be exceptions for individual catalogs. For example, Nessie
>>>>> doesn't support GC other than via the Nessie CLI, and Hadoop can't list
>>>>> all the namespaces.
>>>>>
>>>>> Regards,
>>>>> Naveen Kumar
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
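For context, covering a whole catalog today means issuing the existing per-table call once for every table (the table names below are made up), which is exactly the loop the proposed per-database and per-catalog forms would fold into a single invocation:

CALL hive_prod.system.expire_snapshots(table => 'db1.table1', <Options>)
CALL hive_prod.system.expire_snapshots(table => 'db1.table2', <Options>)
CALL hive_prod.system.expire_snapshots(table => 'db2.table1', <Options>)
...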