Thanks for the response, JB! This could be a responsibility of the catalog and in turn a TMS, I agree. However, that seems more like a mid/long-term solution, while the Spark expire_snapshots procedure is already there and the Java core implementation to clean expired specs and schemas already exists in the RemoveSnapshots API; we just have to connect the dots by exposing a boolean flag through the procedure (same for Flink). In my opinion we can still expect many users/vendors to keep using Spark procedures for table maintenance for a long time, and this low-risk change could help them out. Other people seemed to share this thinking and be interested in this change, hence I'm giving this conversation another go.
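To make the "connect the dots" part concrete, here is a minimal sketch of what the core Java API already allows today (the 7-day cutoff and the helper method are purely illustrative; only the cleanExpiredMetadata flag is the point):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class ExpireWithMetadataCleanup {
      // Expire snapshots older than 7 days and, via the new flag, also remove
      // partition specs and schemas that are no longer referenced by the
      // remaining metadata. The flag currently defaults to false.
      static void expire(Table table) {
        table.expireSnapshots()
            .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
            .cleanExpiredMetadata(true)
            .commit();
      }
    }

What I'm proposing is only to forward an equivalent boolean from the Spark expire_snapshots procedure (and the Flink counterpart) down to this call, e.g. something like a clean_expired_metadata argument on the procedure; that parameter name is just a placeholder for illustration, nothing is exposed today.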
LMK WDYT!

Regards,
Gabor Kaszab

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Jul 3, 2025, 9:57):

> Hi Gabor
>
> I would consider cleanExpiredMetadata as a table maintenance procedure. So, I agree that it should be managed by a catalog (as part of catalog policies and TMS). I'm not against switching the cleanExpiredMetadata flag to true, and letting the query engine and the catalog deal with that.
>
> Regards
> JB
>
> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org> wrote:
> >
> > Hi Iceberg Community,
> >
> > It's been a while since the last activity on this thread but let me bump this conversation because there were people showing some interest in giving a way of switching `cleanExpiredMetadata` through procedures (Manu, Peter, Pucheng).
> > I understand the long term goal is to delegate such functionality to catalogs instead, but could we reconsider this addition for the shorter term?
> >
> > Regards,
> > Gabor Kaszab
> >
> > Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, May 12, 2025, 16:14):
> >>
> >> Thanks all for the discussion. I also agree that we should make this behavior turned off by default. And I would also love to see this flag be added to the Spark/Flink procedures. I think having this feature available on the client side seems more achievable in the short run, and designing a server side solution might take more time (i.e. spec change, vendor implementation etc).
> >>
> >> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>>
> >>> Thanks for the responses!
> >>>
> >>> My concern is the same, Manu, Peter: many stakeholders in this community don't have a catalog that is capable of executing table maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions for this purpose. I feel that we should still give them the new functionality to clean expired metadata (specs, schemas) by extending the Spark and Flink procedures.
> >>>
> >>> Regards,
> >>> Gabor
> >>>
> >>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> >>>>
> >>>> I know of several companies who are using either scheduled stored procedures or the existing actions to maintain production tables.
> >>>> I don't think we should deprecate them until there is a viable open solution for them.
> >>>>
> >>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025, 17:52):
> >>>>>
> >>>>> I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
> >>>>>
> >>>>> If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?
> >>>>>
> >>>>> Gabor Kaszab <gaborkas...@apache.org> wrote (on Thu, Mar 20, 2025, 00:07):
> >>>>>>
> >>>>>> Thanks for the responses so far!
> >>>>>>
> >>>>>> Sure, keeping the default as false makes sense because this is a new feature, so let's be on the safe side.
> >>>>>>
> >>>>>> About exposing setting the flag in the Spark action/procedure and also via Flink:
> >>>>>> I believe there are currently a number of vendors that don't have a catalog capable of performing table maintenance. We, for instance, advise our users to use Spark procedures for table maintenance. Hence, it would come in quite handy for us to also have a way to control the functionality behind the 'cleanExpiredMetadata' flag through the expire_snapshots procedure. Since the functionality is already there in the Java ExpireSnapshots API, this seems a low effort exercise.
> >>>>>> I'd like to avoid telling users to call the Java API directly, but if extending the procedure is not an option, and the used catalog implementation doesn't give support for this either, I don't see what other possibilities we have here.
> >>>>>> Taking these into consideration, would it be possible to allow extending the Spark and Flink procedures with support for setting this flag?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Gabor
> >>>>>>
> >>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I don't think it is necessary to either make cleanup the default or to expose the flag in Spark or other engines.
> >>>>>>>
> >>>>>>> Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks that catalogs handle rather than having clients run them, but right now we have systems for keeping tables clean (i.e. expiring snapshots) that are built using clients rather than being controlled through catalogs. That's not a problem and we want to continue to support them, but I also don't think that we should make the problem worse. I think we should consider schema and partition spec cleanup to be catalog service tasks, so we should not spend much effort to make them easily available to users. And we should not make them the default behavior because we don't want clients removing these manually and duplicating work on the client and in REST services.
> >>>>>>>
> >>>>>>> Ryan
> >>>>>>>
> >>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>>>>>>>
> >>>>>>>> Hi Gabor
> >>>>>>>>
> >>>>>>>> I think the question is "when". As it's a behavior change, I don't think we should do that in a "minor" release, else users would be "surprised".
> >>>>>>>> I would propose to keep the current behavior on Iceberg Java 1.x and change the flag to true on Iceberg Java 2.x (after a vote).
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> JB
> >>>>>>>>
> >>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
> >>>>>>>> >
> >>>>>>>> > Hi Iceberg Community,
> >>>>>>>> >
> >>>>>>>> > There were recent additions to RemoveSnapshots to expire the unused partition specs and schemas. This is controlled by a flag called 'cleanExpiredMetadata' and has a default value of 'false'. Additionally, Spark and Flink don't offer a way to set this flag currently.
> >>>>>>>> >
> >>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
> >>>>>>>> > I'm wondering if it's desired by the community to default this flag to true. The effect of that would be that each snapshot expiration would also clean up the unused partition specs and schemas. This functionality is quite new, so it might need some extra confidence from the community before being turned on by default, but I think it's worth a consideration.
> >>>>>>>> >
> >>>>>>>> > 2) Spark and Flink to support setting this flag
> >>>>>>>> > I think it makes sense to add support in Spark's ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, and also in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag based on (user) inputs.
> >>>>>>>> >
> >>>>>>>> > WDYT?
> >>>>>>>> >
> >>>>>>>> > Regards,
> >>>>>>>> > Gabor