I think it's reasonable to expose the options through the stored procedure. I just don't think we want to make it the default behavior.
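For reference, this is roughly what calling the existing core Java API looks like today (a minimal sketch; it assumes the cleanExpiredMetadata(boolean) setter on ExpireSnapshots that this thread describes, and `table` is an already-loaded Iceberg Table; the retention values are illustrative only):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    // Minimal sketch: expire old snapshots and also drop partition specs/schemas
    // that are no longer referenced by any remaining snapshot.
    void expireWithMetadataCleanup(Table table) {
        long olderThan = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7); // illustrative retention
        table.expireSnapshots()
            .expireOlderThan(olderThan)
            .cleanExpiredMetadata(true) // the flag discussed in this thread; defaults to false
            .commit();
    }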
On Mon, Jul 7, 2025 at 8:37 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> I'm not seeing how the Spark procedure contradicts the catalog solution. Catalogs can make decisions based on policies and pass down parameters to Spark procedures to execute. In addition, it can be used by all catalogs and table maintenance systems.
>
> Regards,
> Manu
>
> Gábor Kaszab <gaborkas...@gmail.com> wrote on Mon, Jul 7, 2025 at 21:31:
>
>> Thanks for the response, JB!
>>
>> This could be a responsibility of the catalog and in turn a TMS, I agree. However, that seems more of a mid-/long-term solution. The Spark expire_snapshots procedure is already there, and the Java core implementation to clean expired specs and schemas is already there within the RemoveSnapshots API; we just have to connect the dots by exposing a boolean flag through the procedure (same for Flink).
>> In my opinion we can still expect many users/vendors to keep using Spark procedures for table maintenance for a long time, and this low-risk change could help them out. Other people seemed to share this thinking and be interested in this change, hence I gave this conversation another go.
>>
>> LMK WDYT!
>>
>> Regards,
>> Gabor Kaszab
>>
>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote on Thu, Jul 3, 2025 at 9:57:
>>
>>> Hi Gabor
>>>
>>> I would consider cleanExpiredMetadata a table maintenance procedure. So, I agree that it should be managed by a catalog (as part of catalog policies and TMS). I'm not against switching the cleanExpiredMetadata flag to true and letting the query engine and the catalog deal with that.
>>>
>>> Regards
>>> JB
>>>
>>> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>
>>>> Hi Iceberg Community,
>>>>
>>>> It's been a while since the last activity on this thread, but let me bump this conversation because several people showed interest in having a way to switch `cleanExpiredMetadata` through procedures (Manu, Peter, Pucheng).
>>>> I understand the long-term goal is to delegate such functionality to catalogs instead, but could we reconsider this addition for the shorter term?
>>>>
>>>> Regards,
>>>> Gabor Kaszab
>>>>
>>>> Pucheng Yang <py...@pinterest.com.invalid> wrote on Mon, May 12, 2025 at 16:14:
>>>>
>>>>> Thanks all for the discussion. I also agree that this behavior should be turned off by default, and I would also love to see this flag added to the Spark/Flink procedures. Having this feature available on the client side seems more achievable in the short run, while designing a server-side solution might take more time (i.e. spec change, vendor implementation, etc.).
>>>>>
>>>>> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the responses!
>>>>>>
>>>>>> My concern is the same, Manu, Peter: many stakeholders in this community don't have a catalog that is capable of executing table maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions for this purpose. I feel that we should still give them the new functionality to clean expired metadata (specs, schemas) by extending the Spark and Flink procedures.
>>>>>>
>>>>>> Regards,
>>>>>> Gabor
>>>>>>
>>>>>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> I know of several companies that are using either scheduled stored procedures or the existing actions to maintain production tables. I don't think we should deprecate them until there is a viable open solution for them.
>>>>>>>
>>>>>>> Manu Zhang <owenzhang1...@gmail.com> wrote on Wed, Mar 19, 2025 at 17:52:
>>>>>>>
>>>>>>>> I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
>>>>>>>>
>>>>>>>> If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?
>>>>>>>>
>>>>>>>> Gabor Kaszab <gaborkas...@apache.org> wrote on Thu, Mar 20, 2025 at 00:07:
>>>>>>>>
>>>>>>>>> Thanks for the responses so far!
>>>>>>>>>
>>>>>>>>> Sure, keeping the default as false makes sense because this is a new feature, so let's be on the safe side.
>>>>>>>>>
>>>>>>>>> About exposing a way to set the flag in the Spark action/procedure and also via Flink:
>>>>>>>>> I believe there are currently a number of vendors that don't have a catalog capable of performing table maintenance. We, for instance, advise our users to use Spark procedures for table maintenance. Hence, it would come in quite handy for us to also have a way to control the functionality behind the 'cleanExpiredMetadata' flag through the expire_snapshots procedure. Since the functionality is already there in the Java ExpireSnapshots API, this seems a low-effort exercise.
>>>>>>>>> I'd like to avoid telling users to call the Java API directly, but if extending the procedure is not an option and the catalog implementation in use doesn't support this either, I don't see what other possibilities we have here.
>>>>>>>>> Taking these into consideration, would it be possible to extend the Spark and Flink procedures with support for setting this flag?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Gabor
>>>>>>>>>
>>>>>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think it is necessary either to make cleanup the default or to expose the flag in Spark or other engines.
>>>>>>>>>>
>>>>>>>>>> Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks that catalogs handle rather than having clients run them, but right now we have systems for keeping tables clean (i.e. expiring snapshots) that are built using clients rather than being controlled through catalogs. That's not a problem and we want to continue to support them, but I also don't think that we should make the problem worse. I think we should consider schema and partition spec cleanup to be catalog service tasks, so we should not spend much effort to make them easily available to users. And we should not make them the default behavior because we don't want clients removing these manually and duplicating work on the client and in REST services.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Gabor
>>>>>>>>>>>
>>>>>>>>>>> I think the question is "when". As it's a behavior change, I don't think we should do it in a "minor" release, else users would be "surprised". I would propose to keep the current behavior on Iceberg Java 1.x and change the flag to true on Iceberg Java 2.x (after a vote).
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Iceberg Community,
>>>>>>>>>>>>
>>>>>>>>>>>> There were recent additions to RemoveSnapshots to expire unused partition specs and schemas. This is controlled by a flag called 'cleanExpiredMetadata' with a default value of 'false'. Additionally, Spark and Flink currently don't offer a way to set this flag.
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>>>>>>>>>> I'm wondering whether the community wants to default this flag to true. The effect would be that each snapshot expiration also cleans up the unused partition specs and schemas. This functionality is quite new, so it might need some extra confidence from the community before being turned on by default, but I think it's worth considering.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Spark and Flink to support setting this flag
>>>>>>>>>>>> I think it makes sense to add support in Spark's ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, as well as in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag based on (user) input.
>>>>>>>>>>>>
>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Gabor
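
To make point 2) above concrete, here is a rough, hypothetical sketch of how the flag could be surfaced on the Spark action side; the cleanExpiredMetadata(...) setter on ExpireSnapshotsSparkAction does not exist yet and its name here is only an assumption, and the matching expire_snapshots procedure would presumably accept an equivalent boolean argument that is forwarded to the same core flag:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    // Hypothetical: cleanExpiredMetadata(...) is NOT part of ExpireSnapshotsSparkAction
    // today; it is the setter this proposal would add, forwarding to the core
    // RemoveSnapshots flag. `spark` and `table` are assumed to be provided by the caller.
    void expireViaSparkAction(SparkSession spark, Table table, long olderThanMillis) {
        SparkActions.get(spark)
            .expireSnapshots(table)
            .expireOlderThan(olderThanMillis)
            .cleanExpiredMetadata(true) // proposed option, name assumed
            .execute();
    }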