Just an FYI, here is the PR <https://github.com/apache/iceberg/pull/13509> for the Spark procedure. It's only for Spark 4.0 as of now; I'll backport it to the other Spark versions once this is finalized.
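
In case anyone wants to try it once it lands, here is a rough sketch of how the call could look from Spark. This is only a sketch: the parameter name (clean_expired_metadata) and the catalog/table names are placeholders and may differ from what is finally merged.

    // Rough sketch only: the procedure parameter name (clean_expired_metadata) is an
    // assumption based on the PR and may still change; catalog/table names are placeholders.
    import org.apache.spark.sql.SparkSession;

    public class ExpireSnapshotsProcedureSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("expire-snapshots-sketch")
            .getOrCreate();

        // Expire snapshots older than the given timestamp and, via the new flag,
        // also ask for unused partition specs and schemas to be removed.
        spark.sql(
            "CALL my_catalog.system.expire_snapshots("
                + "table => 'db.sample', "
                + "older_than => TIMESTAMP '2025-06-01 00:00:00', "
                + "clean_expired_metadata => true)");
      }
    }

(There's also a short sketch of the underlying Java ExpireSnapshots API at the bottom of this mail, below the quoted thread, since it came up a few times in the discussion.)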
Thanks again!
Gabor

Gábor Kaszab <gaborkas...@apache.org> wrote (on Tue, Jul 8, 2025 at 15:57):

> Thank you all for taking a look and sharing your opinions!
>
> It seems we have consensus to extend the Spark procedure with a parameter to control this functionality. Let me prepare a PR for this and get back to you. Also, I'll take a look at Flink usage too.
>
> Regards,
> Gabor Kaszab
>
> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Tue, Jul 8, 2025 at 11:16):
>
>> Hi
>>
>> I think it makes sense to have a procedure in Spark for that. My point was about the catalog as the long-term solution.
>>
>> So short term, +1 for a Spark procedure. Long term, we should not forget the catalog (especially for engine interoperability).
>>
>> Thanks!
>>
>> Regards
>> JB
>>
>> On Mon, Jul 7, 2025 at 09:31, Gábor Kaszab <gaborkas...@gmail.com> wrote:
>>
>>> Thanks for the response, JB!
>>>
>>> This could be a responsibility of the catalog and in turn a TMS, I agree. However, that seems more a mid-/long-term solution, while the Spark expire_snapshots procedure is already there and the Java core implementation to clean expired specs and schemas is already there within the RemoveSnapshots API; we just have to connect the dots by exposing a boolean flag through the procedure (same for Flink).
>>> In my opinion, we can still expect many users/vendors to keep using Spark procedures for table maintenance for a long time, and this low-risk change could help them out. There seemed to be other people sharing this thinking and being interested in this change, hence I gave this conversation another go.
>>>
>>> LMK WDYT!
>>>
>>> Regards,
>>> Gabor Kaszab
>>>
>>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Jul 3, 2025 at 9:57):
>>>
>>>> Hi Gabor
>>>>
>>>> I would consider cleanExpiredMetadata a table maintenance procedure. So, I agree that it should be managed by a catalog (as part of catalog policies and TMS). I'm not against switching the cleanExpiredMetadata flag to true and letting the query engine and the catalog deal with that.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>> >
>>>> > Hi Iceberg Community,
>>>> >
>>>> > It's been a while since the last activity on this thread, but let me bump this conversation because there were people showing interest in having a way to switch `cleanExpiredMetadata` through procedures (Manu, Peter, Pucheng).
>>>> > I understand the long-term goal is to delegate such functionality to catalogs instead, but could we reconsider this addition for the shorter term?
>>>> >
>>>> > Regards,
>>>> > Gabor Kaszab
>>>> >
>>>> > Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, May 12, 2025 at 16:14):
>>>> >>
>>>> >> Thanks all for the discussion. I also agree that we should keep this behavior turned off by default. And I would also love to see this flag added to the Spark/Flink procedures. I think having this feature available on the client side seems more achievable in the short run, while designing a server-side solution might take more time (i.e. spec change, vendor implementation, etc.).
>>>> >>
>>>> >> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>> >>>
>>>> >>> Thanks for the responses!
>>>> >>>
>>>> >>> My concern is the same, Manu, Peter: many stakeholders in this community don't have a catalog that is capable of executing table maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions for this purpose. I feel that we should still give them the new functionality to clean expired metadata (specs, schemas) by extending the Spark and Flink procedures.
>>>> >>>
>>>> >>> Regards,
>>>> >>> Gabor
>>>> >>>
>>>> >>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> I know of several companies who are using either scheduled stored procedures or the existing actions to maintain production tables.
>>>> >>>> I don't think we should deprecate them until there is a viable open solution for them.
>>>> >>>>
>>>> >>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025 at 17:52):
>>>> >>>>>
>>>> >>>>> I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
>>>> >>>>>
>>>> >>>>> If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?
>>>> >>>>>
>>>> >>>>> Gabor Kaszab <gaborkas...@apache.org> wrote (on Thu, Mar 20, 2025 at 00:07):
>>>> >>>>>>
>>>> >>>>>> Thanks for the responses so far!
>>>> >>>>>>
>>>> >>>>>> Sure, keeping the default as false makes sense because this is a new feature, so let's be on the safe side.
>>>> >>>>>>
>>>> >>>>>> About exposing setting the flag in the Spark action/procedure and also via Flink:
>>>> >>>>>> I believe there are currently a number of vendors that don't have a catalog that is capable of performing table maintenance. We, for instance, advise our users to use Spark procedures for table maintenance. Hence, it would come in quite handy for us to also have a way to control the functionality behind the 'cleanExpiredMetadata' flag through the expire_snapshots procedure. Since the functionality is already there in the Java ExpireSnapshots API, this seems a low-effort exercise.
>>>> >>>>>> I'd like to avoid telling users to call the Java API directly, but if extending the procedure is not an option and the catalog implementation in use doesn't support this either, I don't see what other possibilities we have here.
>>>> >>>>>> Taking these into consideration, would it be possible to extend the Spark and Flink procedures with support for setting this flag?
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Gabor
>>>> >>>>>>
>>>> >>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I don't think it is necessary either to make cleanup the default or to expose the flag in Spark or other engines.
>>>> >>>>>>>
>>>> >>>>>>> Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks that catalogs handle rather than having clients run them, but right now we have systems for keeping tables clean (i.e. expiring snapshots) that are built using clients rather than being controlled through catalogs. That's not a problem and we want to continue to support them, but I also don't think that we should make the problem worse. I think we should consider schema and partition spec cleanup to be catalog service tasks, so we should not spend much effort to make them easily available to users. And we should not make them the default behavior because we don't want clients removing these manually and duplicating work on the client and in REST services.
>>>> >>>>>>>
>>>> >>>>>>> Ryan
>>>> >>>>>>>
>>>> >>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi Gabor
>>>> >>>>>>>>
>>>> >>>>>>>> I think the question is "when". As it's a behavior change, I don't think we should do that in a "minor" release, or else users would be "surprised".
>>>> >>>>>>>> I would propose keeping the current behavior in Iceberg Java 1.x and changing the flag to true in Iceberg Java 2.x (after a vote).
>>>> >>>>>>>>
>>>> >>>>>>>> Regards
>>>> >>>>>>>> JB
>>>> >>>>>>>>
>>>> >>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>> >>>>>>>> >
>>>> >>>>>>>> > Hi Iceberg Community,
>>>> >>>>>>>> >
>>>> >>>>>>>> > There were recent additions to RemoveSnapshots to expire unused partition specs and schemas. This is controlled by a flag called 'cleanExpiredMetadata' with a default value of 'false'. Additionally, Spark and Flink don't currently offer a way to set this flag.
>>>> >>>>>>>> >
>>>> >>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>> >>>>>>>> > I'm wondering whether the community would like to default this flag to true. The effect would be that each snapshot expiration also cleans up the unused partition specs and schemas. This functionality is quite new, so it might need some extra confidence from the community before being turned on by default, but I think it's worth considering.
>>>> >>>>>>>> >
>>>> >>>>>>>> > 2) Spark and Flink to support setting this flag
>>>> >>>>>>>> > I think it makes sense to add support in Spark's ExpireSnapshotProcedure and ExpireSnapshotsSparkAction, and also in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag based on (user) input.
>>>> >>>>>>>> >
>>>> >>>>>>>> > WDYT?
>>>> >>>>>>>> >
>>>> >>>>>>>> > Regards,
>>>> >>>>>>>> > Gabor
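
For reference, the existing core Java API route mentioned a few times in the thread above looks roughly like this. It's a minimal sketch, assuming an Iceberg version where the cleanExpiredMetadata flag is already available on the ExpireSnapshots API, and using placeholder catalog/table identifiers:

    // Minimal sketch of the core ExpireSnapshots route; table loading is catalog-specific
    // and the "db.sample" identifier below is just a placeholder.
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    public class ExpireWithMetadataCleanupSketch {
      public static void expire(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "sample"));

        // Keep roughly the last 7 days of snapshots.
        long cutoffMillis = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;

        table.expireSnapshots()
            .expireOlderThan(cutoffMillis)
            .cleanExpiredMetadata(true) // also drop partition specs/schemas that are no longer referenced
            .commit();
      }
    }

The Spark procedure change in the PR above is essentially about exposing that same flag so users don't have to call this API directly.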