Hi Iceberg Community,

It's been a while since the last activity on this thread, but let me bump this conversation because several people (Manu, Peter, Pucheng) showed interest in exposing the `cleanExpiredMetadata` flag through the Spark/Flink procedures. I understand the long-term goal is to delegate such functionality to catalogs instead, but could we reconsider this addition for the shorter term?
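For reference, here is a minimal sketch of what setting the flag looks like today through the Java ExpireSnapshots API (the catalog handle, table identifier, and the 7-day retention window below are just placeholders for illustration):

```java
import java.util.concurrent.TimeUnit;

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

public class ExpireWithMetadataCleanup {
  public static void expire(Catalog catalog) {
    // Placeholder identifier; any Iceberg table works the same way.
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

    table.expireSnapshots()
        // Keep the last 7 days of snapshots (arbitrary retention for this example).
        .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
        // The flag discussed in this thread: also drop partition specs and
        // schemas that are no longer referenced after expiration.
        .cleanExpiredMetadata(true)
        .commit();
  }
}
```

The ask is essentially to surface the same boolean through Spark's expire_snapshots procedure / ExpireSnapshotsSparkAction and Flink's expire-snapshots support; the exact parameter name and shape would of course be up to whoever picks it up.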
Regards,
Gabor Kaszab

Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, 12 May 2025, 16:14):

> Thanks all for the discussion. I also agree that we should make this
> behavior turned off by default. And I would also love to see this flag be
> added to the Spark/ Flink procedure. I think having this feature
> available on the client side seems more achievable in the short run and
> designing a server side solution might take more time (i.e. spec change,
> vendor implementation etc).
>
> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Thanks for the responses!
>>
>> My concern is the same, Manu, Peter: many stakeholders in this community
>> don't have a catalog that is capable of executing table maintenance (e.g.
>> HiveCatalog) and rely on the Spark procedures and actions for this purpose.
>> I feel that we should still give them the new functionality to clean
>> expired metadata (specs, schemas) by extending the Spark and Flink
>> procedures.
>>
>> Regards,
>> Gabor
>>
>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I know of several companies who are using either scheduled stored
>>> procedures or the existing actions to maintain production tables.
>>> I don't think we should deprecate them until there is a viable open
>>> solution for them.
>>>
>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, 19 Mar 2025, 17:52):
>>>
>>>> I think a catalog service can also use Spark/Flink procedures for table
>>>> maintenance, to utilize existing systems and cluster resources.
>>>>
>>>> If we no longer support new functionality in Spark/Flink procedures, we
>>>> are effectively deprecating them, right?
>>>>
>>>> Gabor Kaszab <gaborkas...@apache.org> wrote (on Thu, 20 Mar 2025, 00:07):
>>>>
>>>>> Thanks for the responses so far!
>>>>>
>>>>> Sure, keeping the default as false makes sense because this is a new
>>>>> feature, so let's be on the safe side.
>>>>>
>>>>> About exposing setting the flag in the Spark action/procedure and also
>>>>> via Flink:
>>>>> I believe currently there are a number of vendors that don't have a
>>>>> catalog that is capable of performing table maintenance. We for instance
>>>>> advise our users to use spark procedures for table maintenance. Hence, it
>>>>> would come quite handy for us to also have a way to control the
>>>>> functionality behind the 'cleanExpiredMetadata' flag through the
>>>>> expire_snapshots procedure. Since the functionality is already there in the
>>>>> Java ExpireSnapshots API, this seems a low effort exercise.
>>>>> I'd like to avoid telling the users to call the Java API directly, but
>>>>> if extending the procedure is not an option, and also the used catalog
>>>>> implementation doesn't give support for this, I don't see what other
>>>>> possibilities we have here.
>>>>> Taking these into consideration, would it be possible to allow
>>>>> extending the Spark and Flink with the support of setting this flag?
>>>>>
>>>>> Thanks,
>>>>> Gabor
>>>>>
>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> I don't think it is necessary to either make cleanup the default or
>>>>>> to expose the flag in Spark or other engines.
>>>>>>
>>>>>> Right now, catalogs are taking on a lot more responsibility for
>>>>>> things like snapshot expiration, orphan file cleanup, and schema or
>>>>>> partition spec removal. Ideally, those are tasks that catalogs handle
>>>>>> rather than having clients run them, but right now we have systems for
>>>>>> keeping tables clean (i.e. expiring snapshots) that are built using clients
>>>>>> rather than being controlled through catalogs. That's not a problem and we
>>>>>> want to continue to support them, but I also don't think that we should
>>>>>> make the problem worse. I think we should consider schema and partition
>>>>>> spec cleanup to be catalog service tasks, so we should not spend much
>>>>>> effort to make them easily available to users. And we should not make them
>>>>>> the default behavior because we don't want clients removing these manually
>>>>>> and duplicating work on the client and in REST services.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Gabor
>>>>>>>
>>>>>>> I think the question is "when". As it's a behavior change, I don't
>>>>>>> think we should do that on a "minor" release, else users would be
>>>>>>> "surprised".
>>>>>>> I would propose to keep the current behavior on Iceberg Java 1.x and
>>>>>>> change the flag to true on Iceberg Java 2.x (after a vote).
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <
>>>>>>> gaborkas...@apache.org> wrote:
>>>>>>> >
>>>>>>> > Hi Iceberg Community,
>>>>>>> >
>>>>>>> > There were recent additions to RemoveSnapshots to expire the
>>>>>>> unused partition specs and schemas. This is controlled by a flag called
>>>>>>> 'cleanExpiredMetadata' and has a default value 'false'. Additionally, Spark
>>>>>>> and Flink don't offer a way to set this flag currently.
>>>>>>> >
>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>>>>> > I'm wondering if it's desired by the community to default this
>>>>>>> flag to true. The effect of that would be that each snapshot expiration
>>>>>>> would also clean up the unused partition specs and schemas too. This
>>>>>>> functionality is quite new so this might need some extra confidence by the
>>>>>>> community before turning on by default but I think it's worth a
>>>>>>> consideration.
>>>>>>> >
>>>>>>> > 2) Spark and Flink to support setting this flag
>>>>>>> > I think it makes sense to add support in Spark's
>>>>>>> ExpireSnapshotProcedure and ExpireSnapshotsSparkAction also to Flink's
>>>>>>> ExpireSnapshotsProcessor and ExpireSnapshots to allow setting this flag
>>>>>>> based on (user) inputs.
>>>>>>> >
>>>>>>> > WDYT?
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Gabor
>>>>>>>
>>>>>>