Hi,

I think it makes sense to have a procedure in Spark for that. My point was about the catalog as the long-term solution.
So short term, +1 for a Spark procedure. Long term, we should not forget the catalog (especially for engine interoperability).

Thanks!

Regards
JB

On Mon, Jul 7, 2025 at 09:31, Gábor Kaszab <gaborkas...@gmail.com> wrote:
> Thanks for the response, JB!
>
> This could be a responsibility of the catalog and in turn a TMS, I agree. However, that seems more of a mid/long-term solution, while the Spark expire_snapshots procedure is already there, and the Java core implementation to clean expired specs and schemas is already there within the RemoveSnapshots API; we just have to connect the dots by exposing a boolean flag through the procedure (same for Flink).
> We could still expect many users/vendors, in my opinion, to keep using Spark procedures for table maintenance for a long time, and this low-risk change could help them out. There seemed to be other people sharing this thinking and being interested in this change, hence I gave this conversation another go.
>
> LMK WDYT!
>
> Regards,
> Gabor Kaszab
>
> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Jul 3, 2025, 9:57):
>> Hi Gabor,
>>
>> I would consider cleanExpiredMetadata as a table maintenance procedure. So, I agree that it should be managed by a catalog (as part of catalog policies and TMS). I'm not against switching the cleanExpiredMetadata flag to true and letting the query engine and the catalog deal with that.
>>
>> Regards
>> JB
>>
>> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org> wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > It's been a while since the last activity on this thread, but let me bump this conversation because there were people showing some interest in having a way to switch `cleanExpiredMetadata` through procedures (Manu, Peter, Pucheng).
>> > I understand the long-term goal is to delegate such functionality to catalogs instead, but could we reconsider this addition for the shorter term?
>> >
>> > Regards,
>> > Gabor Kaszab
>> >
>> > Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, May 12, 2025, 16:14):
>> >>
>> >> Thanks all for the discussion. I also agree that we should keep this behavior turned off by default. And I would also love to see this flag added to the Spark/Flink procedures. I think having this feature available on the client side seems more achievable in the short run, and designing a server-side solution might take more time (e.g. spec change, vendor implementation, etc.).
>> >>
>> >> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>> >>>
>> >>> Thanks for the responses!
>> >>>
>> >>> My concern is the same, Manu, Peter: many stakeholders in this community don't have a catalog that is capable of executing table maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions for this purpose. I feel that we should still give them the new functionality to clean expired metadata (specs, schemas) by extending the Spark and Flink procedures.
>> >>>
>> >>> Regards,
>> >>> Gabor
>> >>>
>> >>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> >>>>
>> >>>> I know of several companies that are using either scheduled stored procedures or the existing actions to maintain production tables. I don't think we should deprecate them until there is a viable open solution for them.
>> >>>>
>> >>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025, 17:52):
>> >>>>>
>> >>>>> I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
>> >>>>>
>> >>>>> If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?
>> >>>>>
>> >>>>> Gabor Kaszab <gaborkas...@apache.org> wrote (on Thu, Mar 20, 2025, 00:07):
>> >>>>>>
>> >>>>>> Thanks for the responses so far!
>> >>>>>>
>> >>>>>> Sure, keeping the default as false makes sense because this is a new feature, so let's be on the safe side.
>> >>>>>>
>> >>>>>> About exposing a way to set the flag in the Spark action/procedure and also via Flink:
>> >>>>>> I believe there are currently a number of vendors that don't have a catalog capable of performing table maintenance. We, for instance, advise our users to use Spark procedures for table maintenance. Hence, it would come in quite handy for us to also have a way to control the functionality behind the 'cleanExpiredMetadata' flag through the expire_snapshots procedure. Since the functionality is already there in the Java ExpireSnapshots API, this seems like a low-effort exercise.
>> >>>>>> I'd like to avoid telling users to call the Java API directly, but if extending the procedure is not an option, and the used catalog implementation doesn't support this either, I don't see what other possibilities we have here.
>> >>>>>> Taking these into consideration, would it be possible to extend Spark and Flink with support for setting this flag?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Gabor
>> >>>>>>
>> >>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> I don't think it is necessary to either make cleanup the default or to expose the flag in Spark or other engines.
>> >>>>>>>
>> >>>>>>> Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks that catalogs handle rather than having clients run them, but right now we have systems for keeping tables clean (i.e. expiring snapshots) that are built using clients rather than being controlled through catalogs. That's not a problem and we want to continue to support them, but I also don't think that we should make the problem worse. I think we should consider schema and partition spec cleanup to be catalog service tasks, so we should not spend much effort to make them easily available to users. And we should not make them the default behavior because we don't want clients removing these manually and duplicating work on the client and in REST services.
>> >>>>>>>
>> >>>>>>> Ryan
>> >>>>>>>
>> >>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Gabor,
>> >>>>>>>>
>> >>>>>>>> I think the question is "when". As it's a behavior change, I don't think we should do that in a "minor" release, else users would be "surprised".
>> >>>>>>>> I would propose to keep the current behavior in Iceberg Java 1.x and change the flag to true in Iceberg Java 2.x (after a vote).
>> >>>>>>>>
>> >>>>>>>> Regards
>> >>>>>>>> JB
>> >>>>>>>>
>> >>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
>> >>>>>>>> >
>> >>>>>>>> > Hi Iceberg Community,
>> >>>>>>>> >
>> >>>>>>>> > There were recent additions to RemoveSnapshots to expire the unused partition specs and schemas. This is controlled by a flag called 'cleanExpiredMetadata' and has a default value of 'false'. Additionally, Spark and Flink don't offer a way to set this flag currently.
>> >>>>>>>> >
>> >>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>> >>>>>>>> > I'm wondering if it's desired by the community to default this flag to true. The effect of that would be that each snapshot expiration would also clean up the unused partition specs and schemas. This functionality is quite new, so it might need some extra confidence from the community before being turned on by default, but I think it's worth considering.
>> >>>>>>>> >
>> >>>>>>>> > 2) Spark and Flink to support setting this flag
>> >>>>>>>> > I think it makes sense to add support in Spark's ExpireSnapshotProcedure and ExpireSnapshotsSparkAction, and also in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag based on (user) inputs.
>> >>>>>>>> >
>> >>>>>>>> > WDYT?
>> >>>>>>>> >
>> >>>>>>>> > Regards,
>> >>>>>>>> > Gabor
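For readers following along, here is a minimal sketch of what the existing Java core API call looks like today. It assumes the cleanExpiredMetadata(boolean) option on the ExpireSnapshots interface described in the thread above; the 7-day cutoff, the retainLast value, and the way the Table is obtained are placeholders, not recommendations.

import java.util.concurrent.TimeUnit;

import org.apache.iceberg.Table;

public class ExpireSnapshotsExample {

  // Expires snapshots older than 7 days and, because cleanExpiredMetadata is
  // enabled, also removes partition specs and schemas that are no longer
  // referenced by any remaining snapshot.
  public static void expire(Table table) {
    long olderThanMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);

    table.expireSnapshots()
        .expireOlderThan(olderThanMillis)
        .retainLast(1)
        // The flag discussed in this thread; its default is false.
        .cleanExpiredMetadata(true)
        .commit();
  }
}

Proposal 2) above would expose the same switch through the engines, for example as a clean_expired_metadata argument on Spark's expire_snapshots procedure and a matching option in Flink; that parameter name is illustrative only, since adding the procedure support is exactly what this thread proposes.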