Hi

I think it makes sense to have a procedure in Spark for that. My point was
about the long-term catalog solution.

So short term, +1 for a Spark procedure. Long term, we should not forget
the catalog (especially for engine interoperability).

Thanks!

Regards
JB

On Mon, Jul 7, 2025 at 09:31, Gábor Kaszab <gaborkas...@gmail.com> wrote:

> Thanks for the response, JB!
>
> This could be a responsibility of the catalog and, in turn, a TMS, I agree.
> However, that seems more like a mid/long-term solution, while the Spark
> expire_snapshots procedure is already there, and the Java core implementation
> to clean expired specs and schemas already exists within the RemoveSnapshots
> API. We just have to connect the dots by exposing a boolean flag through
> the procedure (same for Flink).
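> To illustrate, a rough sketch of what the call could look like (the
> clean_expired_metadata argument is the hypothetical new piece; the other
> parameters of the existing expire_snapshots procedure are real, and 'spark'
> here is assumed to be an existing SparkSession):
>
> // clean_expired_metadata below is the proposed, not-yet-existing argument
> spark.sql(
>     "CALL spark_catalog.system.expire_snapshots("
>         + "table => 'db.sample', "
>         + "older_than => TIMESTAMP '2025-06-01 00:00:00', "
>         + "clean_expired_metadata => true)");
>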
> In my opinion, we can still expect many users/vendors to keep using Spark
> procedures for table maintenance for a long time, and this low-risk change
> could help them out. There seemed to be other people sharing this view
> and interested in this change, hence I gave this conversation another
> go.
>
> LMK WDYT!
>
> Regards,
> Gabor Kaszab
>
> Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Jul 3, 2025, at
> 9:57):
>
>> Hi Gabor
>>
>> I would consider cleanExpiredMetadata a table maintenance procedure.
>> So, I agree that it should be managed by a catalog (as part of catalog
>> policies and TMS). I'm not against switching the cleanExpiredMetadata
>> flag to true and letting the query engine and the catalog deal with that.
>>
>> Regards
>> JB
>>
>> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org>
>> wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > It's been a while since the last activity on this thread, but let me
>> bump this conversation because several people showed interest in having
>> a way to set `cleanExpiredMetadata` through procedures (Manu,
>> Peter, Pucheng).
>> > I understand the long-term goal is to delegate such functionality to
>> catalogs instead, but could we reconsider this addition for the shorter
>> term?
>> >
>> > Regards,
>> > Gabor Kaszab
>> >
>> > Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, May 12, 2025,
>> at 16:14):
>> >>
>> >> Thanks all for the discussion. I also agree that this behavior should
>> be turned off by default. And I would also love to see this flag
>> added to the Spark/Flink procedures. Having this feature available
>> on the client side seems more achievable in the short run, while designing a
>> server-side solution might take more time (i.e. spec changes, vendor
>> implementations, etc.).
>> >>
>> >> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>> >>>
>> >>> Thanks for the responses!
>> >>>
>> >>> My concern is the same, Manu, Peter: many stakeholders in this
>> community don't have a catalog that is capable of executing table
>> maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions
>> for this purpose. I feel that we should still give them the new
>> functionality to clean expired metadata (specs, schemas) by extending the
>> Spark and Flink procedures.
>> >>>
>> >>> Regards,
>> >>> Gabor
>> >>>
>> >>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <
>> peter.vary.apa...@gmail.com> wrote:
>> >>>>
>> >>>> I know of several companies who are using either scheduled stored
>> procedures or the existing actions to maintain production tables.
>> >>>> I don't think we should deprecate them until there is a viable open
>> solution for them.
>> >>>>
>> >>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025, at
>> 17:52):
>> >>>>>
>> >>>>> I think a catalog service can also use Spark/Flink procedures for
>> table maintenance, to utilize existing systems and cluster resources.
>> >>>>>
>> >>>>> If we no longer support new functionality in Spark/Flink
>> procedures, we are effectively deprecating them, right?
>> >>>>>
>> >>>>>> Gabor Kaszab <gaborkas...@apache.org> wrote (on Thu, Mar 20, 2025, at 00:07):
>> >>>>>>
>> >>>>>> Thanks for the responses so far!
>> >>>>>>
>> >>>>>> Sure, keeping the default as false makes sense because this is a
>> new feature, so let's be on the safe side.
>> >>>>>>
>> >>>>>> About exposing the flag in the Spark action/procedure and
>> also via Flink:
>> >>>>>> I believe there are currently a number of vendors that don't have
>> a catalog capable of performing table maintenance. We, for instance,
>> advise our users to use Spark procedures for table maintenance. Hence, it
>> would come in quite handy for us to also have a way to control the
>> functionality behind the 'cleanExpiredMetadata' flag through the
>> expire_snapshots procedure. Since the functionality is already there in the
>> Java ExpireSnapshots API, this seems a low-effort exercise.
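>> >>>>>> For reference, a minimal sketch of what is already possible through
>> the core Java API (assuming 'table' is an already-loaded Iceberg Table and
>> the builder exposes the flag as cleanExpiredMetadata(boolean), as referenced
>> above):
>> >>>>>>
>> import java.util.concurrent.TimeUnit;
>>
>> // Expire snapshots older than 7 days, keep at least the 5 most recent, and
>> // also remove partition specs and schemas that are no longer referenced.
>> table.expireSnapshots()
>>     .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
>>     .retainLast(5)
>>     .cleanExpiredMetadata(true)
>>     .commit();
>> >>>>>>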
>> >>>>>> I'd like to avoid telling users to call the Java API directly,
>> but if extending the procedure is not an option, and the catalog
>> implementation in use doesn't support this either, I don't see what other
>> possibilities we have here.
>> >>>>>> Taking these into consideration, would it be possible to extend
>> the Spark and Flink procedures with support for setting this flag?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Gabor
>> >>>>>>
>> >>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> I don't think it is necessary either to make cleanup the default
>> or to expose the flag in Spark or other engines.
>> >>>>>>>
>> >>>>>>> Right now, catalogs are taking on a lot more responsibility for
>> things like snapshot expiration, orphan file cleanup, and schema or
>> partition spec removal. Ideally, those are tasks that catalogs handle
>> rather than having clients run them, but right now we have systems for
>> keeping tables clean (i.e. expiring snapshots) that are built using clients
>> rather than being controlled through catalogs. That's not a problem and we
>> want to continue to support them, but I also don't think that we should
>> make the problem worse. I think we should consider schema and partition
>> spec cleanup to be catalog service tasks, so we should not spend much
>> effort to make them easily available to users. And we should not make them
>> the default behavior because we don't want clients removing these manually
>> and duplicating work on the client and in REST services.
>> >>>>>>>
>> >>>>>>> Ryan
>> >>>>>>>
>> >>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <
>> j...@nanthrax.net> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Gabor
>> >>>>>>>>
>> >>>>>>>> I think the question is "when". As it's a behavior change, I don't
>> >>>>>>>> think we should do that in a "minor" release, else users would be
>> >>>>>>>> "surprised".
>> >>>>>>>> I would propose keeping the current behavior on Iceberg Java 1.x and
>> >>>>>>>> changing the flag to true on Iceberg Java 2.x (after a vote).
>> >>>>>>>>
>> >>>>>>>> Regards
>> >>>>>>>> JB
>> >>>>>>>>
>> >>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <
>> gaborkas...@apache.org> wrote:
>> >>>>>>>> >
>> >>>>>>>> > Hi Iceberg Community,
>> >>>>>>>> >
>> >>>>>>>> > There were recent additions to RemoveSnapshots to expire
>> unused partition specs and schemas. This is controlled by a flag called
>> 'cleanExpiredMetadata', which defaults to 'false'. Additionally, Spark
>> and Flink currently don't offer a way to set this flag.
>> >>>>>>>> >
>> >>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>> >>>>>>>> > I'm wondering whether the community would like to default this
>> flag to true. The effect would be that each snapshot expiration
>> would also clean up the unused partition specs and schemas. This
>> functionality is quite new, so it might need some extra confidence from the
>> community before being turned on by default, but I think it's worth
>> considering.
>> >>>>>>>> >
>> >>>>>>>> > 2) Spark and Flink to support setting this flag
>> >>>>>>>> > I think it makes sense to add support in Spark's
>> ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, and also in Flink's
>> ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag
>> based on (user) input.
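>> >>>>>>>> > To make the Spark action side concrete, a rough sketch (the
>> cleanExpiredMetadata(boolean) call on the action would be the hypothetical
>> new piece; the other builder calls already exist, and 'spark'/'table' are
>> assumed to be an existing SparkSession and an already-loaded Iceberg Table):
>> >>>>>>>> >
>> import java.util.concurrent.TimeUnit;
>> import org.apache.iceberg.spark.actions.SparkActions;
>>
>> long olderThan = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
>> SparkActions.get(spark)
>>     .expireSnapshots(table)
>>     .expireOlderThan(olderThan)
>>     .cleanExpiredMetadata(true) // hypothetical: the proposed new method
>>     .execute();
>> >>>>>>>> >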
>> >>>>>>>> >
>> >>>>>>>> > WDYT?
>> >>>>>>>> >
>> >>>>>>>> > Regards,
>> >>>>>>>> > Gabor
>>
>
