I think it's reasonable to expose the options through the stored procedure. I just don't think we want to make it the default behavior.
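For reference, this is roughly what calling the existing core Java API looks like today (a minimal sketch; it assumes the cleanExpiredMetadata(boolean) setter on ExpireSnapshots that this thread describes, and `table` is an already-loaded Iceberg Table; the retention values are illustrative only):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    // Minimal sketch: expire old snapshots and also drop partition specs/schemas
    // that are no longer referenced by any remaining snapshot.
    void expireWithMetadataCleanup(Table table) {
        long olderThan = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7); // illustrative retention
        table.expireSnapshots()
            .expireOlderThan(olderThan)
            .cleanExpiredMetadata(true) // the flag discussed in this thread; defaults to false
            .commit();
    }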
On Mon, Jul 7, 2025 at 8:37 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> I'm not seeing how the Spark procedure contradicts the catalog solution. Catalogs can make decisions based on policies and pass down parameters to Spark procedures to execute. In addition, it can be used by all catalogs and table maintenance systems.
>
> Regards,
> Manu
>
> Gábor Kaszab <gaborkas...@gmail.com> wrote on Mon, Jul 7, 2025 at 21:31:
>
>> Thanks for the response, JB!
>>
>> This could be a responsibility of the catalog and in turn a TMS, I agree. However, that seems more of a mid-/long-term solution. The Spark expire_snapshots procedure is already there, and the Java core implementation to clean expired specs and schemas is already there within the RemoveSnapshots API; we just have to connect the dots by exposing a boolean flag through the procedure (same for Flink).
>> In my opinion we can still expect many users/vendors to keep using Spark procedures for table maintenance for a long time, and this low-risk change could help them out. Other people seemed to share this thinking and be interested in this change, hence I gave this conversation another go.
>>
>> LMK WDYT!
>>
>> Regards,
>> Gabor Kaszab
>>
>> Jean-Baptiste Onofré <j...@nanthrax.net> wrote on Thu, Jul 3, 2025 at 9:57:
>>
>>> Hi Gabor
>>>
>>> I would consider cleanExpiredMetadata a table maintenance procedure. So, I agree that it should be managed by a catalog (as part of catalog policies and TMS). I'm not against switching the cleanExpiredMetadata flag to true and letting the query engine and the catalog deal with that.
>>>
>>> Regards
>>> JB
>>>
>>> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org> wrote:
>>>
>>>> Hi Iceberg Community,
>>>>
>>>> It's been a while since the last activity on this thread, but let me bump this conversation because several people showed interest in having a way to switch `cleanExpiredMetadata` through procedures (Manu, Peter, Pucheng).
>>>> I understand the long-term goal is to delegate such functionality to catalogs instead, but could we reconsider this addition for the shorter term?
>>>>
>>>> Regards,
>>>> Gabor Kaszab
>>>>
>>>> Pucheng Yang <py...@pinterest.com.invalid> wrote on Mon, May 12, 2025 at 16:14:
>>>>
>>>>> Thanks all for the discussion. I also agree that this behavior should be turned off by default, and I would also love to see this flag added to the Spark/Flink procedures. Having this feature available on the client side seems more achievable in the short run, while designing a server-side solution might take more time (i.e. spec change, vendor implementation, etc.).
>>>>>
>>>>> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the responses!
>>>>>>
>>>>>> My concern is the same, Manu, Peter: many stakeholders in this community don't have a catalog that is capable of executing table maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions for this purpose. I feel that we should still give them the new functionality to clean expired metadata (specs, schemas) by extending the Spark and Flink procedures.
>>>>>>
>>>>>> Regards,
>>>>>> Gabor
>>>>>>
>>>>>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> I know of several companies that are using either scheduled stored procedures or the existing actions to maintain production tables. I don't think we should deprecate them until there is a viable open solution for them.
>>>>>>>
>>>>>>> Manu Zhang <owenzhang1...@gmail.com> wrote on Wed, Mar 19, 2025 at 17:52:
>>>>>>>
>>>>>>>> I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
>>>>>>>>
>>>>>>>> If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?
>>>>>>>>
>>>>>>>> Gabor Kaszab <gaborkas...@apache.org> wrote on Thu, Mar 20, 2025 at 00:07:
>>>>>>>>
>>>>>>>>> Thanks for the responses so far!
>>>>>>>>>
>>>>>>>>> Sure, keeping the default as false makes sense because this is a new feature, so let's be on the safe side.
>>>>>>>>>
>>>>>>>>> About exposing a way to set the flag in the Spark action/procedure and also via Flink:
>>>>>>>>> I believe there are currently a number of vendors that don't have a catalog capable of performing table maintenance. We, for instance, advise our users to use Spark procedures for table maintenance. Hence, it would come in quite handy for us to also have a way to control the functionality behind the 'cleanExpiredMetadata' flag through the expire_snapshots procedure. Since the functionality is already there in the Java ExpireSnapshots API, this seems a low-effort exercise.
>>>>>>>>> I'd like to avoid telling users to call the Java API directly, but if extending the procedure is not an option and the catalog implementation in use doesn't support this either, I don't see what other possibilities we have here.
>>>>>>>>> Taking these into consideration, would it be possible to extend the Spark and Flink procedures with support for setting this flag?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Gabor
>>>>>>>>>
>>>>>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think it is necessary either to make cleanup the default or to expose the flag in Spark or other engines.
>>>>>>>>>>
>>>>>>>>>> Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks that catalogs handle rather than having clients run them, but right now we have systems for keeping tables clean (i.e. expiring snapshots) that are built using clients rather than being controlled through catalogs. That's not a problem and we want to continue to support them, but I also don't think that we should make the problem worse. I think we should consider schema and partition spec cleanup to be catalog service tasks, so we should not spend much effort to make them easily available to users. And we should not make them the default behavior because we don't want clients removing these manually and duplicating work on the client and in REST services.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Gabor
>>>>>>>>>>>
>>>>>>>>>>> I think the question is "when". As it's a behavior change, I don't think we should do it in a "minor" release, else users would be "surprised". I would propose to keep the current behavior on Iceberg Java 1.x and change the flag to true on Iceberg Java 2.x (after a vote).
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>> JB
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Iceberg Community,
>>>>>>>>>>>>
>>>>>>>>>>>> There were recent additions to RemoveSnapshots to expire unused partition specs and schemas. This is controlled by a flag called 'cleanExpiredMetadata' with a default value of 'false'. Additionally, Spark and Flink currently don't offer a way to set this flag.
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>>>>>>>>>> I'm wondering whether the community wants to default this flag to true. The effect would be that each snapshot expiration also cleans up the unused partition specs and schemas. This functionality is quite new, so it might need some extra confidence from the community before being turned on by default, but I think it's worth considering.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Spark and Flink to support setting this flag
>>>>>>>>>>>> I think it makes sense to add support in Spark's ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, as well as in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag based on (user) input.
>>>>>>>>>>>>
>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Gabor
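
To make point 2) above concrete, here is a rough, hypothetical sketch of how the flag could be surfaced on the Spark action side; the cleanExpiredMetadata(...) setter on ExpireSnapshotsSparkAction does not exist yet and its name here is only an assumption, and the matching expire_snapshots procedure would presumably accept an equivalent boolean argument that is forwarded to the same core flag:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    // Hypothetical: cleanExpiredMetadata(...) is NOT part of ExpireSnapshotsSparkAction
    // today; it is the setter this proposal would add, forwarding to the core
    // RemoveSnapshots flag. `spark` and `table` are assumed to be provided by the caller.
    void expireViaSparkAction(SparkSession spark, Table table, long olderThanMillis) {
        SparkActions.get(spark)
            .expireSnapshots(table)
            .expireOlderThan(olderThanMillis)
            .cleanExpiredMetadata(true) // proposed option, name assumed
            .execute();
    }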