I’m not seeing how the Spark procedure contradicts the catalog solution.
Catalogs can make decisions based on policies and pass parameters down to
Spark procedures to execute. In addition, the procedure can be used by all
catalogs and table maintenance systems.
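As a rough sketch of that flow, assuming the flag gets exposed (the
clean_expired_metadata argument below is the parameter proposed in this
thread, not something the procedure accepts today; catalog and table names
are made up):

    import org.apache.spark.sql.SparkSession;

    // A catalog/TMS resolves its retention policy for a table and passes
    // the resulting parameters down to the existing Spark procedure.
    SparkSession spark = SparkSession.builder().getOrCreate();
    spark.sql(
        "CALL my_catalog.system.expire_snapshots("
            + "table => 'db.sample', "
            + "older_than => TIMESTAMP '2025-06-01 00:00:00', "
            + "clean_expired_metadata => true)");  // proposed flag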

Regards,
Manu

Gábor Kaszab <gaborkas...@gmail.com> wrote on Monday, 7 July 2025 at 21:31:

> Thanks for the response, JB!
>
> This could be a responsibility of the catalog and in turn a TMS, I agree.
> However, that seems more like a mid/long-term solution, while the Spark
> expire_snapshots procedure is already there and the Java core implementation
> to clean expired specs and schemas is already there within the RemoveSnapshots
> API; we just have to connect the dots by exposing a boolean flag through
> the procedure (same for Flink).
> In my opinion, we can still expect many users/vendors to keep using Spark
> procedures for table maintenance for a long time, and this low-risk change
> could help them out. Other people seemed to share this thinking and to be
> interested in this change, hence I gave this conversation another go.
>
> LMK WDYT!
>
> Regards,
> Gabor Kaszab
>
> Jean-Baptiste Onofré <j...@nanthrax.net> wrote on Thursday, 3 July 2025 at
> 9:57:
>
>> Hi Gabor
>>
>> I would consider cleanExpiredMetadata as a table maintenance procedure.
>> So, I agree that it should be managed by a catalog (as part of catalog
>> policies and TMS). I'm not against switching the cleanExpiredMetadata
>> flag to true and letting the query engine and the catalog deal with that.
>>
>> Regards
>> JB
>>
>> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org>
>> wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > It's been a while since the last activity on this thread, but let me
>> bump this conversation because several people have shown interest in
>> having a way to set `cleanExpiredMetadata` through procedures (Manu,
>> Peter, Pucheng).
>> > I understand the long-term goal is to delegate such functionality to
>> catalogs instead, but could we reconsider this addition for the shorter
>> term?
>> >
>> > Regards,
>> > Gabor Kaszab
>> >
>> > Pucheng Yang <py...@pinterest.com.invalid> wrote on Monday, 12 May 2025
>> at 16:14:
>> >>
>> >> Thanks all for the discussion. I also agree that we should keep this
>> behavior turned off by default, and I would also love to see this flag
>> added to the Spark/Flink procedures. Having this feature available on the
>> client side seems more achievable in the short run, while designing a
>> server-side solution might take more time (spec change, vendor
>> implementation, etc.).
>> >>
>> >> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>> >>>
>> >>> Thanks for the responses!
>> >>>
>> >>> My concern is the same, Manu, Peter: many stakeholders in this
>> community don't have a catalog that is capable of executing table
>> maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions
>> for this purpose. I feel that we should still give them the new
>> functionality to clean expired metadata (specs, schemas) by extending the
>> Spark and Flink procedures.
>> >>>
>> >>> Regards,
>> >>> Gabor
>> >>>
>> >>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <
>> peter.vary.apa...@gmail.com> wrote:
>> >>>>
>> >>>> I know of several companies who are using either scheduled stored
>> procedures or the existing actions to maintain production tables.
>> >>>> I don't think we should deprecate them until there is a viable open
>> solution for them.
>> >>>>
>> >>>>> Manu Zhang <owenzhang1...@gmail.com> wrote on Wednesday, 19 March
>> 2025 at 17:52:
>> >>>>>
>> >>>>> I think a catalog service can also use Spark/Flink procedures for
>> table maintenance, to utilize existing systems and cluster resources.
>> >>>>>
>> >>>>> If we no longer support new functionality in Spark/Flink
>> procedures, we are effectively deprecating them, right?
>> >>>>>
>> >>>>> Gabor Kaszab <gaborkas...@apache.org> wrote on Thursday, 20 March
>> 2025 at 00:07:
>> >>>>>>
>> >>>>>> Thanks for the responses so far!
>> >>>>>>
>> >>>>>> Sure, keeping the default as false makes sense because this is a
>> new feature, so let's be on the safe side.
>> >>>>>>
>> >>>>>> About exposing the flag in the Spark action/procedure and
>> also via Flink:
>> >>>>>> I believe there are currently a number of vendors that don't have
>> a catalog that is capable of performing table maintenance. We, for instance,
>> advise our users to use Spark procedures for table maintenance. Hence, it
>> would come in quite handy for us to also have a way to control the
>> functionality behind the 'cleanExpiredMetadata' flag through the
>> expire_snapshots procedure. Since the functionality is already there in the
>> Java ExpireSnapshots API, this seems like a low-effort exercise.
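>> >>>>>> For reference, a minimal sketch of that direct Java call, assuming
>> 'catalog' is an already initialized Iceberg Catalog and using a made-up
>> table name (the cleanExpiredMetadata call is the flag mentioned above):
>> >>>>>>
>> >>>>>>   // imports assumed: org.apache.iceberg.Table,
>> >>>>>>   // org.apache.iceberg.catalog.TableIdentifier, java.util.concurrent.TimeUnit
>> >>>>>>   Table table = catalog.loadTable(TableIdentifier.of("db", "sample"));
>> >>>>>>   table.expireSnapshots()
>> >>>>>>       .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
>> >>>>>>       .cleanExpiredMetadata(true) // also drop unused partition specs and schemas
>> >>>>>>       .commit();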
>> >>>>>> I'd like to avoid telling users to call the Java API directly,
>> but if extending the procedure is not an option and the catalog
>> implementation in use doesn't support this either, I don't see what other
>> possibilities we have here.
>> >>>>>> Taking these into consideration, would it be possible to extend the
>> Spark and Flink procedures with support for setting this flag?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Gabor
>> >>>>>>
>> >>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com>
>> wrote:
>> >>>>>>>
>> >>>>>>> I don't think it is necessary to either make cleanup the default
>> or to expose the flag in Spark or other engines.
>> >>>>>>>
>> >>>>>>> Right now, catalogs are taking on a lot more responsibility for
>> things like snapshot expiration, orphan file cleanup, and schema or
>> partition spec removal. Ideally, those are tasks that catalogs handle
>> rather than having clients run them, but right now we have systems for
>> keeping tables clean (i.e. expiring snapshots) that are built using clients
>> rather than being controlled through catalogs. That's not a problem and we
>> want to continue to support them, but I also don't think that we should
>> make the problem worse. I think we should consider schema and partition
>> spec cleanup to be catalog service tasks, so we should not spend much
>> effort to make them easily available to users. And we should not make them
>> the default behavior because we don't want clients removing these manually
>> and duplicating work on the client and in REST services.
>> >>>>>>>
>> >>>>>>> Ryan
>> >>>>>>>
>> >>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <
>> j...@nanthrax.net> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Gabor
>> >>>>>>>>
>> >>>>>>>> I think the question is "when". As it's a behavior change, I
>> don't
>> >>>>>>>> think we should do that on a "minor" release, else users would be
>> >>>>>>>> "surprised".
>> >>>>>>>> I would propose to keep the current behavior on Iceberg Java 1.x
>> and
>> >>>>>>>> change the flag to true on Iceberg Java 2.x (after a vote).
>> >>>>>>>>
>> >>>>>>>> Regards
>> >>>>>>>> JB
>> >>>>>>>>
>> >>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <
>> gaborkas...@apache.org> wrote:
>> >>>>>>>> >
>> >>>>>>>> > Hi Iceberg Community,
>> >>>>>>>> >
>> >>>>>>>> > There were recent additions to RemoveSnapshots to expire
>> unused partition specs and schemas. This is controlled by a flag called
>> 'cleanExpiredMetadata', which has a default value of 'false'. Additionally,
>> Spark and Flink currently don't offer a way to set this flag.
>> >>>>>>>> >
>> >>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>> >>>>>>>> > I'm wondering whether the community would like to default this
>> flag to true. The effect of that would be that each snapshot expiration
>> would also clean up the unused partition specs and schemas. This
>> functionality is quite new, so it might need some extra confidence from the
>> community before being turned on by default, but I think it's worth
>> considering.
>> >>>>>>>> >
>> >>>>>>>> > 2) Spark and Flink to support setting this flag
>> >>>>>>>> > I think it makes sense to add support in Spark's
>> ExpireSnapshotProcedure and ExpireSnapshotsSparkAction, and also in Flink's
>> ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag
>> based on user input.
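>> >>>>>>>> > As a rough sketch of what that could look like on the Spark
>> side (the cleanExpiredMetadata option on the action is the proposed
>> addition, not part of the current API; 'spark', 'table' and
>> 'olderThanMillis' are assumed to already exist):
>> >>>>>>>> >
>> >>>>>>>> >   SparkActions.get(spark)
>> >>>>>>>> >       .expireSnapshots(table)
>> >>>>>>>> >       .expireOlderThan(olderThanMillis)
>> >>>>>>>> >       .cleanExpiredMetadata(true) // proposed option mirroring the core flag
>> >>>>>>>> >       .execute();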
>> >>>>>>>> >
>> >>>>>>>> > WDYT?
>> >>>>>>>> >
>> >>>>>>>> > Regards,
>> >>>>>>>> > Gabor
>>
>
