Thanks for the response, JB!

This could be a responsibility of the catalog, and in turn a TMS, I agree.
However, that seems more of a mid-/long-term solution, while the Spark
expire_snapshots procedure is already there and the Java core implementation
to clean expired specs and schemas already exists in the RemoveSnapshots
API; we just have to connect the dots by exposing a boolean flag through
the procedure (same for Flink).
In my opinion, we can still expect many users/vendors to keep using Spark
procedures for table maintenance for a long time, and this low-risk change
could help them out. There seemed to be other people sharing this thinking
and interested in this change, hence I gave this conversation another go.
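
To make the ask concrete, here is a rough sketch of the Java-API side that is
already in place (treat the exact method names and signatures as my reading of
the current ExpireSnapshots API, not a final proposal):

import org.apache.iceberg.Table;

public class ExpireWithMetadataCleanup {
  // Sketch only: assumes a Table handle loaded from some catalog and that
  // ExpireSnapshots exposes cleanExpiredMetadata(boolean) as discussed here.
  public static void expire(Table table, long olderThanMillis) {
    table.expireSnapshots()
        .expireOlderThan(olderThanMillis)   // drop snapshots older than this timestamp
        .cleanExpiredMetadata(true)         // also remove unused partition specs and schemas
        .commit();
  }
}

The Spark/Flink change would then mostly be plumbing: accepting a boolean
argument in expire_snapshots (exact parameter name to be decided, e.g.
something like clean_expired_metadata) and forwarding it to this call.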

LMK WDYT!

Regards,
Gabor Kaszab

Jean-Baptiste Onofré <j...@nanthrax.net> wrote (on Thu, Jul 3, 2025,
at 9:57):

> Hi Gabor
>
> I would consider cleanExpiredMetadata as a table maintenance procedure.
> So, I agree that it should be managed by a catalog (as part of catalog
> policies and TMS). I'm not against switching the cleanExpiredMetadata
> flag to true and letting the query engine and the catalog deal with that.
>
> Regards
> JB
>
> On Thu, Jul 3, 2025 at 8:32 AM Gábor Kaszab <gaborkas...@apache.org>
> wrote:
> >
> > Hi Iceberg Community,
> >
> > It's been a while since the last activity on this thread, but let me bump
> this conversation because several people (Manu, Peter, Pucheng) showed
> interest in having a way to set `cleanExpiredMetadata` through procedures.
> > I understand the long term goal is to delegate such functionality to
> catalogs instead, but could we reconsider this addition for the shorter
> term?
> >
> > Regards,
> > Gabor Kaszab
> >
> > Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, May 12, 2025,
> at 16:14):
> >>
> >> Thanks all for the discussion. I also agree that we should keep this
> behavior turned off by default. And I would also love to see this flag
> added to the Spark/Flink procedures. Having this feature available
> on the client side seems more achievable in the short run, while designing a
> server-side solution might take more time (e.g. spec change, vendor
> implementation, etc.).
> >>
> >> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
> >>>
> >>> Thanks for the responses!
> >>>
> >>> My concern is the same, Manu, Peter: many stakeholders in this
> community don't have a catalog that is capable of executing table
> maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions
> for this purpose. I feel that we should still give them the new
> functionality to clean expired metadata (specs, schemas) by extending the
> Spark and Flink procedures.
> >>>
> >>> Regards,
> >>> Gabor
> >>>
> >>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <
> peter.vary.apa...@gmail.com> wrote:
> >>>>
> >>>> I know of several companies who are using either scheduled stored
> procedures or the existing actions to maintain production tables.
> >>>> I don't think we should deprecate them until there is a viable open
> solution for them.
> >>>>
> >>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025,
> at 17:52):
> >>>>>
> >>>>> I think a catalog service can also use Spark/Flink procedures for
> table maintenance, to utilize existing systems and cluster resources.
> >>>>>
> >>>>> If we no longer support new functionality in Spark/Flink procedures,
> we are effectively deprecating them, right?
> >>>>>
> >>>>> Gabor Kaszab <gaborkas...@apache.org> wrote (on Thu, Mar 20, 2025, at 00:07):
> >>>>>>
> >>>>>> Thanks for the responses so far!
> >>>>>>
> >>>>>> Sure, keeping the default as false makes sense because this is a
> new feature, so let's be on the safe side.
> >>>>>>
> >>>>>> About exposing the flag in the Spark action/procedure and
> also via Flink:
> >>>>>> I believe there are currently a number of vendors that don't have a
> catalog capable of performing table maintenance. We, for instance,
> advise our users to use Spark procedures for table maintenance. Hence, it
> would come in quite handy for us to also have a way to control the
> functionality behind the 'cleanExpiredMetadata' flag through the
> expire_snapshots procedure. Since the functionality is already there in the
> Java ExpireSnapshots API, this seems like a low-effort exercise.
> >>>>>> I'd like to avoid telling users to call the Java API directly,
> but if extending the procedure is not an option and the catalog
> implementation in use doesn't support this, I don't see what other
> possibilities we have here.
> >>>>>> Taking these into consideration, would it be possible to allow
> extending the Spark and Flink procedures with support for setting this flag?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Gabor
> >>>>>>
> >>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I don't think it is necessary to either make cleanup the default
> or to expose the flag in Spark or other engines.
> >>>>>>>
> >>>>>>> Right now, catalogs are taking on a lot more responsibility for
> things like snapshot expiration, orphan file cleanup, and schema or
> partition spec removal. Ideally, those are tasks that catalogs handle
> rather than having clients run them, but right now we have systems for
> keeping tables clean (i.e. expiring snapshots) that are built using clients
> rather than being controlled through catalogs. That's not a problem and we
> want to continue to support them, but I also don't think that we should
> make the problem worse. I think we should consider schema and partition
> spec cleanup to be catalog service tasks, so we should not spend much
> effort to make them easily available to users. And we should not make them
> the default behavior because we don't want clients removing these manually
> and duplicating work on the client and in REST services.
> >>>>>>>
> >>>>>>> Ryan
> >>>>>>>
> >>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >>>>>>>>
> >>>>>>>> Hi Gabor
> >>>>>>>>
> >>>>>>>> I think the question is "when". As it's a behavior change, I don't
> >>>>>>>> think we should do that on a "minor" release, else users would be
> >>>>>>>> "surprised".
> >>>>>>>> I would propose to keep the current behavior on Iceberg Java 1.x
> and
> >>>>>>>> change the flag to true on Iceberg Java 2.x (after a vote).
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> JB
> >>>>>>>>
> >>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <
> gaborkas...@apache.org> wrote:
> >>>>>>>> >
> >>>>>>>> > Hi Iceberg Community,
> >>>>>>>> >
> >>>>>>>> > There were recent additions to RemoveSnapshots to expire the
> unused partition specs and schemas. This is controlled by a flag called
> 'cleanExpiredMetadata' with a default value of 'false'. Additionally, Spark
> and Flink currently don't offer a way to set this flag.
> >>>>>>>> >
> >>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
> >>>>>>>> > I'm wondering if it's desired by the community to default this
> flag to true. The effect of that would be that each snapshot expiration
> would also clean up the unused partition specs and schemas. This
> functionality is quite new, so it might need some extra confidence from the
> community before turning it on by default, but I think it's worth
> considering.
> >>>>>>>> >
> >>>>>>>> > 2) Spark and Flink to support setting this flag
> >>>>>>>> > I think it makes sense to add support in Spark's
> ExpireSnapshotProcedure and ExpireSnapshotsSparkAction, and also in Flink's
> ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag
> based on user input.
> >>>>>>>> >
> >>>>>>>> > WDYT?
> >>>>>>>> >
> >>>>>>>> > Regards,
> >>>>>>>> > Gabor
>
