Hi Iceberg Community,

It's been a while since the last activity on this thread, but let me bump
this conversation because several people (Manu, Peter, Pucheng) showed
interest in exposing the `cleanExpiredMetadata` flag through the Spark and
Flink procedures.
I understand the long-term goal is to delegate such functionality to
catalogs instead, but could we reconsider this addition for the shorter
term?
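
To make the ask concrete, here is a rough sketch of how this could look from
the user side, invoked via the Java API (the clean_expired_metadata procedure
argument is hypothetical at this point; today only the core Java
ExpireSnapshots API exposes the flag):

    // 'spark' is an existing SparkSession with an Iceberg catalog named
    // 'my_catalog'; the clean_expired_metadata argument is the proposed
    // addition, not an existing procedure option.
    spark.sql(
        "CALL my_catalog.system.expire_snapshots("
            + "table => 'db.sample', "
            + "older_than => TIMESTAMP '2025-03-01 00:00:00', "
            + "clean_expired_metadata => true)");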

Regards,
Gabor Kaszab

Pucheng Yang <py...@pinterest.com.invalid> wrote (on Mon, May 12, 2025 at
16:14):

> Thanks all for the discussion. I also agree that we should keep this
> behavior turned off by default, and I would also love to see this flag
> added to the Spark/Flink procedures. Having this feature available on
> the client side seems more achievable in the short run, while designing
> a server-side solution might take more time (e.g. spec changes, vendor
> implementations, etc.).
>
> On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org>
> wrote:
>
>> Thanks for the responses!
>>
>> I share your concern, Manu and Peter: many stakeholders in this community
>> don't have a catalog that is capable of executing table maintenance (e.g.
>> HiveCatalog) and rely on the Spark procedures and actions for this purpose.
>> I feel we should still give them the new functionality to clean expired
>> metadata (specs, schemas) by extending the Spark and Flink procedures.
>>
>> Regards,
>> Gabor
>>
>> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I know of several companies that are using either scheduled stored
>>> procedures or the existing actions to maintain production tables.
>>> I don't think we should deprecate them until there is a viable open
>>> solution for them.
>>>
>>> Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025 at
>>> 17:52):
>>>
>>>> I think a catalog service can also use Spark/Flink procedures for table
>>>> maintenance, to utilize existing systems and cluster resources.
>>>>
>>>> If we no longer support new functionality in Spark/Flink procedures, we
>>>> are effectively deprecating them, right?
>>>>
>>>> Gabor Kaszab <gaborkas...@apache.org> wrote on Thu, Mar 20, 2025 at 00:07:
>>>>
>>>>> Thanks for the responses so far!
>>>>>
>>>>> Sure, keeping the default as false makes sense because this is a new
>>>>> feature, so let's be on the safe side.
>>>>>
>>>>> About exposing the flag in the Spark action/procedure and also via
>>>>> Flink:
>>>>> I believe there are currently a number of vendors that don't have a
>>>>> catalog capable of performing table maintenance. We, for instance,
>>>>> advise our users to use Spark procedures for table maintenance. Hence,
>>>>> it would come in quite handy for us to also have a way to control the
>>>>> functionality behind the 'cleanExpiredMetadata' flag through the
>>>>> expire_snapshots procedure. Since the functionality is already there in
>>>>> the Java ExpireSnapshots API, this seems like a low-effort exercise.
>>>>> I'd like to avoid telling users to call the Java API directly, but if
>>>>> extending the procedures is not an option and the catalog implementation
>>>>> in use doesn't support this either, I don't see what other options we
>>>>> have here.
>>>>> Taking these into consideration, would it be possible to extend the
>>>>> Spark and Flink procedures to support setting this flag?
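>>>>>
>>>>> For reference, a minimal sketch of the existing core Java API usage
>>>>> (assuming 'table' is an org.apache.iceberg.Table and that the
>>>>> ExpireSnapshots builder exposes cleanExpiredMetadata as a boolean
>>>>> setter, as mentioned above):
>>>>>
>>>>>     // Expire snapshots older than a caller-chosen timestamp (millis)
>>>>>     // and also drop partition specs/schemas no longer referenced.
>>>>>     table.expireSnapshots()
>>>>>         .expireOlderThan(olderThanMillis)
>>>>>         .cleanExpiredMetadata(true)
>>>>>         .commit();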
>>>>>
>>>>> Thanks,
>>>>> Gabor
>>>>>
>>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>>
>>>>>> I don't think it is necessary to either make cleanup the default or
>>>>>> to expose the flag in Spark or other engines.
>>>>>>
>>>>>> Right now, catalogs are taking on a lot more responsibility for
>>>>>> things like snapshot expiration, orphan file cleanup, and schema or
>>>>>> partition spec removal. Ideally, those are tasks that catalogs handle
>>>>>> rather than having clients run them, but right now we have systems for
>>>>>> keeping tables clean (i.e. expiring snapshots) that are built using
>>>>>> clients rather than being controlled through catalogs. That's not a
>>>>>> problem and we want to continue to support them, but I also don't think
>>>>>> that we should make the problem worse. I think we should consider
>>>>>> schema and partition spec cleanup to be catalog service tasks, so we
>>>>>> should not spend much effort to make them easily available to users.
>>>>>> And we should not make them the default behavior because we don't want
>>>>>> clients removing these manually and duplicating work on the client and
>>>>>> in REST services.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Gabor
>>>>>>>
>>>>>>> I think the question is "when". As it's a behavior change, I don't
>>>>>>> think we should do that in a "minor" release, else users would be
>>>>>>> "surprised".
>>>>>>> I would propose keeping the current behavior in Iceberg Java 1.x and
>>>>>>> changing the flag to true in Iceberg Java 2.x (after a vote).
>>>>>>>
>>>>>>> Regards
>>>>>>> JB
>>>>>>>
>>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <
>>>>>>> gaborkas...@apache.org> wrote:
>>>>>>> >
>>>>>>> > Hi Iceberg Community,
>>>>>>> >
>>>>>>> > There were recent additions to RemoveSnapshots to expire unused
>>>>>>> partition specs and schemas. This is controlled by a flag called
>>>>>>> 'cleanExpiredMetadata', which has a default value of 'false'.
>>>>>>> Additionally, Spark and Flink currently don't offer a way to set this
>>>>>>> flag.
>>>>>>> >
>>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>>>>> > I'm wondering if it's desired by the community to default this
>>>>>>> flag to true. The effect of that would be that each snapshot expiration
>>>>>>> would also clean up the unused partition specs and schemas too. This
>>>>>>> functionality is quite new so this might need some extra confidence by 
>>>>>>> the
>>>>>>> community before turning on by default but I think it's worth a
>>>>>>> consideration.
>>>>>>> >
>>>>>>> > 2) Spark and Flink to support setting this flag
>>>>>>> > I think it makes sense to add support in Spark's
>>>>>>> ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, and also in
>>>>>>> Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting
>>>>>>> this flag based on user input, for example as sketched below.
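>>>>>>> >
>>>>>>> > On the Spark action side this could look roughly like the following
>>>>>>> > (the cleanExpiredMetadata option on the action is a hypothetical
>>>>>>> > sketch of the proposed addition; the other calls exist today):
>>>>>>> >
>>>>>>> >     // 'spark' is an existing SparkSession, 'table' an Iceberg Table.
>>>>>>> >     SparkActions.get(spark)
>>>>>>> >         .expireSnapshots(table)
>>>>>>> >         .expireOlderThan(olderThanMillis)
>>>>>>> >         .cleanExpiredMetadata(true)  // proposed new option
>>>>>>> >         .execute();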
>>>>>>> >
>>>>>>> > WDYT?
>>>>>>> >
>>>>>>> > Regards,
>>>>>>> > Gabor
>>>>>>>
>>>>>>
