I know of several companies that are using either scheduled stored
procedures or the existing actions to maintain production tables.
I don't think we should deprecate them until there is a viable open
solution for them.

Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025, at
17:52):

> I think a catalog service can also use Spark/Flink procedures for table
> maintenance, to utilize existing systems and cluster resources.
>
> If we no longer support new functionality in Spark/Flink procedures, we
> are effectively deprecating them, right?
>
> Gabor Kaszab <gaborkas...@apache.org> wrote on Thu, Mar 20, 2025, 00:07:
>
>> Thanks for the responses so far!
>>
>> Sure, keeping the default as false makes sense since this is a new
>> feature; let's be on the safe side.
>>
>> About exposing the flag in the Spark action/procedure and also via
>> Flink:
>> I believe there are currently a number of vendors that don't have a
>> catalog capable of performing table maintenance. We, for instance,
>> advise our users to use Spark procedures for table maintenance. Hence,
>> it would come in quite handy for us to also have a way to control the
>> functionality behind the 'cleanExpiredMetadata' flag through the
>> expire_snapshots procedure. Since the functionality is already there in
>> the Java ExpireSnapshots API, this seems like a low-effort exercise.
>> I'd like to avoid telling the users to call the Java API directly, but
>> if extending the procedure is not an option, and the catalog
>> implementation in use doesn't support this either, I don't see what
>> other possibilities we have here.
>> Taking these into consideration, would it be possible to extend the
>> Spark and Flink integrations to support setting this flag?
>>
>> Thanks,
>> Gabor
>>
>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> I don't think it is necessary either to make cleanup the default or
>>> to expose the flag in Spark or other engines.
>>>
>>> Right now, catalogs are taking on a lot more responsibility for
>>> things like snapshot expiration, orphan file cleanup, and schema or
>>> partition spec removal. Ideally, those are tasks that catalogs handle
>>> rather than having clients run them, but right now we have systems for
>>> keeping tables clean (i.e. expiring snapshots) that are built using
>>> clients rather than being controlled through catalogs. That's fine and
>>> we want to continue to support them, but I also don't think we should
>>> make the situation worse. I think we should consider schema and
>>> partition spec cleanup to be catalog service tasks, so we should not
>>> spend much effort to make them easily available to users. And we
>>> should not make them the default behavior, because we don't want
>>> clients removing this metadata manually and duplicating work on the
>>> client and in REST services.
>>>
>>> Ryan
>>>
>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Gabor
>>>>
>>>> I think the question is "when". As it's a behavior change, I don't
>>>> think we should do that in a "minor" release, or users would be
>>>> "surprised".
>>>> I would propose keeping the current behavior in Iceberg Java 1.x and
>>>> changing the flag to true in Iceberg Java 2.x (after a vote).
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org>
>>>> wrote:
>>>> >
>>>> > Hi Iceberg Community,
>>>> >
>>>> > There were recent additions to RemoveSnapshots to expire unused
>>>> > partition specs and schemas. This is controlled by a flag called
>>>> > 'cleanExpiredMetadata', which has a default value of 'false'.
>>>> > Additionally, Spark and Flink currently don't offer a way to set
>>>> > this flag.
>>>> >
>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>> > I'm wondering whether the community wants to default this flag to
>>>> > true. The effect would be that each snapshot expiration also cleans
>>>> > up the unused partition specs and schemas. This functionality is
>>>> > quite new, so the community might need some extra confidence before
>>>> > turning it on by default, but I think it's worth considering.
>>>> >
>>>> > 2) Spark and Flink to support setting this flag
>>>> > I think it makes sense to add support in Spark's
>>>> > ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, and also
>>>> > in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow
>>>> > setting this flag based on user input. For illustration, the Spark
>>>> > side could look something like the sketch below.
>>>> >
>>>> > WDYT?
>>>> >
>>>> > Regards,
>>>> > Gabor
>>>>
>>>
