I know of several companies that are using either scheduled stored procedures or the existing actions to maintain production tables. I don't think we should deprecate them until there is a viable open solution for them.
Manu Zhang <owenzhang1...@gmail.com> wrote (on Wed, Mar 19, 2025 at 17:52):

> I think a catalog service can also use Spark/Flink procedures for table
> maintenance, to utilize existing systems and cluster resources.
>
> If we no longer support new functionality in Spark/Flink procedures, we
> are effectively deprecating them, right?
>
> Gabor Kaszab <gaborkas...@apache.org> wrote on Thu, Mar 20, 2025 at 00:07:
>
>> Thanks for the responses so far!
>>
>> Sure, keeping the default as false makes sense because this is a new
>> feature, so let's be on the safe side.
>>
>> About exposing this flag in the Spark action/procedure and also via
>> Flink: I believe there are currently a number of vendors that don't
>> have a catalog capable of performing table maintenance. We, for
>> instance, advise our users to use Spark procedures for table
>> maintenance. Hence, it would come in quite handy for us to also have a
>> way to control the functionality behind the 'cleanExpiredMetadata'
>> flag through the expire_snapshots procedure. Since the functionality
>> is already there in the Java ExpireSnapshots API, this seems like a
>> low-effort exercise.
>> I'd like to avoid telling users to call the Java API directly, but if
>> extending the procedure is not an option, and the catalog
>> implementation in use doesn't support this either, I don't see what
>> other possibilities we have here.
>> Taking these into consideration, would it be possible to extend Spark
>> and Flink with support for setting this flag?
>>
>> Thanks,
>> Gabor
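For reference, the direct Java API call mentioned above would look roughly
like the minimal sketch below (assuming an already-loaded Table handle;
cleanExpiredMetadata on ExpireSnapshots is the flag discussed in this
thread, everything else is standard API usage):

    // Minimal sketch: expire snapshots and also clean up unused partition
    // specs and schemas via the cleanExpiredMetadata flag. Assumes 'table'
    // is an already-loaded org.apache.iceberg.Table.
    import java.util.concurrent.TimeUnit;

    import org.apache.iceberg.Table;

    public class ExpireSnapshotsExample {
        public static void expireWithMetadataCleanup(Table table) {
            long weekAgoMillis =
                System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
            table.expireSnapshots()
                .expireOlderThan(weekAgoMillis)  // drop snapshots older than 7 days
                .cleanExpiredMetadata(true)      // also remove unused specs and schemas
                .commit();
        }
    }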
>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> I don't think it is necessary to either make cleanup the default or to
>>> expose the flag in Spark or other engines.
>>>
>>> Right now, catalogs are taking on a lot more responsibility for things
>>> like snapshot expiration, orphan file cleanup, and schema or partition
>>> spec removal. Ideally, those are tasks that catalogs handle rather
>>> than having clients run them, but right now we have systems for
>>> keeping tables clean (i.e. expiring snapshots) that are built using
>>> clients rather than being controlled through catalogs. That's not a
>>> problem and we want to continue to support them, but I also don't
>>> think that we should make the problem worse. I think we should
>>> consider schema and partition spec cleanup to be catalog service
>>> tasks, so we should not spend much effort to make them easily
>>> available to users. And we should not make them the default behavior,
>>> because we don't want clients removing these manually and duplicating
>>> work on the client and in REST services.
>>>
>>> Ryan
>>>
>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> Hi Gabor
>>>>
>>>> I think the question is "when". As it's a behavior change, I don't
>>>> think we should do that in a "minor" release, else users would be
>>>> "surprised". I would propose to keep the current behavior in Iceberg
>>>> Java 1.x and change the flag to true in Iceberg Java 2.x (after a
>>>> vote).
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org>
>>>> wrote:
>>>> >
>>>> > Hi Iceberg Community,
>>>> >
>>>> > There were recent additions to RemoveSnapshots to expire unused
>>>> > partition specs and schemas. This is controlled by a flag called
>>>> > 'cleanExpiredMetadata', which has a default value of 'false'.
>>>> > Additionally, Spark and Flink don't currently offer a way to set
>>>> > this flag.
>>>> >
>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>> > I'm wondering whether the community wants to default this flag to
>>>> > true. The effect would be that each snapshot expiration would also
>>>> > clean up the unused partition specs and schemas. This functionality
>>>> > is quite new, so it might need some extra confidence from the
>>>> > community before being turned on by default, but I think it's worth
>>>> > considering.
>>>> >
>>>> > 2) Spark and Flink support for setting this flag
>>>> > I think it makes sense to add support in Spark's
>>>> > ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, as well as
>>>> > in Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow
>>>> > setting this flag based on user input.
>>>> >
>>>> > WDYT?
>>>> >
>>>> > Regards,
>>>> > Gabor
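To make point 2 concrete, a call to an extended expire_snapshots procedure
might look like the sketch below. This is a hypothetical illustration: the
clean_expired_metadata argument name is an assumption, since the thread only
proposes adding such an option; the other arguments follow the procedure's
existing documented signature.

    // Hypothetical sketch of invoking the proposed Spark procedure
    // extension from Java. The clean_expired_metadata argument is assumed
    // here and does not exist in the procedure as of this thread.
    import org.apache.spark.sql.SparkSession;

    public class ExpireSnapshotsProcedureExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("expire-snapshots-example")
                .getOrCreate();

            // Existing arguments (table, older_than) plus the proposed
            // flag for cleaning up unused partition specs and schemas.
            spark.sql(
                "CALL my_catalog.system.expire_snapshots("
                    + "table => 'db.sample', "
                    + "older_than => TIMESTAMP '2025-03-01 00:00:00.000', "
                    + "clean_expired_metadata => true)"); // hypothetical
        }
    }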