Fair.

On scalability, the local actions are intended for testing, development, and small tables where adding Spark as a dependency might be overkill. They're definitely not meant as a replacement for Spark. I see the point that some folks might overestimate the scalability of purely local implementations. Also, it is possible to write fast and robust local implementations, but it is not easy.

On simplicity, I agree with your argument of doing things one way. On the other hand, local actions are the definition of simplicity whereas Spark/Flink are not. Today, we also have actions written in Flink, but clearly we need them in the Flink context where we cannot use Spark. So the question is whether there is an actual need for non-distributed actions. If the community doesn't see a strong need for this outside of what Spark/Flink in their local modes already provide, I'd opt for the simplest approach, which is to write zero lines of code :)

Cheers,
Max

On Thu, Feb 26, 2026 at 5:08 PM Russell Spitzer <[email protected]> wrote:
>
> I'm not sure we can write a stable version for all of those, DeleteOrphans
> and RewriteDataFiles are the big ones for me which I think would break down
> pretty quickly at scale.
>
> My other concern is that I generally don't like when there are two ways to do
> something. One thought here is, do we get far enough if we just have Spark in
> Local mode for our already existing options? That wouldn't require a new
> structure but users would need Spark on the classpath.
>
> On Thu, Feb 26, 2026 at 10:02 AM Maximilian Michels <[email protected]> wrote:
>>
>> Hey Russell,
>>
>> I agree, Table API already has ExpireSnapshots and RewriteManifests.
>> In that case, the wrappers add two things on top:
>>
>> 1. Result reporting with actual delete counts across the different
>> file types. The current table API doesn't return a result object.
>> 2. Consistent API: ActionsProvider would aggregate all available local
>> actions in one place for consumers like CLI tools, testing, etc.
>>
>> The more interesting actions are the ones without Table API
>> equivalents: DeleteOrphanFiles, RewriteTablePath, RewriteDataFiles.
>>
>> I think it would be useful to be able to run all actions without Spark
>> dependencies. What do you think?
>>
>> Cheers,
>> Max
>>
>>
>> On Wed, Feb 25, 2026 at 8:43 PM Russell Spitzer
>> <[email protected]> wrote:
>> >
>> > So for those first two they already exist in our Table.java API
>> >
>> > table.expireSnapshots()
>> >     .expireOlderThan(tsToExpire)
>> >     .commit();
>> >
>> > table.rewriteManifests()
>> >     .commit();
>> >
>> > Only RewriteTablePath doesn't have a local version yet but I think we
>> > could possibly add that
>> >
>> > What were you thinking of adding to the existing apis?
>> >
>> > On Wed, Feb 25, 2026 at 2:17 AM Maximilian Michels <[email protected]> wrote:
>> >>
>> >> Hi Russell,
>> >>
>> >> Exactly, for many actions this is mostly plumbing to make the existing
>> >> functionality available.
>> >>
>> >> >Which ones would you like to add implementations for?
>> >>
>> >> We can start with some simple ones, e.g. ExpireSnapshots,
>> >> RewriteManifests, RewriteTablePath.
>> >>
>> >> -Max
>> >>
>> >>
>> >> On Tue, Feb 24, 2026 at 5:03 PM Russell Spitzer
>> >> <[email protected]> wrote:
>> >> >
>> >> > We already do have non-distributed versions for a bunch of the
>> >> > functionality in core (that's what the actions were based on) so I
>> >> > don't think this is a wild idea. Which ones would you like to add
>> >> > implementations for?
>> >> >
>> >> > On Tue, Feb 24, 2026 at 9:23 AM Maximilian Michels <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Hi everyone,
>> >> >>
>> >> >> I've been looking at the Iceberg Actions [1] and noticed many of them
>> >> >> don't fundamentally require a distributed engine.
>> >> >>
>> >> >> Apart from RewriteDataFiles, most of the maintenance tasks are rather
>> >> >> lightweight in the processing department. Some of them could probably
>> >> >> run faster and with fewer resources locally, backed by a thread pool.
>> >> >>
>> >> >> I wonder whether Iceberg could benefit from a local implementation for
>> >> >> ActionsProvider [2]. We have a lot of the building blocks for these
>> >> >> already available in the core.
>> >> >>
>> >> >> Granted, there are scalability limitations for large tables. Also,
>> >> >> it's often more convenient to use existing (distributed) compute
>> >> >> infrastructure. Yet, there are use cases where distributed computing
>> >> >> isn't strictly required. For example:
>> >> >>
>> >> >> - CLI tooling
>> >> >> - CI/CD pipelines and automation scripts
>> >> >> - REST catalog backends which want to run maintenance internally
>> >> >> - Small tables in general
>> >> >> - Environments where Flink/Spark are not available
>> >> >>
>> >> >> I'm curious to hear your thoughts.
>> >> >>
>> >> >> Cheers,
>> >> >> Max
>> >> >>
>> >> >> [1]
>> >> >> https://github.com/apache/iceberg/tree/501824f0c0032b3225b0fe52b904756f0fe5c589/api/src/main/java/org/apache/iceberg/actions
>> >> >> [2]
>> >> >> https://github.com/apache/iceberg/blob/501824f0c0032b3225b0fe52b904756f0fe5c589/api/src/main/java/org/apache/iceberg/actions/ActionsProvider.java#L24
