> The getMetadataFileBatches method has a space complexity of `O(PM + S + ST + PST)`.
> Same thing for the getMetadataTaskStream method.
> The getManifestTaskStream method has a space complexity of `O(UM)`.
> The handleTask method has a space complexity of `O(UM + PM + S + ST + PST + T)`.

Pierre, thanks for the analysis. I think this response is addressing a slightly different issue than the one originally raised. The initial concern was that embedding base64-encoded manifest files causes excessive heap usage. The new complexity breakdown instead argues that the operations scale with table history and activity.

If the concern is really about the cost of operating on large tables with long metadata history, that seems like a broader, pre-existing characteristic rather than a defect in this specific approach. In that case, it may be clearer to track this as a separate issue or PR. For reference, there have been discussions about moving heavy workloads like this out of the Polaris instance and into a delegation service.
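
To make the distinction concrete, here is a minimal sketch of the access pattern under discussion. It is illustrative only, not the actual Polaris task handler; it assumes a resolved Iceberg FileIO and the table's metadata location, and uses only public Iceberg APIs. Each allManifests call reads one manifest list and returns lightweight ManifestFile descriptors rather than manifest contents, which is the per-snapshot footprint argued for below. Retaining every descriptor across all snapshots, however, grows with the number of snapshots and unique manifest files, which is the scaling Pierre describes:
---
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.io.FileIO;

// Illustrative sketch, not the Polaris task code.
class ManifestFootprintSketch {
  // Each snapshot's manifest list is read into lightweight ManifestFile
  // descriptors (path, length, entry counts), not manifest file contents.
  // Keeping the descriptors for every snapshot still grows with the number
  // of snapshots and the number of unique manifest files.
  static List<ManifestFile> uniqueManifestDescriptors(FileIO io, String metadataLocation) {
    TableMetadata metadata = TableMetadataParser.read(io, metadataLocation);
    Map<String, ManifestFile> unique = new LinkedHashMap<>();
    for (Snapshot snapshot : metadata.snapshots()) {
      for (ManifestFile manifest : snapshot.allManifests(io)) {
        unique.putIfAbsent(manifest.path(), manifest);
      }
    }
    return new ArrayList<>(unique.values());
  }
}
---
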
Yufei

On Thu, Jan 8, 2026 at 2:04 AM Pierre Laporte <[email protected]> wrote:

> As far as I can tell, here is the space complexity for each method. The names used correspond to:
>
> * PM = number of previous metadata files
> * S = number of snapshots
> * ST = number of statistics files
> * PST = number of partition statistics files
> * UM = number of unique manifest files across all snapshots
> * T = total number of created TaskEntities
>
> The getMetadataFileBatches method has a space complexity of `O(PM + S + ST + PST)`.
> Same thing for the getMetadataTaskStream method.
> The getManifestTaskStream method has a space complexity of `O(UM)`.
> The handleTask method has a space complexity of `O(UM + PM + S + ST + PST + T)`.
>
> Based on those elements, it is clear that the current implementation will run into heap pressure for tables with many snapshots and frequent updates, or tables with long metadata history.
>
> On Wed, Jan 7, 2026 at 9:55 PM Yufei Gu <[email protected]> wrote:
>
> > Hi all,
> >
> > After taking a closer look, I am not sure the issue as currently described is actually valid.
> >
> > The base64-encoded manifest objects [1] being discussed are not the manifest files themselves. They are objects representing manifest files, which can be reconstructed from the manifest entries stored in the ManifestList files. As a result, the in-memory footprint should be roughly equivalent to the size of a single manifest list file per snapshot, plus some additional base64 encoding overhead. That overhead does not seem significant enough on its own to explain large heap pressure.
> >
> > This pattern is also handled in practice today. For example, multiple Spark procedures/actions and Spark planning process these manifest representations within a single node, typically the driver, without materializing full manifest files in memory. One concrete example is here:
> >
> > https://github.com/apache/iceberg/blob/bf1074ff373c929614a3f92dd4e46780028ac1ca/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java#L290
> >
> > Given this, I am not convinced that embedding these manifest representations is inherently problematic from a memory perspective. If there are concrete scenarios where this still leads to excessive memory usage, it would be helpful to clarify where the amplification happens beyond what is already expected from manifest list processing.
> >
> > Happy to be corrected if I am missing something, but wanted to share this observation before we anchor further design decisions on this assumption.
> >
> > 1. https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L194
> >
> > Yufei
> >
> > On Tue, Jan 6, 2026 at 8:46 PM Yufei Gu <[email protected]> wrote:
> >
> > > Thanks Adam and Robert for the replies.
> > >
> > > Just to make sure I am understanding this correctly.
> > >
> > > The core issue the proposal is addressing is described in https://github.com/apache/polaris/issues/2365: today, the full binary Iceberg manifest files, base64-encoded, are embedded in the task entities. As a result, when a purge runs, all manifests for a table end up being materialized in memory at once. This behavior creates significant heap pressure and may lead to out-of-memory failures during purge operations.
> > >
> > > Please let me know if this matches your intent, or if I am missing anything.
> > >
> > > Yufei
> > >
> > > On Sat, Dec 20, 2025 at 4:53 AM Robert Stupp <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Thanks Adam! You're spot on!
> > >>
> > >> I wanted to separate the discussions about this functionality and the related async & reliable tasks proposal [1].
> > >>
> > >> The functionality of this one is generally intended for long(er) running operations against object stores, and already provides the necessary functionality to fix the existing OOM issue.
> > >>
> > >> Robert
> > >>
> > >> [1] https://lists.apache.org/thread/kqm0w38p7bnojq455yz7d2vdfp6ky1h7
> > >>
> > >> On Fri, Dec 19, 2025 at 3:43 PM Adam Christian <[email protected]> wrote:
> > >> >
> > >> > Howdy Robert,
> > >> >
> > >> > I reviewed the PR and it is very clean. I really enjoy the simplicity of the FileOperations interface. I think that's going to be a great extension point.
> > >> >
> > >> > One of the things I was wondering about was what to do with the current purge implementation. I understand that it has a high likelihood of hitting an out-of-memory exception [1]. I'm wondering if your idea was to build this, then replace the current implementation. I'd love to ensure that we have a plan for one clean, stable implementation.
> > >> >
> > >> > [1] https://github.com/apache/polaris/issues/2365
> > >> >
> > >> > Go community,
> > >> >
> > >> > Adam
> > >> >
> > >> > On Tue, Dec 16, 2025 at 10:25 AM Adam Christian <[email protected]> wrote:
> > >> >
> > >> > > Hi Yufei,
> > >> > >
> > >> > > Great questions!
> > >> > >
> > >> > > From what I can see in the PR, here are the answers to your questions:
> > >> > > 1. The first major scenario is improving the memory concerns with purge. That's important to stabilize a core use case in the service.
> > >> > > 2. These are related specifically to file operations. I cannot see a way that it would be broader than that.
> > >> > >
> > >> > > Go community,
> > >> > >
> > >> > > Adam
> > >> > >
> > >> > > On Mon, Dec 15, 2025, 3:20 PM Yufei Gu <[email protected]> wrote:
> > >> > >
> > >> > >> Hi Robert,
> > >> > >>
> > >> > >> Thanks for sharing the proposal and the PR. Before diving deeper into the API shape, I was hoping to better understand the intended use cases you have in mind:
> > >> > >>
> > >> > >> 1. What concrete scenarios are you primarily targeting with these long-running object store operations?
> > >> > >> 2. Are these mostly expected to be file/object-level maintenance tasks (e.g. purge, cleanup), or do you envision broader categories of operations leveraging the same abstraction?
> > >> > >>
> > >> > >> Having a clearer picture of the motivating use cases would help evaluate the right level of abstraction and where this should live architecturally.
> > >> > >>
> > >> > >> Looking forward to the discussion.
> > >> > >>
> > >> > >> Yufei
> > >> > >>
> > >> > >> On Fri, Dec 12, 2025 at 3:48 AM Robert Stupp <[email protected]> wrote:
> > >> > >>
> > >> > >> > Hi all,
> > >> > >> >
> > >> > >> > I'd like to propose an API and corresponding implementation for (long running) object store operations.
> > >> > >> >
> > >> > >> > It provides a CPU- and heap-friendly API and implementation to work against object stores. It is built in a way to provide "pluggable" functionality. What I mean is this (Java pseudo code):
> > >> > >> > ---
> > >> > >> > FileOperations fileOps =
> > >> > >> >     fileOperationsFactory.createFileOperations(fileIoInstance);
> > >> > >> > Stream<FileSpec> allIcebergTableFiles =
> > >> > >> >     fileOps.identifyIcebergTableFiles(metadataLocation);
> > >> > >> > PurgeStats purged = fileOps.purge(allIcebergTableFiles);
> > >> > >> > // or simpler:
> > >> > >> > PurgeStats purged = fileOps.purgeIcebergTable(metadataLocation);
> > >> > >> > // or similarly for Iceberg views
> > >> > >> > PurgeStats purged = fileOps.purgeIcebergView(metadataLocation);
> > >> > >> > // or to purge all files underneath a prefix
> > >> > >> > PurgeStats purged = fileOps.purge(fileOps.findFiles(prefix));
> > >> > >> > ---
> > >> > >> >
> > >> > >> > Not mentioned in the pseudo code is the ability to rate-limit the number of purged files or batch-deletions and to configure the deletion batch-size.
> > >> > >> >
> > >> > >> > The PR [1] already contains tests against an on-heap object store mock and integration tests against S3/GCS/Azure emulators.
> > >> > >> >
> > >> > >> > More details can be found in the README [2] included in the PR and of course in the code in the PR.
> > >> > >> >
> > >> > >> > Robert
> > >> > >> >
> > >> > >> > [1] https://github.com/apache/polaris/pull/3256
> > >> > >> > [2] https://github.com/snazy/polaris/blob/obj-store-ops/storage/files/README.md
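
For the streaming shape sketched in the quoted proposal, a bounded-batch consumer illustrates how a deletion batch-size can cap heap usage. Only the FileSpec name and the Stream-based contract are taken from the pseudo code above; the batching helper itself is an assumption for illustration, not the PR's actual implementation:
---
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Placeholder for the FileSpec type from the proposal; only a location is needed here.
record FileSpec(String location) {}

class BatchedPurgeSketch {
  // Consumes the stream lazily and hands fixed-size batches to the deleter, so
  // heap usage is bounded by batchSize rather than by the total number of files.
  static void purgeInBatches(Stream<FileSpec> files, int batchSize, Consumer<List<FileSpec>> deleteBatch) {
    List<FileSpec> batch = new ArrayList<>(batchSize);
    Iterator<FileSpec> iterator = files.iterator();
    while (iterator.hasNext()) {
      batch.add(iterator.next());
      if (batch.size() == batchSize) {
        deleteBatch.accept(List.copyOf(batch));
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      deleteBatch.accept(List.copyOf(batch));
    }
  }
}
---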
