I think this matches a lot of the original non-Spark reading and planning
code in the Iceberg project. The only big difference I can see is the
handling of the manifests themselves, which were originally streamed in
the Iceberg version (see ManifestGroup for context and nightmares).
I can see that we are worried about leaving open file handles here, so
that is an OK tradeoff if we want to make it.
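
For anyone who hasn't read that code path, here is a rough sketch of the
streaming style I mean (not code from either project; it only uses
Iceberg's public ManifestFiles / CloseableIterable API, and the helper
name is made up). The try-with-resources is exactly where the
open-file-handle concern comes from: the manifest's input stream stays
open only while its entries are being iterated.
---
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.FileIO;

class ManifestStreamingSketch {
  // Stream the data-file entries of a single manifest. Nothing beyond the
  // current entry is materialized, but the reader keeps a handle open
  // until the try block exits.
  static void forEachDataFilePath(ManifestFile manifest, FileIO io, Consumer<String> sink) {
    try (CloseableIterable<DataFile> entries = ManifestFiles.read(manifest, io)) {
      for (DataFile file : entries) {
        sink.accept(file.path().toString());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
---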

On Thu, Jan 8, 2026 at 4:04 AM Pierre Laporte <[email protected]> wrote:

> As far as I can tell, here is the space complexity for each method.  The
> names used correspond to:
>
> * PM = number of previous metadata files
> * S = number of snapshots
> * ST = number of statistics files
> * PST = number of partition statistics files
> * UM = number of unique manifest files across all snapshots
> * T = total number of created TaskEntities
>
> The getMetadataFileBatches method has a space complexity of
> `O(PM + S + ST + PST)`, and the same holds for the getMetadataTaskStream
> method.
> The getManifestTaskStream method has a space complexity of `O(UM)`.
> The handleTask method has a space complexity of
> `O(UM + PM + S + ST + PST + T)`.
>
> Based on those elements, it is clear that the current implementation will
> run into heap pressure for tables with many snapshots and frequent updates,
> or tables with long metadata history.
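>
> To make the difference concrete, here is a minimal sketch (hypothetical
> names, not the actual Polaris types): collecting every generated task
> before handling it keeps on the order of T objects live at once, while
> consuming the stream element by element keeps only the current one.
> ---
> import java.util.List;
> import java.util.stream.Collectors;
> import java.util.stream.Stream;
>
> class PeakHeapSketch {
>   static void eager(Stream<String> taskStream) {
>     // Peak heap ~ O(T): every task stays reachable until the list is processed.
>     List<String> all = taskStream.collect(Collectors.toList());
>     all.forEach(PeakHeapSketch::handle);
>   }
>
>   static void streaming(Stream<String> taskStream) {
>     // Peak heap ~ O(1) per element: each task becomes garbage after handling.
>     taskStream.forEach(PeakHeapSketch::handle);
>   }
>
>   static void handle(String task) {
>     // placeholder for the real per-task work
>   }
> }
> ---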
>
> On Wed, Jan 7, 2026 at 9:55 PM Yufei Gu <[email protected]> wrote:
>
> > Hi all,
> >
> > After taking a closer look, I am not sure the issue as currently described
> > is actually valid.
> >
> > The base64-encoded manifest objects[1] being discussed are not the manifest
> > files themselves. They are objects representing manifest files, which can
> > be reconstructed from the manifest entries stored in the ManifestList
> > files. As a result, the in-memory footprint should be roughly equivalent to
> > the size of a single manifest list file per snapshot, plus some additional
> > base64 encoding overhead. That overhead does not seem significant enough on
> > its own to explain large heap pressure.
> >
> > This pattern is also handled in practice today. For example, multiple Spark
> > procedures/actions and Spark planning process these manifest
> > representations within a single node, typically the driver, without
> > materializing full manifest files in memory. One concrete example is here:
> >
> > https://github.com/apache/iceberg/blob/bf1074ff373c929614a3f92dd4e46780028ac1ca/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteTablePathSparkAction.java#L290
> >
> > Given this, I am not convinced that embedding these manifest
> > representations is inherently problematic from a memory perspective. If
> > there are concrete scenarios where this still leads to excessive memory
> > usage, it would be helpful to clarify where the amplification happens
> > beyond what is already expected from manifest list processing.
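> >
> > For reference, here is a small sketch of what I mean (illustrative
> > only, plain Iceberg API, not Polaris code): what a snapshot keeps in
> > memory are the lightweight ManifestFile descriptors read from its
> > manifest list, not the Avro content of the manifests themselves.
> > ---
> > import java.util.List;
> >
> > import org.apache.iceberg.ManifestFile;
> > import org.apache.iceberg.Snapshot;
> > import org.apache.iceberg.Table;
> >
> > class ManifestListFootprintSketch {
> >   // Reads only the manifest list of the current snapshot; each entry is
> >   // a small descriptor (path, length, counts, partition summaries).
> >   static int manifestDescriptorCount(Table table) {
> >     Snapshot current = table.currentSnapshot();
> >     List<ManifestFile> manifests = current.allManifests(table.io());
> >     return manifests.size();
> >   }
> > }
> > ---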
> >
> > Happy to be corrected if I am missing something, but wanted to share this
> > observation before we anchor further design decisions on this assumption.
> >
> > 1.
> > https://github.com/apache/polaris/blob/c9efc6c1af202686945efe2e19125e8f116a0206/runtime/service/src/main/java/org/apache/polaris/service/task/TableCleanupTaskHandler.java#L194
> >
> > Yufei
> >
> >
> > On Tue, Jan 6, 2026 at 8:46 PM Yufei Gu <[email protected]> wrote:
> >
> > > Thanks Adam and Robert for the replies.
> > >
> > > Just to make sure I am understanding this correctly.
> > >
> > > The core issue the proposal is addressing is described in
> > > https://github.com/apache/polaris/issues/2365: today, the full binary
> > > Iceberg manifest files, base64-encoded, are embedded in the task
> > > entities. As a result, when a purge runs, all manifests for a table end up
> > > being materialized in memory at once. This behavior creates significant
> > > heap pressure and may lead to out-of-memory failures during purge
> > > operations.
> > >
> > > Please let me know if this matches your intent, or if I am missing
> > > anything.
> > >
> > > Yufei
> > >
> > >
> > > On Sat, Dec 20, 2025 at 4:53 AM Robert Stupp <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Thanks Adam! You're spot on!
> > >>
> > >> I wanted to separate the discussions about this functionality and the
> > >> related async & reliable tasks proposal [1].
> > >>
> > >> This one is generally intended for long(er)-running operations against
> > >> object stores, and it already provides what is needed to fix the
> > >> existing OOM issue.
> > >>
> > >> Robert
> > >>
> > >> [1] https://lists.apache.org/thread/kqm0w38p7bnojq455yz7d2vdfp6ky1h7
> > >>
> > >> On Fri, Dec 19, 2025 at 3:43 PM Adam Christian
> > >> <[email protected]> wrote:
> > >> >
> > >> > Howdy Robert,
> > >> >
> > >> > I reviewed the PR and it is very clean. I really enjoy the simplicity
> > >> > of the FileOperations interface. I think that's going to be a great
> > >> > extension point.
> > >> >
> > >> > One of the things I was wondering about was what to do with the
> > >> > current purge implementation. I understand that it has a high
> > >> > likelihood of hitting an out-of-memory exception [1]. I'm wondering if
> > >> > your idea was to build this, then replace the current implementation.
> > >> > I'd love to ensure that we have a plan for one clean, stable
> > >> > implementation.
> > >> >
> > >> > [1] - https://github.com/apache/polaris/issues/2365
> > >> >
> > >> > Go community,
> > >> >
> > >> > Adam
> > >> >
> > >> > On Tue, Dec 16, 2025 at 10:25 AM Adam Christian
> > >> > <[email protected]> wrote:
> > >> >
> > >> > > Hi Yufei,
> > >> > >
> > >> > > Great questions!
> > >> > >
> > >> > > From what I can see in the PR, here are the answers to your
> > >> > > questions:
> > >> > > 1. The first major scenario is improving the memory concerns with
> > >> > > purge. That's important to stabilize a core use case in the service.
> > >> > > 2. These are related specifically to file operations. I cannot see a
> > >> > > way that it would be broader than that.
> > >> > >
> > >> > > Go community,
> > >> > >
> > >> > > Adam
> > >> > >
> > >> > > On Mon, Dec 15, 2025, 3:20 PM Yufei Gu <[email protected]> wrote:
> > >> > >
> > >> > >> Hi Robert,
> > >> > >>
> > >> > >> Thanks for sharing the proposal and the PR. Before diving deeper
> > >> > >> into the API shape, I was hoping to better understand the intended
> > >> > >> use cases you have in mind:
> > >> > >>
> > >> > >> 1. What concrete scenarios are you primarily targeting with these
> > >> > >> long-running object store operations?
> > >> > >> 2. Are these mostly expected to be file/object-level maintenance
> > >> > >> tasks (e.g. purge, cleanup), or do you envision broader categories
> > >> > >> of operations leveraging the same abstraction?
> > >> > >>
> > >> > >> Having a clearer picture of the motivating use cases would help
> > >> > >> evaluate the right level of abstraction and where this should live
> > >> > >> architecturally.
> > >> > >>
> > >> > >> Looking forward to the discussion.
> > >> > >>
> > >> > >> Yufei
> > >> > >>
> > >> > >>
> > >> > >> > On Fri, Dec 12, 2025 at 3:48 AM Robert Stupp <[email protected]> wrote:
> > >> > >>
> > >> > >> > Hi all,
> > >> > >> >
> > >> > >> > I'd like to propose an API and corresponding implementation [1]
> > >> > >> > for (long-running) object store operations.
> > >> > >> >
> > >> > >> > It provides a CPU- and heap-friendly API and implementation to
> > >> > >> > work against object stores. It is built in a way that provides
> > >> > >> > "pluggable" functionality. What I mean is this (Java pseudo code):
> > >> > >> > ---
> > >> > >> > FileOperations fileOps =
> > >> > >> >     fileOperationsFactory.createFileOperations(fileIoInstance);
> > >> > >> > Stream<FileSpec> allIcebergTableFiles =
> > >> > >> >     fileOps.identifyIcebergTableFiles(metadataLocation);
> > >> > >> > PurgeStats purged = fileOps.purge(allIcebergTableFiles);
> > >> > >> > // or simpler:
> > >> > >> > PurgeStats purged = fileOps.purgeIcebergTable(metadataLocation);
> > >> > >> > // or similarly for Iceberg views
> > >> > >> > PurgeStats purged = fileOps.purgeIcebergView(metadataLocation);
> > >> > >> > // or to purge all files underneath a prefix
> > >> > >> > PurgeStats purged = fileOps.purge(fileOps.findFiles(prefix));
> > >> > >> > ---
> > >> > >> >
> > >> > >> > Not mentioned in the pseudo code is the ability to rate-limit the
> > >> > >> > number of purged files or batch deletions and to configure the
> > >> > >> > deletion batch size.
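> > >> > >> >
> > >> > >> > To illustrate what rate-limiting and the deletion batch size mean
> > >> > >> > here (this is not the API from the PR; BatchDeleter is a made-up
> > >> > >> > interface and the limiter is Guava's RateLimiter):
> > >> > >> > ---
> > >> > >> > import java.util.ArrayList;
> > >> > >> > import java.util.List;
> > >> > >> > import java.util.stream.Stream;
> > >> > >> >
> > >> > >> > import com.google.common.util.concurrent.RateLimiter;
> > >> > >> >
> > >> > >> > class RateLimitedPurgeSketch {
> > >> > >> >   interface BatchDeleter {
> > >> > >> >     void deleteBatch(List<String> paths);
> > >> > >> >   }
> > >> > >> >
> > >> > >> >   // Deletes files in batches of batchSize, never exceeding
> > >> > >> >   // deletesPerSecond on average.
> > >> > >> >   static long purge(Stream<String> files, BatchDeleter deleter,
> > >> > >> >       int batchSize, double deletesPerSecond) {
> > >> > >> >     RateLimiter limiter = RateLimiter.create(deletesPerSecond);
> > >> > >> >     List<String> batch = new ArrayList<>(batchSize);
> > >> > >> >     long[] deleted = {0};
> > >> > >> >     files.forEach(path -> {
> > >> > >> >       batch.add(path);
> > >> > >> >       if (batch.size() == batchSize) {
> > >> > >> >         limiter.acquire(batch.size());
> > >> > >> >         deleter.deleteBatch(List.copyOf(batch));
> > >> > >> >         deleted[0] += batch.size();
> > >> > >> >         batch.clear();
> > >> > >> >       }
> > >> > >> >     });
> > >> > >> >     if (!batch.isEmpty()) {
> > >> > >> >       limiter.acquire(batch.size());
> > >> > >> >       deleter.deleteBatch(List.copyOf(batch));
> > >> > >> >       deleted[0] += batch.size();
> > >> > >> >     }
> > >> > >> >     return deleted[0];
> > >> > >> >   }
> > >> > >> > }
> > >> > >> > ---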
> > >> > >> >
> > >> > >> > The PR already contains tests against an on-heap object store
> > >> > >> > mock and integration tests against S3/GCS/Azure emulators.
> > >> > >> >
> > >> > >> > More details can be found in the README [2] included in the PR
> > >> > >> > and, of course, in the code in the PR.
> > >> > >> >
> > >> > >> > Robert
> > >> > >> >
> > >> > >> > [1] https://github.com/apache/polaris/pull/3256
> > >> > >> > [2]
> > >> > >> > https://github.com/snazy/polaris/blob/obj-store-ops/storage/files/README.md
> > >> > >> >
> > >> > >>
> > >> > >
> > >>
> > >
> >
>
