Re: [DISCUSS] Proposal: Returning Commit Results from commit()

Jason Fine Sun, 28 Sep 2025 01:17:40 -0700

Thanks for the great discussion so far. I’d like to suggest moving forward with 
a proposal that introduces a new commitWithResult() operation, which would 
return the TableMetadata produced by the commit. This avoids breaking the 
existing API while allowing users who need the result to get it. It also allows 
certain implementations to keep the normal commit operation more efficient if 
collecting the result would require more work.

To support this, we would need to move TableMetadata into the api module. There 
are two viable options here:

1. Move the full TableMetadata as an interface class to the API module.

2. Introduce a minimal ReadOnlyTableMetadata interface in the API module. I 
think this is sufficient since users shouldn't really need to mutate the 
TableMetadata manually.

I suggest option 2.

Given that the REST API already returns the full metadata JSON, and other 
implementations build it locally, standardizing this behavior across all commit 
paths seems reasonable and consistent.
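
To make the shape of option 2 concrete, here is a minimal sketch. It is not 
the Iceberg API: ReadOnlyTableMetadata and commitWithResult() are the names 
from this proposal, while CommitOperation, InMemoryAppend, and the two 
accessor methods are hypothetical stand-ins chosen only for illustration.

```java
import java.util.OptionalLong;

// Hypothetical read-only view. The name comes from the proposal; the methods are illustrative.
interface ReadOnlyTableMetadata {
    String metadataFileLocation();

    // Empty when the table has no snapshot yet.
    OptionalLong currentSnapshotId();
}

// Hypothetical stand-in for a pending operation; not Iceberg's real interface.
interface CommitOperation {
    // Existing contract, left unchanged.
    void commit();

    // New opt-in method. A default keeps existing implementations compiling.
    default ReadOnlyTableMetadata commitWithResult() {
        throw new UnsupportedOperationException("commitWithResult is not supported");
    }
}

// Toy implementation that already holds the new metadata once the commit is done.
final class InMemoryAppend implements CommitOperation {
    private final long newSnapshotId;
    private boolean committed;

    InMemoryAppend(long newSnapshotId) {
        this.newSnapshotId = newSnapshotId;
    }

    @Override
    public void commit() {
        committed = true;
    }

    @Override
    public ReadOnlyTableMetadata commitWithResult() {
        commit();
        final long id = newSnapshotId;
        return new ReadOnlyTableMetadata() {
            @Override
            public String metadataFileLocation() {
                return "memory://metadata/" + id + ".json";
            }

            @Override
            public OptionalLong currentSnapshotId() {
                return OptionalLong.of(id);
            }
        };
    }

    boolean isCommitted() {
        return committed;
    }
}
```

Whether the default should throw, or instead fall back to commit() followed 
by a refresh, is an open design choice that the sketch does not settle.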

From: Russell Spitzer <[email protected]>
Date: Tuesday, 23 September 2025 at 20:57
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] Proposal: Returning Commit Results from commit()

I think we are still waiting here on the full proposal. It sounds like having 
TableOperations return a metadata object on commit is probably the best idea 
here, although it would be up to the caller to understand what was changed by 
the operation itself and what by other operations. The REST API for posting a 
commit does return the full metadata JSON payload, so piping that through 
probably isn't an issue. It may make sense to just standardize everything to 
do this in the future since the REST API already does it ...

Anyway, in this case we would be changing all "commit" APIs as well as 
TableOperations itself, so two pretty major changes that I think we would have 
to target for a 2.0 release of Iceberg. That said, we can keep the discussion 
going now and see whether or not there is community consensus around this 
change.

On Mon, Sep 15, 2025 at 12:51 PM Yufei Gu <[email protected]> wrote:
Hi Endi, Could you elaborate on your use case? Once a commit succeeds, the 
client already holds the latest snapshot as it's a part of the request, so 
what’s the need for an additional call? For any subsequent commits, the client 
would have to reload the table regardless.

Yufei

On Mon, Sep 15, 2025 at 9:34 AM Endi Caushi <[email protected]> wrote:
Hi

>  it's a rather heavy change and should probably be backed by some concrete 
> use cases where the client needs the exact metadata object produced by the 
> operation.

Apologies for chiming in late, but I wanted to share an example from our side.
We ingest our data pipelines incrementally using PySpark, leveraging the 
snapshotID as a watermark.
After each run, we store the new snapshotID in the snapshot summary as the 
updated watermark. It would be very convenient if the commit() operation 
returned the snapshotID directly, as it would save us from making an additional 
round-trip just to retrieve the latest snapshot.

Best regards,
Endi

On Sun, 14 Sept 2025 at 21:44, yuxia <[email protected]> wrote:
Hi, Peter,
Yes, you're right. I meant Apache Fluss, sorry for the mistake.

Thanks for your suggestion, the workaround you proposed can also solve our 
problem.

Best regards,
Yuxia

________________________________
From: "Jason Fine" <[email protected]>
To: "dev" <[email protected]>
Sent: Sunday, 14 September 2025, 7:43:05 PM
Subject: Re: [DISCUSS] Proposal: Returning Commit Results from commit()

I was about to say that we should ask the Flink/Fluss team about this since 
they also do streaming stateful transforms so I expected you would need it as 
well!

That’s a neat trick with the listener, but I agree it’s a little hacky and a 
cleaner interface would be nice. In our case it’s ok if we get a newer commit, 
so we can rely on the refreshed data. It is also another good point that if 
you rely on calling currentSnapshot() currently, you may get newer data than 
desired, since the implementation may call refresh() after a commit.

Russell, regarding some of the points you brought up: I think adding a new 
method to the interface for this will help. In future versions of the REST 
catalog that may only send updates, the method that doesn’t return a result can 
just send the update and not load the response data, while the other method 
will load it.

Regarding implementing all the XXXOperations, I found it to be not much work 
since most of them inherit from the same base class and the info is available 
to the operation itself. In the future, if the catalogs get more complex with 
the REST partial update request, they may require some more work to get the 
required info back from the TableOperations class. For now though it seems like 
it’s not necessary.

Regarding the return type I think there are two decent options:
1. Return TableMetadata (or a minimized ReadOnly interface version of it since 
it’s currently not in the API project)
   Pros – Contains all data that the user may want
   Cons – Might be slower and heavier for future implementations of things like 
the rest catalog
2. Return just locally created info particular to the current operation
   Pros – Can always return locally without additional network calls
   Cons
     · Might not always contain all the info the user wants
     · Implementation requires more work as each Operation is different
     · Might require an additional interface or expanding SnapshotUpdate 
       with an additional generic argument
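
As a rough illustration of option 2 and its cons, the sketch below gives each 
operation its own result type through a generic argument. Every name here 
(ResultingUpdate, AppendResult, SchemaChangeResult, SketchAppend, 
SketchSchemaChange) is hypothetical and not part of Iceberg; the point is only 
that results are built from local state and that each operation needs its own 
result class.

```java
// Hypothetical interface: each operation declares its own result type R.
interface ResultingUpdate<R> {
    R commitWithResult();
}

// Result of an append-like operation, which always creates a snapshot.
final class AppendResult {
    final long snapshotId;
    final int addedDataFiles;

    AppendResult(long snapshotId, int addedDataFiles) {
        this.snapshotId = snapshotId;
        this.addedDataFiles = addedDataFiles;
    }
}

// Result of a schema change, which creates no snapshot at all.
final class SchemaChangeResult {
    final int schemaId;

    SchemaChangeResult(int schemaId) {
        this.schemaId = schemaId;
    }
}

final class SketchAppend implements ResultingUpdate<AppendResult> {
    private final long snapshotId;
    private final int fileCount;

    SketchAppend(long snapshotId, int fileCount) {
        this.snapshotId = snapshotId;
        this.fileCount = fileCount;
    }

    @Override
    public AppendResult commitWithResult() {
        // A real implementation would commit first; only the result shape is modeled here.
        return new AppendResult(snapshotId, fileCount);
    }
}

final class SketchSchemaChange implements ResultingUpdate<SchemaChangeResult> {
    private final int newSchemaId;

    SketchSchemaChange(int newSchemaId) {
        this.newSchemaId = newSchemaId;
    }

    @Override
    public SchemaChangeResult commitWithResult() {
        // No snapshot ID exists for this operation, so the result carries none.
        return new SchemaChangeResult(newSchemaId);
    }
}
```

A caller that only wants a snapshot ID gets it from AppendResult without any 
network call, but has nothing comparable to read from SchemaChangeResult, 
which is the "might not always contain all the info" trade-off listed above.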

From: Péter Váry <[email protected]>
Date: Friday, 12 September 2025 at 15:53
To: [email protected]
Subject: Re: [DISCUSS] Proposal: Returning Commit Results from commit()

Hi Yuxia,

You meant Apache Fluss instead of Apache Flink, right? :)

As a workaround in the meantime, you could add a UUID changeset identifier to 
the commit summary. After refresh, you can find the corresponding snapshot by 
searching the commits for this UUID.
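
A minimal sketch of that lookup, using simplified stand-in types rather than 
the Iceberg classes: SnapshotInfo, ChangesetLookup, and the "changeset-id" key 
are all hypothetical. In Iceberg itself the property would presumably be 
attached with SnapshotUpdate.set(...) before commit() and read back from 
Snapshot.summary() after refresh().

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Simplified stand-in for a snapshot: an ID plus its summary properties.
final class SnapshotInfo {
    final long snapshotId;
    final Map<String, String> summary;

    SnapshotInfo(long snapshotId, Map<String, String> summary) {
        this.snapshotId = snapshotId;
        this.summary = Collections.unmodifiableMap(new HashMap<>(summary));
    }
}

final class ChangesetLookup {
    // Hypothetical summary key; any key that does not clash with existing summary fields would do.
    static final String CHANGESET_KEY = "changeset-id";

    private ChangesetLookup() {
    }

    // Generate the identifier to attach to the commit summary before committing.
    static String newChangesetId() {
        return UUID.randomUUID().toString();
    }

    // Scan from newest to oldest, since our own commit is likely near the end of the history.
    static Optional<SnapshotInfo> findByChangeset(List<SnapshotInfo> snapshots, String changesetId) {
        for (int i = snapshots.size() - 1; i >= 0; i--) {
            SnapshotInfo candidate = snapshots.get(i);
            if (changesetId.equals(candidate.summary.get(CHANGESET_KEY))) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Because the match is on the UUID and not on "latest snapshot", a concurrent 
writer that commits after us does not change the result.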

On Fri, 12 Sept 2025 at 14:30, yuxia <[email protected]> wrote:
Hi, Jason.

Thanks for bringing this up. When integrating Apache Iceberg with Apache Flink 
(incubating)[1], we also needed to capture commit results and store snapshot 
IDs in our internal state to track tiering progress. However, we can’t simply 
refresh the table to get the latest snapshot, since other writers may be 
concurrently committing to the same table — and we only want the snapshot 
generated by our own commit. To work around this, we used the Iceberg listener 
mechanism[2], but this still feels a bit like a hack. It would be much cleaner 
if Iceberg provided a standard interface to return commit results.

[1] https://github.com/apache/fluss
[2] 
https://github.com/apache/fluss/blob/03313a9b02dca57c87c406f0ecf396b08fa8726a/fluss-lake/fluss-lake-iceberg/src/main/java/org/apache/fluss/lake/iceberg/tiering/IcebergLakeCommitter.java#L328

Best regards,
Yuxia

________________________________
From: "Russell Spitzer" <[email protected]>
To: "dev" <[email protected]>
Sent: Friday, 12 September 2025, 2:21:56 AM
Subject: Re: [DISCUSS] Proposal: Returning Commit Results from commit()

I don't think I'm opposed to this idea in general but I think we probably need 
to get some concrete examples of how this is going to be used by a consumer. 
Since this would require modifying every implementation of XXXOperation that 
we currently have, it's a rather heavy change and should probably be backed by 
some concrete use cases where the client needs the exact metadata object 
produced by the operation.

We also need to actually nail down in the proposal the return type as you 
mentioned. I don't think there is a problem returning a table metadata object 
but this would be rather complicated for any REST catalog interface. A REST 
Catalog would still require a round trip to the Catalog to get the new state 
since there is no other way to know what was actually committed as the 
metadata.json is written remotely, so we would still be leaning on 
TableOperations to actually figure out what that is.

For the future, we probably will also have issues as we move towards a "send 
changes" to the catalog model instead of a "send new state" model. In those 
cases we will also have the issue of not actually knowing what was committed 
without contacting the catalog after the commit succeeds. So we also need to 
consider how the REST Spec would need to change to support this.

On Thu, Sep 11, 2025 at 5:08 AM Jason Fine <[email protected]> wrote:
Hi all,
I’d like to start a discussion about PR #13987 
<https://github.com/apache/iceberg/pull/13987>, which adds support for 
returning results from the commit() operation.
________________________________
What this PR is about

The core idea is: when a client calls commit(), they should be able to 
immediately obtain the updated information produced by the commit (whether 
that’s a new snapshot or updated table metadata), instead of performing a 
redundant refresh() afterwards. This is useful for distributed systems that 
want to track their progress and save progress state, but I’m sure it will have 
many other uses as well.
Calling refresh() unnecessarily is a slowdown, and it also counts against your 
quota for rate limits in certain services.

I know some implementations currently call refresh() after a commit and others 
don’t, and the interface doesn’t enforce either behavior. Meanwhile, the wanted 
information is already available in the client after the commit, since the 
client produced it.

________________________________
Key points/concerns raised

  *   API compatibility breakage: Several folks pointed out that returning 
updated snapshot or metadata from commit() changes the existing API contract. 
We can resolve this by adding a new method instead.
  *   What counts as a snapshot: Some committed operations don’t produce 
snapshots — e.g. metadata operations (schema changes, property updates). The 
distinction between operations that produce snapshots vs ones that just update 
metadata matters. Perhaps always returning the TableMetadata, as mentioned 
below, is a good solution.
  *   Behavior varies by catalog implementation: Some implementations already 
refresh automatically (shouldRefresh etc.), others don’t. RestCatalog vs 
Metastore vs others behave differently.

________________________________
Proposal / Possible compromises
From the discussion, here are options that seem promising to me, or ways to 
mitigate the drawbacks:

  1.  Add a new method, e.g. commitWithResult(...)
This method would commit and return the updated snapshot / metadata, but leave 
the existing commit() method with its current behavior. That way we retain 
backward compatibility.
  2.  Return a read-only metadata snapshot
If returning the full metadata object is too heavy or too risky, return a 
minimal “read-only” summary containing just what is needed (snapshotId, maybe 
timestamp). This reduces implementation risk; see GitHub 
<https://github.com/apache/iceberg/pull/14023/files#diff-c941602822c0e1c24d7de4ef5db76105d414d7b3f0b26df7ca75e76ba79e9663>.
This is also helpful if we want to avoid adding a new generic argument to the 
SnapshotUpdate interface.

Please let me know what you think about this suggestion and how we can move it 
forward.

Thanks,
Jason

The information transmitted by Qlik is intended only for the person or entity 
to which it is addressed and may contain confidential and/or privileged 
material. Any review, retransmission, dissemination or other use of, or taking 
of any action in reliance upon, this information by persons or entities other 
than the intended recipient is prohibited. If you received this in error, 
please contact the sender and delete the material from any computer. Qlik's 
Privacy & Cookie 
Notice<https://www.qlik.com/us/legal/privacy-and-cookie-notice> describes how 
we handle personal information

