Hi all,
I’d like to start a discussion about PR 
#13987<https://github.com/apache/iceberg/pull/13987>, which adds support for 
returning results from the commit() operation.
________________________________
What this PR is about

The core idea is: when a client calls commit(), they should be able to 
immediately obtain the updated information produced by the commit (whether 
that’s a new snapshot or updated table metadata), instead of performing a 
redundant refresh() afterwards. This is useful for distributed systems that 
want to track their progress and save progress state, and I’m sure it will 
have many other uses as well.
Calling refresh() unnecessarily slows the client down, and it also counts 
against your quota in services that enforce rate limits.

I know some implementations currently don’t call refresh() after a commit 
while others do, and the interface doesn’t enforce either behavior. In both 
cases the wanted information is already available in the client after the 
commit, because the client itself produced it.
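
To make the extra round trip concrete, here is a rough sketch of the commit-then-refresh pattern. FakeCatalog and FakeTable are hypothetical stand-ins I made up for illustration; they are not Iceberg classes, and the request counter only simulates calls to a catalog service.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; these are NOT Iceberg classes.
public class RefreshAfterCommit {

  /** Pretend catalog service that counts every request it receives. */
  static final class FakeCatalog {
    final AtomicInteger requests = new AtomicInteger();
    private long currentSnapshotId = 0L;

    /** Commits a new snapshot and returns its id, which the committer already knows. */
    long commitNewSnapshot() {
      requests.incrementAndGet();
      currentSnapshotId += 1;
      return currentSnapshotId;
    }

    /** Loads the current snapshot id; this is the extra round trip. */
    long loadCurrentSnapshotId() {
      requests.incrementAndGet();
      return currentSnapshotId;
    }
  }

  /** Pretend table handle whose commit() returns nothing to the caller. */
  static final class FakeTable {
    private final FakeCatalog catalog;
    private long cachedSnapshotId = 0L;

    FakeTable(FakeCatalog catalog) {
      this.catalog = catalog;
    }

    void commit() {
      catalog.commitNewSnapshot(); // the new snapshot id is discarded here
    }

    void refresh() {
      cachedSnapshotId = catalog.loadCurrentSnapshotId();
    }

    long currentSnapshotId() {
      return cachedSnapshotId;
    }
  }

  public static void main(String[] args) {
    FakeCatalog catalog = new FakeCatalog();
    FakeTable table = new FakeTable(catalog);

    table.commit();
    table.refresh(); // needed only to learn what commit() already knew

    System.out.println("snapshot id: " + table.currentSnapshotId());
    System.out.println("catalog requests: " + catalog.requests.get());
  }
}
```

In this toy model every commit costs two catalog requests, although the second one only fetches a value the commit path had in hand.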

________________________________
Key points/concerns raised

  *   API compatibility breakage: Several folks pointed out that returning 
the updated snapshot or metadata from commit() changes the existing API 
contract. We can resolve this by adding a new method instead.
  *   What counts as a snapshot: Some committed operations don’t produce 
snapshots, e.g. metadata operations (schema changes, property updates). The 
distinction between operations that produce snapshots and ones that just update 
metadata matters. Perhaps always returning the TableMetadata, as mentioned 
below, is a good solution.
  *   Behavior varies by catalog implementation: Some implementations already 
refresh automatically (shouldRefresh etc.), others don’t. RestCatalog vs 
Metastore vs others behave differently.

________________________________
Proposal / Possible compromises
From the discussion, here are options that seem promising to me, or ways to 
mitigate the drawbacks:

  1.  Add a new method, e.g. commitWithResult(...)
This method would commit and return the updated snapshot / metadata, but leave 
the existing commit() method with its current behavior. That way we retain 
backward compatibility.
  2.  Return a read-only metadata snapshot
If returning the full metadata object is too heavy or too risky, return a 
minimal “read-only” summary containing just what is needed (snapshotId, maybe 
timestamp). This reduces implementation risk. See 
GitHub<https://github.com/apache/iceberg/pull/14023/files#diff-c941602822c0e1c24d7de4ef5db76105d414d7b3f0b26df7ca75e76ba79e9663>.
This is also helpful if we want to avoid adding a new generic type argument to 
the SnapshotUpdate interface.
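
A minimal sketch of how the two options could fit together, under my own assumptions. Every name here (CommitResult, the PendingUpdate stand-in, FakeAppend, FakePropertyUpdate, LegacyUpdate) is hypothetical and simplified; this is not the code in either PR. Making commitWithResult() a default method that throws is just one possible way to keep existing implementors compiling.

```java
import java.util.Optional;

// Hypothetical names throughout; a sketch of the proposal, not the Iceberg API.
public class CommitWithResultSketch {

  /** Minimal read-only summary of a commit (option 2). */
  static final class CommitResult {
    private final Long snapshotId; // null when the commit produced no snapshot
    private final long timestampMillis;

    CommitResult(Long snapshotId, long timestampMillis) {
      this.snapshotId = snapshotId;
      this.timestampMillis = timestampMillis;
    }

    /** Empty for metadata-only commits such as property updates. */
    Optional<Long> snapshotId() {
      return Optional.ofNullable(snapshotId);
    }

    long timestampMillis() {
      return timestampMillis;
    }
  }

  /** Simplified stand-in for an update interface with a void commit(). */
  interface PendingUpdate {
    void commit(); // existing behavior stays untouched

    // Option 1: a new method next to commit(). The throwing default is an
    // assumption of this sketch so that older implementations still compile.
    default CommitResult commitWithResult() {
      throw new UnsupportedOperationException("commitWithResult is not supported");
    }
  }

  /** Pretend data operation that produces a snapshot. */
  static final class FakeAppend implements PendingUpdate {
    private final long newSnapshotId;

    FakeAppend(long newSnapshotId) {
      this.newSnapshotId = newSnapshotId;
    }

    @Override
    public void commit() {
      commitWithResult(); // same work, result ignored
    }

    @Override
    public CommitResult commitWithResult() {
      return new CommitResult(newSnapshotId, System.currentTimeMillis());
    }
  }

  /** Pretend metadata-only operation: commits, but creates no snapshot. */
  static final class FakePropertyUpdate implements PendingUpdate {
    @Override
    public void commit() {
      commitWithResult();
    }

    @Override
    public CommitResult commitWithResult() {
      return new CommitResult(null, System.currentTimeMillis());
    }
  }

  /** Pretend older implementation that only knows commit(). */
  static final class LegacyUpdate implements PendingUpdate {
    @Override
    public void commit() {
      // nothing to do in this sketch
    }
  }

  public static void main(String[] args) {
    CommitResult append = new FakeAppend(42L).commitWithResult();
    CommitResult props = new FakePropertyUpdate().commitWithResult();

    System.out.println("append produced a snapshot: " + append.snapshotId().isPresent());
    System.out.println("property update produced a snapshot: " + props.snapshotId().isPresent());
  }
}
```

The Optional snapshot id is one way to handle the "what counts as a snapshot" concern: callers can tell a metadata-only commit apart without a refresh(), and the summary type avoids threading a new type parameter through the update interfaces.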

Please let me know what you think about this suggestion and how we can move it 
forward.

Thanks,
Jason

