Hi Team,

I'd like to initiate a discussion on a feature that appears to be valuable
for Flink + Iceberg users. (Related issue: #14662
<https://github.com/apache/iceberg/issues/14662>)

Currently, Iceberg's FlinkSink offers a set of data statistics
<https://github.com/apache/iceberg/blob/main/flink/v2.1/flink/src/main/java/org/apache/iceberg/flink/sink/CommitSummary.java#L33-L38>
in snapshot summary. However, there is no mechanism for application-level
code to populate *custom/application-defined* statistics into Iceberg
snapshot properties at commit time. An example use case:

A Flink job computes the event-time boundaries for data ingested in each
checkpoint (min/max event time in that batch) and aims to include this
information in the snapshot summary, alongside the built-in statistics. The
snapshot summary is a natural place for such metadata, since the statistics
directly describe the data in that specific snapshot and belong with the
snapshot itself. At the same time, the logic for computing these statistics
is application-specific, making it difficult to handle entirely within the
Iceberg framework.

We've explored workarounds (such as static variables and external store,
see this PR <https://github.com/apache/iceberg/pull/14594>) to pass these
values to the committer, but these approaches are either not robust or add
unnecessary complexity.

Is there a recommended Flink-native approach for allowing applications to
propagate custom, per-checkpoint metadata from Flink operators to the
Iceberg committer to write to snapshot summary? If not, would the community
be interested in supporting such a feature?

Any guidance or pointers to related work would be appreciated. We’re also
happy to contribute if this aligns with project goals.

Thanks,
Owen

Reply via email to