+1, excited for this one too. We've seen the current metrics maps blow up memory and hope this can improve that.
On the Geo front, this could allow us to add supplementary metrics that don't conform to the geo type, like S2 Cell Ids.

Thanks
Szehon

On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:

> Hey everyone,
>
> I'm starting a thread to connect folks interested in improving the
> existing way of collecting column-level statistics (often referred to as
> *metrics* in the code). I've already started a proposal, which can be
> found at https://s.apache.org/iceberg-column-stats.
>
> *Motivation*
>
> Column statistics are currently stored as a mapping of field id to values
> across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
> This storage model has critical limitations as the number of columns
> increases and as new types are being added to Iceberg:
>
> - Inefficient Storage due to map-based structure:
>   - Large memory overhead during planning/processing
>   - Inability to project specific stats (e.g., only null_value_counts
>     for column X)
> - Type Erasure: Original logical/physical types are lost when stored as
>   binary blobs, causing:
>   - Lossy type inference during reads
>   - Schema evolution challenges (e.g., widening types)
> - Rigid Schema: Stats are tied to the data_file entry record, limiting
>   extensibility for new stats.
>
> *Goals*
>
> Improve the column stats representation to allow for the following:
>
> - Projectability: Enable independent access to specific stats (e.g.,
>   lower_bounds without loading upper_bounds).
> - Type Preservation: Store original data types to support accurate reads
>   and schema evolution.
> - Flexible/Extensible Representation: Allow per-field stats structures
>   (e.g., complex types like Geo/Variant).
>
> Thanks
> Eduard