> Together with the change which allows storing metadata in columnar formats
+1 on this. I think it does not make sense to keep manifest files tied to Avro if we break column stats into sub-fields.

On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> I would love to see more flexibility in file stats. Together with the
> change which allows storing metadata in columnar formats, this will open
> up many new possibilities: Bloom filters in metadata which could be used
> for filtering out files, HLL sketches, etc.
>
> +1 for the change
>
> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> +1, excited for this one too. We've seen the current metrics maps blow
>> up the memory and hope this can improve that.
>>
>> On the Geo front, this could allow us to add supplementary metrics that
>> don't conform to the geo type, like S2 Cell Ids.
>>
>> Thanks
>> Szehon
>>
>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I'm starting a thread to connect folks interested in improving the
>>> existing way of collecting column-level statistics (often referred to as
>>> *metrics* in the code). I've already started a proposal, which can be
>>> found at https://s.apache.org/iceberg-column-stats.
>>>
>>> *Motivation*
>>>
>>> Column statistics are currently stored as a mapping of field id to
>>> values across multiple columns (lower/upper bounds, value/nan/null
>>> counts, sizes). This storage model has critical limitations as the
>>> number of columns increases and as new types are being added to Iceberg:
>>>
>>>    - Inefficient Storage due to map-based structure:
>>>       - Large memory overhead during planning/processing
>>>       - Inability to project specific stats (e.g., only
>>>       null_value_counts for column X)
>>>    - Type Erasure: Original logical/physical types are lost when stored
>>>    as binary blobs, causing:
>>>       - Lossy type inference during reads
>>>       - Schema evolution challenges (e.g., widening types)
>>>    - Rigid Schema: Stats are tied to the data_file entry record,
>>>    limiting extensibility for new stats.
>>>
>>> *Goals*
>>>
>>> Improve the column stats representation to allow for the following:
>>>
>>>    - Projectability: Enable independent access to specific stats (e.g.,
>>>    lower_bounds without loading upper_bounds).
>>>    - Type Preservation: Store original data types to support accurate
>>>    reads and schema evolution.
>>>    - Flexible/Extensible Representation: Allow per-field stats
>>>    structures (e.g., complex types like Geo/Variant).
>>>
>>> Thanks
>>> Eduard
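
To make the quoted motivation concrete, here is a rough Python sketch contrasting the map-per-statistic shape with a per-field struct. It is only an illustration and not the layout from the proposal: `FieldStats`, `decode_bound`, the sample values, and the assumption that an int bound is encoded as 4 little-endian bytes are all hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import Any, Dict

# Simplified map-based shape: one map per statistic, keyed by field id.
# Bounds are opaque bytes (assumed here: 4-byte little-endian for an int column).
map_based_stats: Dict[str, Dict[int, Any]] = {
    "value_counts": {1: 100, 2: 100},
    "null_value_counts": {1: 0, 2: 7},
    "lower_bounds": {1: struct.pack("<i", 5), 2: b"apple"},
    "upper_bounds": {1: struct.pack("<i", 42), 2: b"pear"},
}


def decode_bound(raw: bytes, type_name: str) -> Any:
    """Decode a binary bound; the caller must already know the column type."""
    if type_name == "int":
        return struct.unpack("<i", raw)[0]
    if type_name == "long":
        return struct.unpack("<q", raw)[0]
    if type_name == "string":
        return raw.decode("utf-8")
    raise ValueError(f"unsupported type: {type_name}")


@dataclass(frozen=True)
class FieldStats:
    """Hypothetical per-field struct: typed bounds, one record per column."""
    value_count: int
    null_value_count: int
    lower_bound: Any
    upper_bound: Any


# Per-field shape: each column carries its own stats with the values already typed.
typed_stats: Dict[int, FieldStats] = {
    1: FieldStats(100, 0, 5, 42),
    2: FieldStats(100, 7, "apple", "pear"),
}

# Map-based: the null count for column 2 comes out of a map covering every column.
null_count_from_maps = map_based_stats["null_value_counts"][2]

# Per-field: only the one sub-field of the one column is touched.
null_count_from_struct = typed_stats[2].null_value_count

# Type erasure: the bytes decode correctly only if the reader guesses the type.
lower_as_int = decode_bound(map_based_stats["lower_bounds"][1], "int")
try:
    decode_bound(map_based_stats["lower_bounds"][1], "long")
    widened_ok = True
except struct.error:
    # 4 bytes cannot be read as an 8-byte long without extra type information.
    widened_ok = False
```

In a columnar manifest format, a per-field layout like this is what would let a reader project a single sub-field (for example only the null counts) instead of deserializing every map for every column.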