> Together with the change which allows storing metadata in columnar formats

+1 on this. I think it does not make sense to stick to Avro for manifest
files if we break column stats into sub-fields.

On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> I would love to see more flexibility in file stats. Together with the
> change which allows storing metadata in columnar formats, this will open up
> many new possibilities: Bloom filters in metadata which could be used for
> filtering out files, HLL sketches, etc.
>
> +1 for the change
>
> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> +1, excited for this one too. We've seen the current metrics maps blow
>> up memory, and we hope this can improve that.
>>
>> On the Geo front, this could allow us to add supplementary metrics that
>> don't conform to the geo type, like S2 Cell Ids.
>>
>> Thanks
>> Szehon
>>
>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I'm starting a thread to connect folks interested in improving the
>>> existing way of collecting column-level statistics (often referred to as
>>> *metrics* in the code). I've already started a proposal, which can be
>>> found at https://s.apache.org/iceberg-column-stats.
>>>
>>> *Motivation*
>>>
>>> Column statistics are currently stored as a mapping of field id to
>>> values across multiple columns (lower/upper bounds, value/nan/null
>>> counts, sizes). This storage model has critical limitations as the
>>> number of columns grows and as new types are added to Iceberg:
>>>
>>>    - Inefficient Storage due to map-based structure:
>>>       - Large memory overhead during planning/processing
>>>       - Inability to project specific stats (e.g., only
>>>         null_value_counts for column X)
>>>    - Type Erasure: Original logical/physical types are lost when stored
>>>      as binary blobs, causing:
>>>       - Lossy type inference during reads
>>>       - Schema evolution challenges (e.g., widening types)
>>>    - Rigid Schema: Stats are tied to the data_file entry record,
>>>      limiting extensibility for new stats.
>>>
>>> *Goals*
>>>
>>> Improve the column stats representation to allow for the following:
>>>
>>>    - Projectability: Enable independent access to specific stats (e.g.,
>>>      lower_bounds without loading upper_bounds).
>>>    - Type Preservation: Store original data types to support accurate
>>>      reads and schema evolution.
>>>    - Flexible/Extensible Representation: Allow per-field stats
>>>      structures (e.g., complex types like Geo/Variant).
>>>
>>>
>>> Thanks
>>> Eduard
>>>
>>
