+1 Looking forward to this feature

John Zhuge


On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:

> > I think it does not make sense to stick to Avro for manifest files if we
> > break column stats into sub fields.
>
> This isn't necessarily true. Avro can benefit from better pushdown with
> Eduard's approach as well by being able to skip more efficiently. With the
> current layout, Avro stores a list of key/value pairs that are all
> projected and put into a map. We avoid decoding the values, but each field
> ID is decoded, then the length of the value is decoded, and finally there
> is a put operation with an ID and value ByteBuffer pair. With the new
> approach, we will be able to know which fields are relevant and skip
> unprojected fields based on the file schema, which we couldn't do before.
>
> To skip stats for an unused field (not part of the filter), there are two
> cases. Lower/upper bounds for types that are fixed width are skipped by
> updating the read position. And bounds for types that are variable length
> (strings and binary) are skipped by reading the length and skipping that
> number of bytes.
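The two skip cases described above can be sketched roughly as follows. This is a simplified illustration, not Iceberg's actual Avro decoder; the 4-byte length prefix and the buffer layout are assumptions for the example:

```java
import java.nio.ByteBuffer;

public class StatsSkipSketch {
    // Skip a fixed-width bound (e.g., long/double) by just advancing the position.
    static void skipFixed(ByteBuffer buf, int width) {
        buf.position(buf.position() + width);
    }

    // Skip a variable-length bound (string/binary): read the length prefix,
    // then advance past that many bytes without decoding them.
    static void skipVariable(ByteBuffer buf) {
        int len = buf.getInt(); // assumed 4-byte length prefix
        buf.position(buf.position() + len);
    }

    public static void main(String[] args) {
        // Assumed layout: [8-byte long bound][4-byte len][len bytes][4-byte int bound]
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putLong(42L);
        byte[] s = "abc".getBytes();
        buf.putInt(s.length);
        buf.put(s);
        buf.putInt(7);
        buf.flip();

        skipFixed(buf, 8);  // skip the fixed-width bound (unprojected field)
        skipVariable(buf);  // skip the variable-length bound via its length prefix
        System.out.println(buf.getInt()); // only the projected field is decoded
    }
}
```

The point is that neither skipped field is materialized or put into a map; only the read position moves.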
>
> It turns out that actually producing the metric maps is a fairly expensive
> operation, so being able to skip metrics more quickly even if the bytes
> still have to be read is going to save time. That said, using a columnar
> format is still going to be a good idea!
>
> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>
>> > Together with the change which allows storing metadata in columnar
>> formats
>>
>> +1 on this. I think it does not make sense to stick to Avro for manifest
>> files if we break column stats into sub fields.
>>
>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I would love to see more flexibility in file stats. Together with the
>>> change which allows storing metadata in columnar formats, this will open up
>>> many new possibilities: Bloom filters in metadata which could be used for
>>> filtering out files, HLL sketches, etc.
>>>
>>> +1 for the change
>>>
>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>
>>>> +1, excited for this one too; we've seen the current metrics maps blow
>>>> up memory and hope we can improve that.
>>>>
>>>> On the Geo front, this could allow us to add supplementary metrics that
>>>> don't conform to the geo type, like S2 Cell Ids.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>>>> etudenhoef...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I'm starting a thread to connect folks interested in improving the
>>>>> existing way of collecting column-level statistics (often referred to as
>>>>> *metrics* in the code). I've already started a proposal, which can be
>>>>> found at https://s.apache.org/iceberg-column-stats.
>>>>>
>>>>> *Motivation*
>>>>>
>>>>> Column statistics are currently stored as a mapping of field id to
>>>>> values across multiple columns (lower/upper bounds, value/nan/null
>>>>> counts, sizes). This storage model has critical limitations as the
>>>>> number of columns increases and as new types are being added to Iceberg:
>>>>>
>>>>>    -
>>>>>
>>>>>    Inefficient Storage due to map-based structure:
>>>>>    -
>>>>>
>>>>>       Large memory overhead during planning/processing
>>>>>       -
>>>>>
>>>>>       Inability to project specific stats (e.g., only
>>>>>       null_value_counts for column X)
>>>>>       -
>>>>>
>>>>>    Type Erasure: Original logical/physical types are lost when stored
>>>>>    as binary blobs, causing:
>>>>>    -
>>>>>
>>>>>       Lossy type inference during reads
>>>>>       - Schema evolution challenges (e.g., widening types)
>>>>>    - Rigid Schema: Stats are tied to the data_fil entry record,
>>>>>    limiting extensibility for new stats.
>>>>>
>>>>>
>>>>> *Goals*
>>>>>
>>>>> Improve the column stats representation to allow for the following:
>>>>>
>>>>>    -
>>>>>
>>>>>    Projectability: Enable independent access to specific stats (e.g.,
>>>>>    lower_bounds without loading upper_bounds).
>>>>>    -
>>>>>
>>>>>    Type Preservation: Store original data types to support accurate
>>>>>    reads and schema evolution.
>>>>>    -
>>>>>
>>>>>    Flexible/Extensible Representation: Allow per-field stats
>>>>>    structures (e.g., complex types like Geo/Variant).
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>> Eduard
>>>>>
>>>>
