+1 Looking forward to this feature

John Zhuge

On Thu, Jun 5, 2025 at 2:22 PM Ryan Blue <rdb...@gmail.com> wrote:

> > I think it does not make sense to stick manifest files to Avro if we
> > break column stats into sub fields.
>
> This isn't necessarily true. Avro can benefit from better pushdown with
> Eduard's approach as well by being able to skip more efficiently. With the
> current layout, Avro stores a list of key/value pairs that are all
> projected and put into a map. We avoid decoding the values, but each field
> ID is decoded, then the length of the value is decoded, and finally there
> is a put operation with an ID and value ByteBuffer pair. With the new
> approach, we will be able to know which fields are relevant and skip
> unprojected fields based on the file schema, which we couldn't do before.
>
> To skip stats for an unused field (not part of the filter), there are two
> cases. Lower/upper bounds for types that are fixed width are skipped by
> updating the read position. And bounds for types that are variable length
> (strings and binary) are skipped by reading the length and skipping that
> number of bytes.
>
> It turns out that actually producing the metric maps is a fairly expensive
> operation, so being able to skip metrics more quickly even if the bytes
> still have to be read is going to save time. That said, using a columnar
> format is still going to be a good idea!
>
> On Wed, Jun 4, 2025 at 11:22 PM Gang Wu <ust...@gmail.com> wrote:
>
>> > Together with the change which allows storing metadata in columnar
>> > formats
>>
>> +1 on this. I think it does not make sense to stick manifest files to
>> Avro if we break column stats into sub fields.
>>
>> On Tue, Jun 3, 2025 at 7:19 PM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> I would love to see more flexibility in file stats. Together with the
>>> change which allows storing metadata in columnar formats, this will open
>>> up many new possibilities.
>>> Bloom filters in metadata which could be used for
>>> filtering out files, HLL sketches etc....
>>>
>>> +1 for the change
>>>
>>> On Tue, Jun 3, 2025, 08:12 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>
>>>> +1, excited for this one too, we've seen the current metrics maps blow
>>>> up the memory and hope this can improve that.
>>>>
>>>> On the Geo front, this could allow us to add supplementary metrics that
>>>> don't conform to the geo type, like S2 Cell Ids.
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>> On Mon, Jun 2, 2025 at 6:14 AM Eduard Tudenhöfner <
>>>> etudenhoef...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I'm starting a thread to connect folks interested in improving the
>>>>> existing way of collecting column-level statistics (often referred to as
>>>>> *metrics* in the code). I've already started a proposal, which can be
>>>>> found at https://s.apache.org/iceberg-column-stats.
>>>>>
>>>>> *Motivation*
>>>>>
>>>>> Column statistics are currently stored as a mapping of field id to
>>>>> values across multiple columns (lower/upper bounds, value/nan/null
>>>>> counts, sizes). This storage model has critical limitations as the
>>>>> number of columns increases and as new types are being added to Iceberg:
>>>>>
>>>>>    - Inefficient Storage due to map-based structure:
>>>>>       - Large memory overhead during planning/processing
>>>>>       - Inability to project specific stats (e.g., only
>>>>>       null_value_counts for column X)
>>>>>    - Type Erasure: Original logical/physical types are lost when stored
>>>>>    as binary blobs, causing:
>>>>>       - Lossy type inference during reads
>>>>>       - Schema evolution challenges (e.g., widening types)
>>>>>    - Rigid Schema: Stats are tied to the data_file entry record,
>>>>>    limiting extensibility for new stats.
>>>>>
>>>>> *Goals*
>>>>>
>>>>> Improve the column stats representation to allow for the following:
>>>>>
>>>>>    - Projectability: Enable independent access to specific stats (e.g.,
>>>>>    lower_bounds without loading upper_bounds).
>>>>>    - Type Preservation: Store original data types to support accurate
>>>>>    reads and schema evolution.
>>>>>    - Flexible/Extensible Representation: Allow per-field stats
>>>>>    structures (e.g., complex types like Geo/Variant).
>>>>>
>>>>> Thanks
>>>>> Eduard