Re: [DISCUSS] v4 - Improved column statistics

Eduard Tudenhöfner Fri, 19 Sep 2025 01:53:34 -0700

Hey everyone,

I have updated the proposal
<https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0#heading=h.hs6r9d26w1y2>
with the following changes:


   - removed *column_size*, since this hasn't been used anywhere in earlier
   versions. Please shout if you think we should keep this going forward.
   - added *avg_value_size* and *max_value_size* for avg/max value sizes of
   variable-length types (string/binary)
   - the examples in the proposal were using *1_417_000_000* as the
   starting stats ID for the reserved field ID space, but that should have
   been *2_147_000_000*. We have 200 reserved IDs * 200 stats types = 40k
   stats IDs, and starting at *2_147_000_000* leaves enough room in case we
   decide to add other ID spaces
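
As a quick sanity check of the ID arithmetic in the last point, here is a
minimal Python sketch. The constant names are made up for illustration and
do not come from the proposal, and it assumes field IDs are signed 32-bit
integers, so the upper bound is 2**31 - 1.

```python
# Values taken from the bullet above.
RESERVED_STATS_ID_START = 2_147_000_000
RESERVED_FIELD_IDS = 200       # number of reserved field IDs
STATS_TYPES_PER_FIELD = 200    # stats IDs set aside per field

# Assumption: field IDs are signed 32-bit integers.
INT32_MAX = 2**31 - 1

# Total stats IDs needed for the reserved field ID space.
ids_needed = RESERVED_FIELD_IDS * STATS_TYPES_PER_FIELD

# Last stats ID in the range if it is laid out contiguously from the start.
last_stats_id = RESERVED_STATS_ID_START + ids_needed - 1

# IDs left between the end of the range and the 32-bit upper bound.
headroom = INT32_MAX - last_stats_id

print(ids_needed, last_stats_id, headroom)
```

The range only needs 40k IDs, so a contiguous layout starting at
2_147_000_000 stays well below the 32-bit limit.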

If people are ok with this, then I think we should be able to vote on the
design proposal so that we can get the first portions of the code
<https://github.com/apache/iceberg/pull/13933> in, which would allow
parallelizing downstream work on this.


Thanks
Eduard

On Wed, Aug 20, 2025 at 3:05 PM Eduard Tudenhöfner <[email protected]>
wrote:

> Hey everyone,
>
> We met yesterday and talked about some details around the stats proposal.
>
> Please find the notes here
> <https://docs.google.com/document/d/1ZK5g8_bA1Y9SQ4UA5jAREX9iNX56xLWA5vAuKpQC4L8/edit?usp=sharing>
> and the recording here
> <https://drive.google.com/file/d/1YIILCIhDbgu3OYlMn5KNChsYFP8rGPPX/view?usp=sharing>
> .
>
> I have updated the proposal <https://s.apache.org/iceberg-column-stats>
> with the following points:
>
>    - added a table schema example with a detailed stats schema
>    - updated wording to make it clear that projection is always by ID and
>    the field name of a stats field should not be relied on
>    - added a table that defines current field stats types with their
>    respective offsets from the field ID of the base stats struct
>    - updated wording to make it clear that stats are calculated for
>    assigned field IDs that are
>       - defined in the table ID space (Amogh is working on a separate
>       proposal to unify ID spaces)
>       - defined in the reserved field ID
>       <https://iceberg.apache.org/spec/#reserved-field-ids> space
>    - added some examples showing table ID -> stats ID of stats struct and
>    also the stats ID of individual stats fields
>    - updated wording to explain how variant stats would look in the new
>    stats structure
>    - updated wording to make it clear that custom stats are not supported
>    and that expressions are the preferred way
>
> Please let me know in case I missed anything else to include.
>
> Thanks everyone for participating,
>
> Eduard
>
>
>
