Re: [DISCUSS] v4 - Improved column statistics

Eduard Tudenhöfner Thu, 24 Jul 2025 04:02:17 -0700

>    1. The current proposal only leaves 10000+200 ids for other columns
>    than stats. If in the future, we find some other feature which would
>    require a manifest file column for every data column in the table, then we
>    would need to change the spec.
>
> For this I think we could start at *100,000* so that we use *100,000 +
> 200 * <fieldID>* to calculate the field ID of a given statistic.
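
A rough sketch of that arithmetic in Python. The function name and the `stat_offset` parameter are hypothetical: the thread gives only the base and the 200-wide stride, not how individual statistics are numbered within a column's block.

```python
STATS_BASE = 100_000     # starting ID suggested in this thread
STATS_PER_FIELD = 200    # ID slots set aside for each data column


def stat_field_id(field_id: int, stat_offset: int = 0) -> int:
    """Return the ID of one statistic belonging to a data column.

    stat_offset is an assumed index in [0, 200) selecting the statistic;
    the thread does not define this numbering.
    """
    if field_id < 0:
        raise ValueError("field_id must be non-negative")
    if not 0 <= stat_offset < STATS_PER_FIELD:
        raise ValueError("stat_offset must be in [0, 200)")
    return STATS_BASE + STATS_PER_FIELD * field_id + stat_offset
```

Under these assumptions each column owns a contiguous block of 200 IDs, so blocks of neighbouring field IDs never collide.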



>
>    1. The current proposal expects every engine to share the same stats,
>    and not store any "non-standard" stat in the metadata.
>
> We haven't explicitly stated it in the proposal but there were discussions
> on how to potentially support this and what implications it brings for
> readers/writers.


> I'm still not clear on what the proposal is to handle stats for reserved
> columns <https://iceberg.apache.org/spec/#reserved-field-ids> [1] (I
> think there was some mention in the notes but it was light on details). It
> seems like it would be potentially useful to have stats for things like
> _row_id, and the multiplication would overflow for these column IDs (maybe
> this still yields unique column IDs though?)
>

To handle stats for reserved columns we could start at *2,147,000,000*
which should give us enough room to store 200 stats per metadata ID. We
would also ensure that those ID ranges for table columns and reserved
columns wouldn't overlap.
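
One way to sanity-check the two points above (overflow and non-overlap), sketched in Python. It assumes field IDs are signed 32-bit integers and that table-column stats use the *100,000 + 200 * <fieldID>* scheme from earlier in this thread. All function names are hypothetical, and the reserved-range starting value is passed in as a parameter rather than fixed.

```python
INT32_MAX = 2**31 - 1
STATS_BASE = 100_000     # start of the table-column stats range
STATS_PER_FIELD = 200    # ID slots per column


def table_stats_range(field_id: int) -> tuple[int, int]:
    """First and last stat ID (inclusive) in a table column's block."""
    start = STATS_BASE + STATS_PER_FIELD * field_id
    return start, start + STATS_PER_FIELD - 1


def fits_in_int32(field_id: int) -> bool:
    """True if the column's whole stats block stays within a signed int32."""
    return table_stats_range(field_id)[1] <= INT32_MAX


def max_table_field_id(reserved_base: int) -> int:
    """Largest field ID whose stats block ends strictly below reserved_base."""
    return (reserved_base - STATS_BASE) // STATS_PER_FIELD - 1
```

`fits_in_int32` returns False for field IDs near the top of the 32-bit range, which is the overflow raised in the quoted question. `max_table_field_id` gives the highest table column ID that can be used before its stats block would run into a reserved range starting at `reserved_base`.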


> I assume we could put whatever these columns are under stats? Maybe we just
> need a more generic name for the top level struct?


I haven't updated the proposal yet, but I think renaming *column_stats* to
*content_stats* would make sense.
