Re: [DISCUSS] v4 - Improved column statistics

Micah Kornfield Thu, 24 Jul 2025 12:02:12 -0700

After thinking about it some more, my current view is that we should
proceed with something as simple as possible for V4 (I tried to formalize
what I think the proposed algorithm is in the original proposal doc
<https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0>
[1]).
If in the course of V4 development we find some flaw with the simple
approach (e.g. we run out of space), we can revise it.  If something comes
up after V4, we are not talking about a lot of code either way, so having a
new scheme for V5+ would not be a major burden (all manifests are now
written with a spec version, so detection is easy).


Given the potential complications with custom stats, I think it is
reasonable to allow implementations that want custom stats to use the upper
end of the reserved offset range (e.g. we have 6 reserved out of 200 today;
if implementations really need custom stats then they can start using
offset 199, and then 198, etc).  This poses a low risk of overlap in the
short term, and I assume those using custom stats would have tight control
over their environment anyway, so they have the ability to manage
conflicts and compactions in a way that suits them.
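
To make the offset scheme concrete, here is a minimal Python sketch. It
assumes the stats field ID is computed as base + 200 * fieldID + offset
(base 10,000 in the current proposal, 100,000 in Eduard's suggestion quoted
below), that the 6 spec-defined stats occupy offsets 0-5, and that IDs must
fit in a signed 32-bit int. The function and constant names are made up for
illustration and are not taken from the proposal doc.

```python
INT32_MAX = 2**31 - 1       # assumption: stats field IDs must fit a signed 32-bit int
OFFSETS_PER_FIELD = 200     # offsets set aside per column in the proposal
SPEC_OFFSETS_IN_USE = 6     # "6 reserved out of 200 today" (assumed to be offsets 0..5)


def stat_field_id(field_id, offset, base=10_000):
    """Return base + 200 * field_id + offset, rejecting out-of-range results."""
    if not 0 <= offset < OFFSETS_PER_FIELD:
        raise ValueError(f"offset {offset} is outside 0..{OFFSETS_PER_FIELD - 1}")
    result = base + OFFSETS_PER_FIELD * field_id + offset
    if result > INT32_MAX:
        # Python ints never overflow, so the 32-bit limit is checked explicitly.
        raise OverflowError(f"stats ID {result} does not fit in 32 bits")
    return result


def custom_offsets(count):
    """Hand out `count` custom-stat offsets from the top of the range: 199, 198, ..."""
    lowest = OFFSETS_PER_FIELD - count
    if count < 0 or lowest < SPEC_OFFSETS_IN_USE:
        raise ValueError("count must be >= 0 and leave the spec-defined offsets untouched")
    return list(range(OFFSETS_PER_FIELD - 1, lowest - 1, -1))


print(stat_field_id(1, 0))                     # first stat of field 1, base 10,000
print(stat_field_id(3, 199, base=100_000))     # custom stat of field 3, base 100,000
print(custom_offsets(2))                       # the two highest offsets
```

Passing a reserved field ID close to the 32-bit maximum to stat_field_id
raises OverflowError, which is the overflow concern raised for columns like
_row_id in the quoted discussion.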


> Another thing that both Russel and Ryan brought up is being able to track
> stats for sort orders or expressions, but they don't share an id space with
> field ids.


Slightly off topic, but is there a reason we can't unify the field ID range
for V4?

> I feel like it would be better to work to formalize the stats so that they
> are known and easier to project, but it's also hard to get agreement for
> more complicated stats (like collations that have very specific character
> set handling), but I think using expressions in lieu of custom stats might
> address all of these cases and would be more straightforward for the
> copy-forward requirement.


Also off topic, but doesn't this just shift the burden of
standardization to expressions?  This might be controversial, but maybe
the bar for adding a new stat type should be relatively low?  They are
optional anyway, so we could define some stats as core (implementations
are incomplete if they can't produce them) and others as non-core (not
required for implementations; there can be optional configuration to either
block writes that require producing the stats or just drop them).
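
As a rough illustration of the core / non-core split, the Python sketch
below shows how a writer might decide which stats to produce. Everything in
it is hypothetical: the policy enum, the function, and the stat names
("lower_bound", "ndv_sketch") are placeholders, not part of any proposal.

```python
from enum import Enum


class UnsupportedStatPolicy(Enum):
    """Hypothetical writer setting for non-core stats it cannot produce."""
    BLOCK_WRITE = "block-write"
    DROP = "drop"


def stats_to_write(requested, supported, core, policy):
    """Return the set of stats the writer will produce for this write."""
    # A writer that cannot produce a core stat is treated as incomplete.
    missing_core = (requested & core) - supported
    if missing_core:
        raise NotImplementedError(f"missing core stats: {sorted(missing_core)}")
    # Non-core stats are optional: either refuse the write or drop them.
    missing_optional = (requested - core) - supported
    if missing_optional and policy is UnsupportedStatPolicy.BLOCK_WRITE:
        raise RuntimeError(f"cannot produce: {sorted(missing_optional)}")
    return requested & supported


chosen = stats_to_write(
    requested={"lower_bound", "ndv_sketch"},
    supported={"lower_bound"},
    core={"lower_bound"},
    policy=UnsupportedStatPolicy.DROP,
)
print(chosen)
```

With the DROP policy the unsupported non-core stat is silently omitted; with
BLOCK_WRITE the same call raises instead.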

[1]
https://docs.google.com/document/d/1uvbrwwAJW2TgsnoaIcwAFpjbhHkBUL5wY_24nKgtt9I/edit?tab=t.0



On Thu, Jul 24, 2025 at 11:19 AM Daniel Weeks <dwe...@apache.org> wrote:

>> The current proposal only leaves 10000+200 ids for other columns than
>> stats. If in the future, we find some other feature which would require a
>> manifest file column for every data column in the table, then we would need
>> to change the spec.
>
>
> I do think we might want to put an upper bound on the column stats.  Ryan
> calculated the upper bound of what can be represented, but I don't think we
> need to accommodate 10m+ field ids and that would block the entire id
> range.  It might make more sense to simply put an upper bound on the stats
> space (e.g. 100k or 1m fields?).  This would leave plenty of space for
> future evolution of the spec without having to redefine the stats range.
>
> Another thing that both Russel and Ryan brought up is being able to track
> stats for sort orders or expressions, but they don't share an id space with
> field ids.  We might want to decide what the full stats space should look
> like.  For example:
>
> 8k+ sort orders
> 9k+ expressions
> 10k+ field ids
> 1m+ <unreserved>
> MAX_VALUE - 200 <reserved per spec>
>
> Since sort orders and expressions have much lower cardinality than field
> ids, we can probably have a more constrained range.
>
> I'm leaning against custom stats because it does increase complexity for
> all writers as Micah mentioned and introduces the potential for id space
> collision.  It would also easily compromise the performance of engines if
> other writers drop them (via compaction or just any metadata rewrite
> operation).  I feel like it would be better to work to formalize the stats
> so that they are known and easier to project, but it's also hard to get
> agreement for more complicated stats (like collations that have very
> specific character set handling), but I think using expressions in lieu of
> custom stats might address all of these cases and would be more
> straightforward for the copy-forward requirement.
>
> -Dan
>
>
>
> On Thu, Jul 24, 2025 at 4:03 AM Eduard Tudenhöfner
> <eduard.tudenhoef...@databricks.com.invalid> wrote:
>
>>
>>
>>
>>>    1. The current proposal only leaves 10000+200 ids for other columns
>>>    than stats. If in the future, we find some other feature which would
>>>    require a manifest file column for every data column in the table,
>>>    then we would need to change the spec.
>>>
>> For this I think we could start at *100,000* so that we use *100,000 +
>> 200 * <fieldID>* to calculate the field ID of a given statistic.
>>
>>
>>>
>>>    1. The current proposal expects every engine to share the same
>>>    stats, and not store any "non-standard" stat in the metadata.
>>>
>> We haven't explicitly stated it in the proposal but there were
>> discussions on how to potentially support this and what implications it
>> brings for readers/writers
>>
>>
>>> I'm still not clear on what the proposal is to handle stats for reserved
>>> columns <https://iceberg.apache.org/spec/#reserved-field-ids> [1] (I
>>> think there was some mention in the notes but it was light on details). It
>>> seems like it would be potentially useful to have stats for things like
>>> _row_id, and the multiplication would overflow for these column IDs (maybe
>>> this still yields unique column IDs though?)
>>>
>>
>> To handle stats for reserved columns we could start at *2,417,000,000*
>> which should give us enough room to store 200 stats per metadata ID. We
>> would also ensure that those ID ranges for table columns and reserved
>> columns wouldn't overlap.
>>
>>
>>> I assume we could put whatever these columns are under stats? Maybe we
>>> just need a more generic name for the top level struct?
>>
>>
>> I haven't updated the proposal yet, but I think renaming *column_stats*
>> to *content_stats* would make sense.
>>
>>
>>
