It makes sense to me, normally changing optional -> required would probably require a version bump, but maybe it is ok here as it is a relatively new format, afaik adapted by Trino which already sets this field, but let's see if anyone disagrees.
Thanks Szehon On Fri, Jun 21, 2024 at 3:35 PM huaxin gao <huaxin.ga...@gmail.com> wrote: > +1 for making the ndv blob metadata property required for theta sketches. > > On Fri, Jun 21, 2024 at 2:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote: > >> Hey all, >> >> I wanted to raise this thread to discuss a spec change proposal >> <https://github.com/apache/iceberg/pull/10549> for making the ndv blob >> metadata property required for theta sketches. Currently, the spec is a bit >> loose stating: >> >> The blob metadata for this blob *may* include following properties: >> >> - ndv: estimate of number of distinct values, derived from the sketch >> >> >> This came up on this PR >> <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> >> where >> it came up that engines like Presto/Trino are using the property as a >> source of truth and the implementation of the Spark procedure in the PR >> originally was deriving the NDV from the sketch itself. It's currently >> unclear what engine integrations should use as a source of truth. >> >> The main advantage of having it in the properties is that engines don't >> have to go and deserialize the sketch/compute the NDV if they just want the >> NDV (putting aside the intersection/union case where I think engines would >> have to read the sketch). I think this makes it easier for engine >> integration. The spec also currently makes it clear that the property must >> be derived from the sketch so I don't think there's a "source of truth" >> sync concern. It also should be easy for blob writers to set this property >> since they'd anyways be populating the sketch in the first place. >> >> An alternative is to attempt to read the property and fallback to the >> sketch (maybe abstract this behind an API) but this loses the advantage of >> guaranteeing that engines don't have to read the sketch. >> >> The spec change to make the property required seems to be the consensus >> on the PR thread but I wanted to bring it up here in case others had >> different ideas or if I'm missing any problems with this approach! >> >> >> Thank you, >> >> Amogh Jahagirdar >> >