For the compatibility and version bump question, shouldn't this be a new revision of the theta sketch rather than a new version of Puffin? I think we want to avoid making changes to the file format itself if we can contain the changes to the sketch definitions.
+1 for raising a vote once we have a PR with the full changes.

On Fri, Jul 5, 2024 at 9:35 AM Amogh Jahagirdar <amo...@apache.org> wrote:

> Hey all,
>
> Thanks everyone for providing your input on this discussion thread. There
> are two points I'd like to bring up, which in hindsight I should have
> recognized earlier.
>
> If we look at the current Iceberg Improvement Process
> (https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals)
> and how proposals are adopted
> (https://iceberg.apache.org/contribute/#how-are-proposals-adopted), since
> this counts as a spec change it should go through a formal vote process,
> as JB alluded to.
>
> Next, even though this spec change may be considered trivial and Puffin
> is still in the early stages of adoption, the Puffin spec v1 as currently
> written was already voted on; see
> https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q.
> Since the change I'm proposing converts an optional property into a
> required one, we should probably add it as part of a Puffin spec v2 (as
> Szehon was suggesting). Thinking about this more, I believe this is
> important because going forward it should be unambiguous what the
> practice is when making spec changes like this.
>
> What I'm now proposing:
>
> 1.) Start a vote on making the NDV property required for theta sketch
> blobs in a Puffin v2 format. Since there seems to be general support on
> this thread for making this property required, I think we can go to a
> vote, but to be very clear, this requirement would apply to the next
> Puffin version.
>
> 2.) For the current PR for analyze table
> (https://github.com/apache/iceberg/pull/10288) and other engine
> integrations with Puffin, they'd have to follow the spec as it is today,
> which means doing a best-effort read of the property and, if it does not
> exist, deriving the NDV from the sketch.
>
> Please let me know your thoughts!
>
> Thanks,
>
> Amogh Jahagirdar
>
> On 2024/06/21 21:54:13 Amogh Jahagirdar wrote:
> > Hey all,
> >
> > I wanted to raise this thread to discuss a spec change proposal
> > (https://github.com/apache/iceberg/pull/10549) for making the ndv blob
> > metadata property required for theta sketches. Currently, the spec is
> > a bit loose, stating:
> >
> > The blob metadata for this blob *may* include following properties:
> >
> > - ndv: estimate of number of distinct values, derived from the sketch
> >
> > This came up on this PR
> > (https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695),
> > where it emerged that engines like Presto/Trino use the property as a
> > source of truth, while the implementation of the Spark procedure in
> > the PR originally derived the NDV from the sketch itself. It's
> > currently unclear what engine integrations should use as a source of
> > truth.
> >
> > The main advantage of having it in the properties is that engines
> > don't have to deserialize the sketch and compute the NDV if they just
> > want the NDV (putting aside the intersection/union case, where I think
> > engines would have to read the sketch). I think this makes engine
> > integration easier. The spec also currently makes it clear that the
> > property must be derived from the sketch, so I don't think there's a
> > "source of truth" sync concern. It should also be easy for blob
> > writers to set this property, since they'd be populating the sketch in
> > the first place anyway.
> >
> > An alternative is to attempt to read the property and fall back to the
> > sketch (maybe abstracted behind an API), but this loses the guarantee
> > that engines don't have to read the sketch.
> >
> > Making the property required seems to be the consensus on the PR
> > thread, but I wanted to bring it up here in case others have different
> > ideas or I'm missing any problems with this approach!
> >
> > Thank you,
> >
> > Amogh Jahagirdar

--
Ryan Blue
Databricks
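The best-effort read described in the thread (use the `ndv` property when present, otherwise derive the estimate from the sketch) could be abstracted behind a small API, as Amogh suggests. A minimal Python sketch of that idea, not an actual Iceberg API; `estimate_from_sketch` is a hypothetical hook standing in for a real deserializer such as Apache DataSketches:

```python
from typing import Callable, Mapping


def read_ndv(
    blob_properties: Mapping[str, str],
    blob_payload: bytes,
    estimate_from_sketch: Callable[[bytes], float],
) -> int:
    """Return the NDV for a theta sketch blob.

    Under the Puffin v1 spec the "ndv" property is optional, so readers
    must be prepared to fall back to the sketch itself. The
    `estimate_from_sketch` callback is a hypothetical stand-in for
    deserializing the sketch payload and calling its estimator.
    """
    ndv = blob_properties.get("ndv")
    if ndv is not None:
        return int(ndv)  # fast path: no sketch deserialization needed
    # Slow path: derive the estimate from the serialized sketch.
    return round(estimate_from_sketch(blob_payload))
```

With a Puffin v2 that makes the property required, only the fast path would be exercised; against v1 files the fallback keeps readers correct when writers omitted the property.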