Agreed this isn't a Puffin format change copying comment from the "Spec: Make NDV blob metadata property required" PR <https://github.com/apache/iceberg/pull/10549>
Puffin fole format has place for versioning within the magic, but the > Puffin format doesn't change, only its use changes. Puffin spec is not > authoritative source of information of all blob types contained within > Puffin files, it's open for extension. And thus, Iceberg may decide that > whenever apache-datasketches-theta-v1 blob type is used in Puffin, the blob > can be, should be or must be associated with ndv property. Ryan, i think we could add some intro to this section https://iceberg.apache.org/puffin-spec/#blob-types explaining the reader whether the documentation contained in this section is considered authoritative and complete, or eg any other blob types or other properties can be expected. Best PF On Fri, 5 Jul 2024 at 19:19, Ryan Blue <b...@databricks.com.invalid> wrote: > For the compatibility and version bump question, shouldn't this be a new > revision of the theta sketch rather than a new version of Puffin? I think > we want to avoid making changes to the file format itself if we can contain > the changes to the sketch definitions. > > +1 for raising a vote once we have a PR with the full changes. > > On Fri, Jul 5, 2024 at 9:35 AM Amogh Jahagirdar <amo...@apache.org> wrote: > >> Hey all, >> >> Thanks everyone for providing your input on this discussion thread. There >> are 2 points I'd like to bring up, which in hindsight I should've >> recognized. >> >> If we look at the current Iceberg Improvement Process >> https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals >> and how proposals are adopted >> https://iceberg.apache.org/contribute/#how-are-proposals-adopted, since >> this counts as spec change it should go through a formal vote process as JB >> alluded to. >> >> Next, even though this spec change may be considered trivial and Puffin >> is still in the early stages of adoption, the Puffin spec v1 as it is >> currently written was already voted on, see: >> https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q. >> Since the change I'm proposing is converting an optional property into >> required, what this probably means is we should add this as part of a >> Puffin spec v2 (like Szehon was referring to). Thinking about this more, I >> believe this is important because going forward it should be unambiguous on >> what the practice is when making spec changes like this. >> >> What I'm now proposing: >> >> 1.) Start a vote on making the NDV property required for theta sketch >> blobs in a Puffin V2 format. Since there seems to be general support on >> this thread about making this property required, I think we can go into a >> vote, but to be very clear this requirement would apply for the next Puffin >> version. >> >> 2.) For the current PR for analyze table >> https://github.com/apache/iceberg/pull/10288 and other engine >> integrations with Puffin, they'd have to follow the spec as it is today >> which means doing a best effort read of the property, and if it does not >> exist, derive the NDV from the sketch. >> >> Please let me know your thoughts! >> >> Thanks, >> >> Amogh Jahagirdar >> >> On 2024/06/21 21:54:13 Amogh Jahagirdar wrote: >> > Hey all, >> > >> > I wanted to raise this thread to discuss a spec change proposal >> > <https://github.com/apache/iceberg/pull/10549> for making the ndv blob >> > metadata property required for theta sketches. Currently, the spec is a >> bit >> > loose stating: >> > >> > The blob metadata for this blob *may* include following properties: >> > >> > - ndv: estimate of number of distinct values, derived from the sketch >> > >> > >> > This came up on this PR >> > < >> https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695 >> > >> > where >> > it came up that engines like Presto/Trino are using the property as a >> > source of truth and the implementation of the Spark procedure in the PR >> > originally was deriving the NDV from the sketch itself. It's currently >> > unclear what engine integrations should use as a source of truth. >> > >> > The main advantage of having it in the properties is that engines don't >> > have to go and deserialize the sketch/compute the NDV if they just want >> the >> > NDV (putting aside the intersection/union case where I think engines >> would >> > have to read the sketch). I think this makes it easier for engine >> > integration. The spec also currently makes it clear that the property >> must >> > be derived from the sketch so I don't think there's a "source of truth" >> > sync concern. It also should be easy for blob writers to set this >> property >> > since they'd anyways be populating the sketch in the first place. >> > >> > An alternative is to attempt to read the property and fallback to the >> > sketch (maybe abstract this behind an API) but this loses the advantage >> of >> > guaranteeing that engines don't have to read the sketch. >> > >> > The spec change to make the property required seems to be the consensus >> on >> > the PR thread but I wanted to bring it up here in case others had >> different >> > ideas or if I'm missing any problems with this approach! >> > >> > >> > Thank you, >> > >> > Amogh Jahagirdar >> > >> > > > -- > Ryan Blue > Databricks >