For the compatibility and version bump question, shouldn't this be a new
revision of the theta sketch rather than a new version of Puffin? I think
we want to avoid making changes to the file format itself if we can contain
the changes to the sketch definitions.

+1 for raising a vote once we have a PR with the full changes.

On Fri, Jul 5, 2024 at 9:35 AM Amogh Jahagirdar <amo...@apache.org> wrote:

> Hey all,
>
> Thanks everyone for providing your input on this discussion thread. There
> are 2 points I'd like to bring up, which in hindsight I should've
> recognized.
>
> If we look at the current Iceberg Improvement Process
> https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals
> and how proposals are adopted
> https://iceberg.apache.org/contribute/#how-are-proposals-adopted, since
> this counts as a spec change, it should go through a formal vote process,
> as JB alluded to.
>
> Next, even though this spec change may be considered trivial and Puffin is
> still in the early stages of adoption, the Puffin spec v1 as it is
> currently written was already voted on, see:
> https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q.
> Since the change I'm proposing converts an optional property into a
> required one, we should add it as part of a Puffin spec v2 (like Szehon
> was referring to). Thinking about this more, I believe this is important
> because going forward it should be unambiguous what the practice is when
> making spec changes like this.
>
> What I'm now proposing:
>
> 1.) Start a vote on making the NDV property required for theta sketch
> blobs in a Puffin V2 format. Since there seems to be general support on
> this thread for making this property required, I think we can go to a
> vote, but to be very clear, this requirement would apply to the next
> Puffin version.
>
> 2.) For the current PR for analyze table
> https://github.com/apache/iceberg/pull/10288 and other engine
> integrations with Puffin, they'd have to follow the spec as it is today,
> which means doing a best-effort read of the property and, if it does not
> exist, deriving the NDV from the sketch.
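(A minimal sketch of that read path, for illustration only: `read_ndv` and `derive_ndv_from_sketch` are hypothetical names, not the actual Iceberg API, and the sketch estimator is stubbed rather than a real theta sketch deserializer.)

```python
def read_ndv(blob_properties, sketch_bytes, derive_ndv_from_sketch):
    """Best-effort read per the current Puffin v1 spec: prefer the 'ndv'
    blob metadata property; fall back to deriving it from the sketch."""
    ndv = blob_properties.get("ndv")
    if ndv is not None:
        return int(ndv)
    # Property absent: deserialize the theta sketch and estimate instead.
    return derive_ndv_from_sketch(sketch_bytes)

# Stubbed example: one blob carries the property, one does not.
estimate = lambda raw: 1250  # stand-in for a real theta sketch estimator
print(read_ndv({"ndv": "1250"}, b"...", estimate))  # property wins
print(read_ndv({}, b"...", estimate))               # derived from sketch
```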
>
> Please let me know your thoughts!
>
> Thanks,
>
> Amogh Jahagirdar
>
> On 2024/06/21 21:54:13 Amogh Jahagirdar wrote:
> > Hey all,
> >
> > I wanted to raise this thread to discuss a spec change proposal
> > <https://github.com/apache/iceberg/pull/10549> for making the ndv blob
> > metadata property required for theta sketches. Currently, the spec is a
> > bit loose, stating:
> >
> > The blob metadata for this blob *may* include following properties:
> >
> >    - ndv: estimate of number of distinct values, derived from the sketch
> >
> >
> > This came up on this PR
> > <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695>,
> > where it turned out that engines like Presto/Trino are using the property
> > as a source of truth, while the implementation of the Spark procedure in
> > the PR originally derived the NDV from the sketch itself. It's currently
> > unclear what engine integrations should use as a source of truth.
> >
> > The main advantage of having it in the properties is that engines don't
> > have to deserialize the sketch and compute the NDV if they just want the
> > NDV (putting aside the intersection/union case, where I think engines
> > would have to read the sketch). I think this makes engine integration
> > easier. The spec also currently makes it clear that the property must be
> > derived from the sketch, so I don't think there's a "source of truth"
> > sync concern. It should also be easy for blob writers to set this
> > property, since they'd be populating the sketch in the first place anyway.
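(A minimal sketch of the writer side, under the same caveat: `FakeThetaSketch` and `write_theta_blob` are illustrative stand-ins, not the actual Iceberg or DataSketches API. The point is just that the writer already holds the sketch, so setting `ndv` is one extra step at write time.)

```python
class FakeThetaSketch:
    """Stand-in for a real theta sketch; only what this example needs."""
    def __init__(self, estimate):
        self._estimate = estimate
    def get_estimate(self):
        return self._estimate
    def serialize(self):
        return b"<sketch bytes>"

def write_theta_blob(sketch):
    """Derive the 'ndv' property from the sketch the writer already has."""
    return {
        "blob": sketch.serialize(),
        "properties": {"ndv": str(round(sketch.get_estimate()))},
    }

blob = write_theta_blob(FakeThetaSketch(1250.4))
print(blob["properties"]["ndv"])  # "1250"
```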
> >
> > An alternative is to attempt to read the property and fall back to the
> > sketch (maybe abstracting this behind an API), but this loses the
> > guarantee that engines never have to read the sketch.
> >
> > The spec change to make the property required seems to be the consensus
> > on the PR thread, but I wanted to bring it up here in case others had
> > different ideas or I'm missing any problems with this approach!
> >
> >
> > Thank you,
> >
> > Amogh Jahagirdar
> >
>


-- 
Ryan Blue
Databricks
