+1, to make *apache-datasketches-theta-v1* sketch has the ndv blob property always present
On Sat, Jun 22, 2024 at 3:20 AM Piotr Findeisen <piotr.findei...@gmail.com> wrote: > Hi, > > non-binding +1 to require that `apache-datasketches-theta-v1` sketch has > ndv blob property set. > > Best > PF > > > On Sat, 22 Jun 2024 at 06:27, Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > >> Hi Amogh >> >> +1 to have ndv blob metadata property required. >> >> As discussed during the last community meeting, we discussed voting on >> all code modification changes on spec. >> For the next spec changes, I would propose to start a voting thread, >> like "[VOTE][Puffin Spec] Make the ndv blob metadata property required >> for theta sketches". >> >> Thanks ! >> Regards >> JB >> >> On Fri, Jun 21, 2024 at 11:54 PM Amogh Jahagirdar <2am...@gmail.com> >> wrote: >> > >> > Hey all, >> > >> > I wanted to raise this thread to discuss a spec change proposal for >> making the ndv blob metadata property required for theta sketches. >> Currently, the spec is a bit loose stating: >> > >> > The blob metadata for this blob may include following properties: >> > >> > ndv: estimate of number of distinct values, derived from the sketch >> > >> > >> > This came up on this PR where it came up that engines like Presto/Trino >> are using the property as a source of truth and the implementation of the >> Spark procedure in the PR originally was deriving the NDV from the sketch >> itself. It's currently unclear what engine integrations should use as a >> source of truth. >> > >> > The main advantage of having it in the properties is that engines don't >> have to go and deserialize the sketch/compute the NDV if they just want the >> NDV (putting aside the intersection/union case where I think engines would >> have to read the sketch). I think this makes it easier for engine >> integration. The spec also currently makes it clear that the property must >> be derived from the sketch so I don't think there's a "source of truth" >> sync concern. It also should be easy for blob writers to set this property >> since they'd anyways be populating the sketch in the first place. >> > >> > An alternative is to attempt to read the property and fallback to the >> sketch (maybe abstract this behind an API) but this loses the advantage of >> guaranteeing that engines don't have to read the sketch. >> > >> > The spec change to make the property required seems to be the consensus >> on the PR thread but I wanted to bring it up here in case others had >> different ideas or if I'm missing any problems with this approach! >> > >> > >> > Thank you, >> > >> > Amogh Jahagirdar >> >