+1, to make *apache-datasketches-theta-v1* sketch has the ndv blob property
always present

On Sat, Jun 22, 2024 at 3:20 AM Piotr Findeisen <piotr.findei...@gmail.com>
wrote:

> Hi,
>
> non-binding +1 to require that `apache-datasketches-theta-v1` sketch has
> ndv blob property set.
>
> Best
> PF
>
>
> On Sat, 22 Jun 2024 at 06:27, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Amogh
>>
>> +1 to have ndv blob metadata property required.
>>
>> As discussed during the last community meeting, we discussed voting on
>> all code modification changes on spec.
>> For the next spec changes, I would propose to start a voting thread,
>> like "[VOTE][Puffin Spec] Make the ndv blob metadata property required
>> for theta sketches".
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Fri, Jun 21, 2024 at 11:54 PM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>> >
>> > Hey all,
>> >
>> > I wanted to raise this thread to discuss a spec change proposal for
>> making the ndv blob metadata property required for theta sketches.
>> Currently, the spec is a bit loose stating:
>> >
>> > The blob metadata for this blob may include following properties:
>> >
>> > ndv: estimate of number of distinct values, derived from the sketch
>> >
>> >
>> > This came up on this PR where it came up that engines like Presto/Trino
>> are using the property as a source of truth and the implementation of the
>> Spark procedure in the PR originally was deriving the NDV from the sketch
>> itself. It's currently unclear what engine integrations should use as a
>> source of truth.
>> >
>> > The main advantage of having it in the properties is that engines don't
>> have to go and deserialize the sketch/compute the NDV if they just want the
>> NDV (putting aside the intersection/union case where I think engines would
>> have to read the sketch). I think this makes it easier for engine
>> integration. The spec also currently makes it clear that the property must
>> be derived from the sketch so I don't think there's a "source of truth"
>> sync concern. It also should be easy for blob writers to set this property
>> since they'd anyways be populating the sketch in the first place.
>> >
>> > An alternative is to attempt to read the property and fallback to the
>> sketch (maybe abstract this behind an API) but this loses the advantage of
>> guaranteeing that engines don't have to read the sketch.
>> >
>> > The spec change to make the property required seems to be the consensus
>> on the PR thread but I wanted to bring it up here in case others had
>> different ideas or if I'm missing any problems with this approach!
>> >
>> >
>> > Thank you,
>> >
>> > Amogh Jahagirdar
>>
>

Reply via email to