Hi,

The identifiers fed to the sketches are strings.

Some of the sketches are built using Spark and the Hive functions from the
datasketches library, while others are built using a Kafka Streams job.
It's quite likely the covered period contains some sketches built by Spark
and some by the streaming job, but I can't tell where the exact cutoff was.
The Spark job is using org.apache.datasketches.hive.hll.DataToSketchUDAF.
The streaming job builds the sketches through Union objects: it receives
a stream of sketches, unions them in pairs, and forwards the result as a
sketch.

After some adjustments to the queries I'm running to get the exact counts
(to account for local times, etc.), these should be the correct values
with the given days excluded:
Without first day: 24890
Without first and second day: 22989
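
For comparing a union estimate against these exact counts, a minimal
sketch of a signed relative-error helper could look like the following.
The class and method names are hypothetical, and the estimates are read
from the command line (for example, values printed from
union.getEstimate()) rather than assumed here.

```java
import java.util.Locale;

/** Hypothetical helper: signed relative error of an estimate vs. an exact count. */
final class RelativeError {

    private RelativeError() { }

    /**
     * Returns (estimate - exactCount) / exactCount.
     * A positive value means the sketch overestimates the exact count.
     */
    static double relativeError(double estimate, long exactCount) {
        if (exactCount <= 0) {
            throw new IllegalArgumentException("exactCount must be positive");
        }
        return (estimate - exactCount) / exactCount;
    }

    public static void main(String[] args) {
        // Exact counts quoted in this message.
        long withoutFirstDay = 24890L;
        long withoutFirstAndSecondDay = 22989L;

        if (args.length != 2) {
            System.err.println(
                "usage: RelativeError <estimate-without-day-1> <estimate-without-days-1-2>");
            return;
        }

        // Estimates are supplied by the caller; none are assumed here.
        double estimateA = Double.parseDouble(args[0]);
        double estimateB = Double.parseDouble(args[1]);

        System.out.printf(Locale.ROOT, "without first day: %+.2f%%%n",
            100.0 * relativeError(estimateA, withoutFirstDay));
        System.out.printf(Locale.ROOT, "without first and second day: %+.2f%%%n",
            100.0 * relativeError(estimateB, withoutFirstAndSecondDay));
    }
}
```

Keeping the sign makes it visible whether a given lgMaxK setting
overestimates or underestimates, which a plain "N% off" figure hides.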

Thanks,
Marko


On Fri, 14 Aug 2020 at 17:08, leerho <lee...@gmail.com> wrote:

> Hi Marko,
> I notice that the first two sketches are the result of union operations,
> while the remaining sketches are pure streaming sketches.
> Could you perform Jon's request again except excluding the first two
> sketches?
>
> Just to cover the bases, could you explain the types of the
> data items that are being fed to the sketches?  Are your identifiers
> strings, longs or what?
>
> Thanks,
> Lee.
>
> On Thu, Aug 13, 2020 at 11:57 PM Jon Malkin <jon.mal...@gmail.com> wrote:
>
>> Thanks! We're investigating. We'll let you know if we have further
>> questions.
>>
>>   jon
>>
>> On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <marko.musn...@gmail.com>
>> wrote:
>>
>>> Hi Jon,
>>> The first sketch is the one where I see the jump. The exact count
>>> without the first sketch is 24765.
>>>
>>> The result for lgK=12 without the first sketch is 11% off, lgK=5 is
>>> within 2%.
>>>
>>> Thanks,
>>> Marko
>>>
>>> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jon.mal...@gmail.com> wrote:
>>>
>>>> Hi Marko,
>>>>
>>>> Could you please let us know two more things:
>>>> 1) Which is the one particular sketch that causes the estimate to jump?
>>>> 2) What is the exact unique count of the others without that sketch?
>>>>
>>>> It sort of seems like the first sketch, but it's hard to know for sure
>>>> since we don't know the true leave-one-out exact counts.
>>>>
>>>> Thanks,
>>>>   jon
>>>>
>>>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <marko.musn...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Could someone help me understand a behavior I see when trying to union
>>>>> some HLL sketches?
>>>>>
>>>>> I have 14 HLL sketches, and I know the exact unique counts for each of
>>>>> them. All the individual sketches give estimates within 2% of the exact
>>>>> counts.
>>>>>
>>>>> When I try to create a union, using the default lgMaxK parameter
>>>>> results in a total estimate that is way off (25% larger than the
>>>>> exact count).
>>>>>
>>>>> However, reducing the lgMaxK parameter in the union to 4 or 5 gives
>>>>> results that are within 2.5% of the exact counts.
>>>>>
>>>>> Also, one particular sketch seems to cause the final estimate to jump
>>>>> - not adding that sketch to the union keeps the result close to the exact
>>>>> count.
>>>>>
>>>>> Am I just seeing a very bad random error, or is there anything I'm
>>>>> doing wrong with the unions?
>>>>>
>>>>> Running on Java, using version 1.3.0. Just in case, the sketches are
>>>>> in the linked gist (hex encoded, one per line):
>>>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>>>> and the exact counts:
>>>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>>>>
>>>>> Thank you!
>>>>> Marko Musnjak
>>>>>
>>>>>
