Hi Kevin,

Inserting all the `visit_id`s for a day into a theta sketch gives you a
distinct "set" for that day. You can then union those daily sketches on
demand to get a distinct count over an arbitrary date range. The one caution
I would make is that unioning a very large number of sketches, or a set of
very large sketches, is a fairly slow operation. If your ranges are very
large and the query needs to be quick (e.g. someone is waiting live on the
request), you might also pre-calculate sketches at a coarser granularity
than one day, to reduce the number of sketches per query. The alternative is
not to pre-calculate anything at all, and to build a sketch for approximate
distinct counts in memory, on the fly. Again, it depends on data size, etc.

Hope that helps.

Karl

On Mon, May 23, 2022 at 6:20 PM Kevin Peng <kevin.p...@audigent.com> wrote:

> Hi All,
>
> I am pretty new to the community and I am trying to get my head wrapped
> around the usage of the theta sketch python library to compute approx
> distinct counts.
>
> Here is my use case:
>
>    - I have the following table structure: visit_id, dimension (array),
>    date (Single GMT day i.e. 1/1/2022)
>    - I want to run a distinct count of visit_ids over a dynamic date
>    range and group them by dimension sets i.e. select count(visit_id) where
>    date >= a and date <= b and dimension contains x or dimension contains y
>    and dimension contains z
>
> What I am planning is:
>
>    - Create a theta sketch cube and store them in a hashtable i.e.
>    dynamodb using a workflow orchestration tool like airflow for each date
>    - Retrieve the theta sketch cubes for each day in the date range and
>    do union and intersection on request
>
> Here is my question:
>
>    - I was trying to look at this example:
>    https://github.com/apache/datasketches-cpp/blob/763f9249de576dca8c080fb4f3f438625a332b0b/python/tests/theta_test.py#L20
>       - For creating the sketches should I be calculating the distinct
>       count group by date and dimension first and use that value with the key
>       being some combination of the dimension and date?
>       - Would the blob I store into the hashtable be the key that I
>       construct with the result returned back by the example
>       generate_theta_sketch method in the example test?
>          - If this is the case, in order to query a date range I would
>          have to construct a union of similar dimensions with different
>          dates within the date range first before I can do any
>          unions/intersections of different dimension values in that date
>          range?  Is there an easier way?
>
>
> --
> Kevin Peng
> Chief Engineer, DMP
> 305.775.2463
>
>
>
