Hi Kevin,

Inserting all the `visit_id`s into a ThetaSketch per day gives you a distinct "set" for each day. You can then union those sketches across the range on demand to get a distinct count over an arbitrary date range.

The one caution I would add is that unioning a very large number of sketches, or a set of very large sketches, is a fairly slow operation. If your ranges are very large and the result needs to be quick (i.e. someone is waiting live on the request), consider also pre-calculating sketches at a coarser granularity than one day, to reduce the number of sketches per query.

The alternative is not to pre-calculate anything at all, and to build the sketch for approximate distinct counts in memory, on the fly. Again, it depends on data size, etc.
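To make the "one sketch per day, union on demand" idea concrete, here is a rough, standard-library-only illustration in Python. It is a toy stand-in and not the datasketches API: `ToySketch`, `union`, `intersection`, `build_cube`, `sketch_for`, the value of `K`, and the sample rows are all made up for this example. In the real library, `update_theta_sketch`, `theta_union`, and `theta_intersection` play the corresponding roles. Note that the raw `visit_id`s go into the sketch, not a pre-computed count, and that the filter is read here as "(x or y) and z", which is only one possible reading.

```python
import hashlib

MAX_HASH = 2 ** 64   # hashes are treated as integers in [0, 2**64)
K = 4096             # nominal sketch size (hypothetical choice)


def _hash(item):
    """Map an item to a stable 64-bit integer."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToySketch:
    """Toy theta-style sketch: keeps the hashes that fall below `theta`."""

    def __init__(self, theta=MAX_HASH, hashes=()):
        self.theta = theta
        self.hashes = set(hashes)
        self._trim()

    def update(self, item):
        h = _hash(item)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def _trim(self):
        # Once more than K hashes are retained, lower theta to the (K+1)-th
        # smallest hash and keep only the K hashes below it. (Not optimized.)
        if len(self.hashes) > K:
            ordered = sorted(self.hashes)
            self.theta = ordered[K]
            self.hashes = set(ordered[:K])

    def estimate(self):
        # Retained count scaled up by the sampled fraction theta / MAX_HASH.
        return len(self.hashes) * MAX_HASH / self.theta


def union(sketches):
    """Distinct items seen by any of the sketches."""
    sketches = list(sketches)
    theta = min((s.theta for s in sketches), default=MAX_HASH)
    merged = set()
    for s in sketches:
        merged.update(h for h in s.hashes if h < theta)
    return ToySketch(theta, merged)


def intersection(a, b):
    """Distinct items seen by both sketches."""
    theta = min(a.theta, b.theta)
    return ToySketch(theta, (h for h in a.hashes & b.hashes if h < theta))


def build_cube(rows):
    """One sketch per (date, dimension value) key."""
    cube = {}
    for visit_id, dimensions, day in rows:
        for dim in dimensions:
            cube.setdefault((day, dim), ToySketch()).update(visit_id)
    return cube


def sketch_for(cube, dim, start, end):
    """Union the daily sketches of one dimension value over [start, end]."""
    return union(s for (day, d), s in cube.items()
                 if d == dim and start <= day <= end)


# Made-up sample rows: (visit_id, dimension array, ISO date string).
rows = [
    ("v1", ["x", "z"], "2022-01-01"),
    ("v2", ["y"], "2022-01-01"),
    ("v1", ["x", "z"], "2022-01-02"),   # same visit_id seen on a second day
    ("v3", ["y", "z"], "2022-01-02"),
    ("v4", ["z"], "2022-01-03"),        # outside the queried range
]
cube = build_cube(rows)
start, end = "2022-01-01", "2022-01-02"
x, y, z = (sketch_for(cube, d, start, end) for d in "xyz")

any_dim = union([x, y, z])                     # x or y or z
x_or_y_and_z = intersection(union([x, y]), z)  # (x or y) and z

# With far fewer than K items the toy sketch is exact.
print(any_dim.estimate())       # 3.0
print(x_or_y_and_z.estimate())  # 2.0
```

So yes, in this layout you first union the per-day sketches of each dimension value across the date range, and only then combine the dimension values with unions and intersections. A coarser pre-calculated level (say, one sketch per week) would simply be the union of the daily sketches, stored under its own key.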
Hope that helps.

Karl

On Mon, May 23, 2022 at 6:20 PM Kevin Peng <kevin.p...@audigent.com> wrote:

> Hi All,
>
> I am pretty new to the community and I am trying to get my head wrapped
> around the usage of the theta sketch python library to compute approx
> distinct counts.
>
> Here is my use case:
>
>    - I have the following table structure: visit_id, dimension (array),
>    date (Single GMT day i.e. 1/1/2022)
>    - I want to run a distinct count of visit_ids over a dynamic date
>    range and group them by dimension sets i.e. select count(visit_id) where
>    date >= a and date <= b and dimension contains x or dimension contains y
>    and dimension contains z
>
> What I am planning is:
>
>    - Create a theta sketch cube and store them in a hashtable i.e.
>    dynamodb using a workflow orchestration tool like airflow for each date
>    - Retrieve the theta sketch cubes for each day in the date range and
>    do union and intersection on request
>
> Here is my question:
>
>    - I was trying to look at this example:
>
>    https://github.com/apache/datasketches-cpp/blob/763f9249de576dca8c080fb4f3f438625a332b0b/python/tests/theta_test.py#L20
>       - For creating the sketches should I be calculating the distinct
>       count group by date and dimension first and use that value with the key
>       being some combination of the dimension and date?
>       - Would the blob I store into the hashtable be the key that I
>       construct with the result returned back by the example
>       generate_theta_sketch method in the example test?
>          - If this is the case, in order to query a date range I would
>          have to construct a union of similar dimensions with different
>          dates within the date range first before I can do any
>          unions/intersections of different dimension values in that date
>          range? Is there an easier way?
>
>
> --
> Kevin Peng
> Chief Engineer, DMP
> 305.775.2463