Hi Karik,
The problem you describe is typical for on-line advertising and similar to
ones we have worked on before.  Solving this problem with sketches will
provide approximate results in near-real time.  However, doing so even with
sketches may require considerably more resources than you may be
planning.  When
you said "we don’t have a data warehouse and ad-hoc OLAP",  I am hoping you
are not thinking that you can solve this problem at the scale that you
described with just a few machines.  Nonetheless, if you attempted to solve
this problem brute-force to produce exact answers, the resources required
would be far larger and the query latency would be orders-of-magnitude
longer.

Your queries Q1 and Q2 are both similar in that you are seeking a unique
count of some identifier, either UserID or AdImpressionID, qualified by
dimensions. However they are very different in that the domain of users is
completely unrelated to the domain of ad impressions.  You cannot take a
sketch from the user domain and merge (Union, Intersection, Difference)
with a sketch from the ad impression domain.  Your statement: "So after
roll-up, I would end up with 1 theta-sketch per dimension value per
day(assuming day level granularity) and I can do unions and intersections
on these sketches to answer Q2" makes no sense to me.  Your raw data
records need to be much richer and something similar to the form (but
likely much more complex):

   - Date-Time-Stamp, Dim-name, Dim-value, UserID, AdImpressionID

Presumably, your AdImpressionID can be correlated back to another table
that provides richer dimensional information about the Ad impression, but
at least this record would allow correlation of unique users to ad
impressions.

The Theta sketch family provides approximate intersections and differences,
but only against the originating domain of the data fed to the sketch. You
can intersect unique users of one sector with unique users of another
sector.  But you can't logically intersect a sketch of unique users with a
sketch of unique ad impressions as the result is totally meaningless.

It sounds like what you are asking for is a full SQL-type Join (which is a
type of intersection).

We don't have a sketch that directly addresses the full Join, but we have
the Tuple sketch, which is derived from the Theta sketch, that may be able
to help you.  I would encourage you to read up on the Tuple Sketch on our
website, and come back with more questions if it seems that the Tuple
Sketch might be of interest.

***
Other comments to think about:

   - You stated that your time granularity was a "day" and that you had
   users in countries all over the world.  I'm sure you are aware that the
   definition of a "day" varies all over the world with about 40 or so
   time-zones.  You might find it more useful to define your time granularity
   to be an hour, for example, and create rotating time windows, and then with
   producing queries for a particular time-zone, simply merge the 24 sketches
   for that time zone.
   - With respect to Druid, we have been using sketches with Druid for many
   years very successfully.  There are other systems with DataSketches
   integration as well, including PostgreSQL, Thinking Machine, Permutive, and
   a few more that are in the process of integration with DataSketches such as
   Impala and Pinot.

Cheers,
Lee.






On Tue, Sep 14, 2021 at 10:09 PM Kartik Mahajan <dhaki...@gmail.com> wrote:

> Hi,
>
> We’re currently serving 100 Billion ad impressions per day across our 6
> data centers. Out of these, we are serving about 80 Billion ad impressions
> in the US alone.
> Each ad impression can have hundreds of attributes(dimensions) e.g
> Country, City, Brower, OS, Custom-Parameters from web-page, ad-size, ad-id,
> site-id etc
>
> Currently, we don’t have a data warehouse and ad-hoc OLAP support is
> pretty much non-existent in our organization. This severely limits our
> ability to run adhoc queries and get a quick grasp of data.
>
> We want to answer the following 2 queries to begin with :-
>
> Q1) Find the total count of ad impressions which were served from "beginDate" 
> to "endDate" where Dimension1 = d1 and Dimension2 = d2 .... .. Dimensionk = 
> d_k
>
> Q2) Find the total count of unique users which saw our ads from "beginDate" 
> to "endDate" where Dimension1 = d1 and/or Dimension2 = d2 .... .. Dimensionk 
> = d_k
>
> As I said each impression can have hundreds of dimensions(listed above)
> and cardinality of each dimension could be from few hundreds(say for
> dimension Country) to Billions(for e.g User-id).
>
> We want approximate answers and the least infrastructure cost and query
> response time within < 5 minutes. I am thinking about using Druid and Apache
> datasketches <https://datasketches.apache.org/>(Theta Sketch to be
> precise) for answering Q2 and using the following data-model :-
>
> Date Dimension Name     Dimension Value        Unique-User-ID(Theta sketch)
> 2021/09/12   "Country"        "US"             37873-3udif-83748-2973483
> 2021/09/12   "Browser"       "Chrome"          37873-3aeuf-83748-2973483
> .
> .<Other records>
>
> So after roll-up, I would end up with 1 theta-sketch per dimension value
> per day(assuming day level granularity) and I can do unions and
> intersections on these sketches to answer Q2)
>
> I am planning to set k(nominal entries) to 10^5(please comment about what
> would be suitable k for this use case and expected storage amount required?)
>
> I’ve also read that the about theta sketch set ops accuracy here
> <https://datasketches.apache.org/docs/Theta/ThetaSketchSetOpsAccuracy.html>
>
> I would like to know if there is a better approach to solve Q2(with or
> without Druid)
>
> Also I would like to know how can I solve Q1?
>
> If I replace Unique-User-Id with “Impression-Id”, can I use the same data
> model to answer Q1? I believe that if I replace Unique-User-Id with
> “Impression-Id” then accuracy to count the total impressions would be way
> worse than that of Q2, because each ad-impression is assigned a unique id
> and we are currently serving 250 Billion per day.
>
> Please share your thoughts about solving Q1 and Q2.
>
> Regards
> kartik
>

Reply via email to