Vijay,

Sorry about the delay in getting back to you. There is a critical piece of information missing from your description: the domain of what you are sketching. I presume it is User-IDs; otherwise the scheme doesn't make sense. If that is the case, I think the solution can be achieved in a couple of ways. Going forward with this assumption:

Your raw data would look something like this:

    {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}

A proposed solution would be to configure a Tuple sketch whose Summary carries 3 fields: (Impressions, Clicks, AdSpend). This requires a little programming: you would need to create classes that implement the interfaces Summary, SummaryFactory and SummarySetOperations. There are examples of how to do this on the website; a rough sketch of what those classes might look like is below.
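Very roughly (untested; the class and field names such as AdMetricsSummary are just placeholders, serialization is stubbed out, and the exact interface signatures have shifted a little between releases, so check the javadocs and the website examples), something like:

import org.apache.datasketches.tuple.SummaryFactory;
import org.apache.datasketches.tuple.SummarySetOperations;
import org.apache.datasketches.tuple.UpdatableSummary;

// Placeholder 3-field Summary: one instance is retained per sampled UserID.
// Real serialization (toByteArray() plus a deserializer) is omitted here.
public class AdMetricsSummary implements UpdatableSummary<double[]> {
  double impressions, clicks, adSpend;

  @Override
  public AdMetricsSummary update(final double[] v) {
    // v = {impressions, clicks, adSpend} from one raw input row
    impressions += v[0];
    clicks      += v[1];
    adSpend     += v[2];
    return this;
  }

  @Override
  public AdMetricsSummary copy() {
    final AdMetricsSummary c = new AdMetricsSummary();
    c.impressions = impressions; c.clicks = clicks; c.adSpend = adSpend;
    return c;
  }

  @Override
  public byte[] toByteArray() {
    throw new UnsupportedOperationException("add real serialization if you store sketches");
  }
}

public class AdMetricsSummaryFactory implements SummaryFactory<AdMetricsSummary> {
  @Override
  public AdMetricsSummary newSummary() { return new AdMetricsSummary(); }
}

public class AdMetricsSummarySetOps implements SummarySetOperations<AdMetricsSummary> {
  // Union: the same user appears in both sketches, so add the contributions.
  @Override
  public AdMetricsSummary union(final AdMetricsSummary a, final AdMetricsSummary b) {
    final AdMetricsSummary r = new AdMetricsSummary();
    r.impressions = a.impressions + b.impressions;
    r.clicks      = a.clicks + b.clicks;
    r.adSpend     = a.adSpend + b.adSpend;
    return r;
  }

  // Intersection: each dimension sketch already saw this user's rows for its own
  // coordinate, so simply adding would multiply-count. Keeping one side is one
  // reasonable policy, but you need to decide what semantics you actually want.
  @Override
  public AdMetricsSummary intersection(final AdMetricsSummary a, final AdMetricsSummary b) {
    return a.copy();
  }
}

What union and intersection should do to the metric fields is a modeling decision on your part; the choices above are only one possibility.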
Then you would process your raw data as follows:

- Partition your raw data into 213 streams defined by the three dimensions:
  - 10 X AgeGroup
  - 3 X Gender
  - 200 X Country
- Each stream would feed a Tuple sketch configured as above, so you need only 213 sketches.
- These 213 sketches form a 3-dimensional hypercube, one sketch per coordinate of each dimension.
- Each input tuple would feed 3 sketches, one in each dimension.

A query would choose the desired coordinates in each of the 3 dimensions. If the query selects more than one coordinate from any one dimension, you would first merge (union) the corresponding coordinate sketches of that dimension together. A missing dimension would mean first merging all of the coordinate sketches of that dimension together. The final result sketch is the Intersection of the three resulting dimension sketches.

With that resulting Tuple sketch you iterate over all of the retained entries (Summaries) and sum each of the 3 fields, then divide each sum by Theta. The result is the estimate of Impressions, Clicks, and AdSpend for the population of users selected by the query. The size of that population is obtained from getEstimate().

Be aware that Intersections, by their nature, can increase error significantly (see the website). Try to avoid intersections with coordinate sketches that have very small populations, if possible. You can get a sense of how good your results are from getUpperBound() and getLowerBound() (with respect to the user population): the bigger that spread is, the worse your estimates will be. A rough code sketch of this update-and-query flow follows.
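Again untested, and using the placeholder classes above; method names like union(), intersect() and getResult() reflect my reading of the current Tuple API (older releases used update() instead), so check the javadocs for your version. The particular indexes chosen for the query are of course just illustrative.

import java.util.ArrayList;
import java.util.List;

import org.apache.datasketches.tuple.CompactSketch;
import org.apache.datasketches.tuple.Intersection;
import org.apache.datasketches.tuple.SketchIterator;
import org.apache.datasketches.tuple.Union;
import org.apache.datasketches.tuple.UpdatableSketch;
import org.apache.datasketches.tuple.UpdatableSketchBuilder;

public class AdHypercubeExample {
  static final AdMetricsSummaryFactory FACTORY = new AdMetricsSummaryFactory();
  static final AdMetricsSummarySetOps SET_OPS = new AdMetricsSummarySetOps();

  // 10 + 3 + 200 = 213 coordinate sketches, one per value of each dimension.
  static final List<UpdatableSketch<double[], AdMetricsSummary>> AGE     = newAxis(10);
  static final List<UpdatableSketch<double[], AdMetricsSummary>> GENDER  = newAxis(3);
  static final List<UpdatableSketch<double[], AdMetricsSummary>> COUNTRY = newAxis(200);

  static List<UpdatableSketch<double[], AdMetricsSummary>> newAxis(final int size) {
    final List<UpdatableSketch<double[], AdMetricsSummary>> axis = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      axis.add(new UpdatableSketchBuilder<double[], AdMetricsSummary>(FACTORY).build());
    }
    return axis;
  }

  // Each raw row feeds exactly 3 sketches: one coordinate in each dimension.
  static void feed(final long userId, final int age, final int gender, final int country,
      final double impressions, final double clicks, final double adSpend) {
    final double[] metrics = { impressions, clicks, adSpend };
    AGE.get(age).update(userId, metrics);
    GENDER.get(gender).update(userId, metrics);
    COUNTRY.get(country).update(userId, metrics);
  }

  // Merge (union) the selected coordinates of one dimension into a single sketch.
  static CompactSketch<AdMetricsSummary> mergeCoordinates(
      final List<UpdatableSketch<double[], AdMetricsSummary>> axis, final int... selected) {
    final Union<AdMetricsSummary> u = new Union<>(SET_OPS);
    for (final int i : selected) {
      u.union(axis.get(i)); // older releases: u.update(...)
    }
    return u.getResult();
  }

  // Example query: AgeGroups {1,2}, Gender index 0, Country not specified.
  static void query() {
    final int[] allCountries = new int[200];
    for (int i = 0; i < 200; i++) { allCountries[i] = i; }

    final Intersection<AdMetricsSummary> ix = new Intersection<>(SET_OPS);
    ix.intersect(mergeCoordinates(AGE, 1, 2));
    ix.intersect(mergeCoordinates(GENDER, 0));
    ix.intersect(mergeCoordinates(COUNTRY, allCountries)); // missing dimension: union of all its coordinates
    final CompactSketch<AdMetricsSummary> result = ix.getResult();

    // Sum the 3 fields over the retained Summaries, then divide by Theta.
    double imp = 0, clk = 0, spend = 0;
    final SketchIterator<AdMetricsSummary> it = result.iterator();
    while (it.next()) {
      final AdMetricsSummary s = it.getSummary();
      imp += s.impressions; clk += s.clicks; spend += s.adSpend;
    }
    final double theta = result.getTheta();
    System.out.println("est impressions = " + (imp / theta));
    System.out.println("est clicks      = " + (clk / theta));
    System.out.println("est adSpend     = " + (spend / theta));
    System.out.println("est users       = " + result.getEstimate()
        + " [" + result.getLowerBound(2) + ", " + result.getUpperBound(2) + "]");
  }
}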
The other option would be to do the cross-product up front and create 6000 Tuple sketches (10 X 3 X 200). In that case each input tuple would feed only one sketch, and a query would go to only one sketch. It is more accurate because it avoids the intersections, but it can quickly get unwieldy with more dimensions or high-cardinality dimensions.

It is also possible to do a mix of these two approaches, but you would need to think it through carefully.

I hope this helps.

Lee.

On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:

> Hi,
> Just like theta sketches are used for distinct-count metrics like impressions and clicks, is there a sketch (perhaps quantile?) that can be used for metrics like ad_spend? If so, what are the error bounds?
>
> There is a big opportunity that I see in storing very little data in sketches (which I use as a set), which makes retrieval of aggregate/analytic data very fast (although it is approximate).
>
> This question is best explained with an example. Say I have a fact table schema as follows:
>
> 1. *Age_Groups* is a dimension with 10 bucketed distinct values {less than 18, 19-25, 26-35, ...}
> 2. *Gender* is a dimension with 3 distinct values: F, M & Unknown
> 3. *Country* is a dimension with 200 possible values
> 4. *Impressions* is a metric which is a long count (perhaps with a Theta sketch or HLL)
> 5. *Clicks* is a metric which is a long count (perhaps with a Theta sketch or HLL)
> 6. *Ad-Spend* is a metric which is a double and is summed (*perhaps with a quantile sketch??*)
>
> The maximum possible number of entries in this table would be the cross product of the cardinalities of the dimensions, which is 10 (Age_Group) x 3 (Gender) x 200 (Country) = 6000. Now instead of storing 6000 records, I could store only (10 + 3 + 200) x 3 sketches (one each for Impressions, Clicks and Ad-Spend) = 639 sketches and accomplish the group-by queries using the set operations that sketches provide.
>
> For the impression metric: one sketch for Gender=F, another sketch for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well.
> For the click metric: one sketch for Gender=F, another sketch for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well.
> The question is: what sketch would I need for Ad-Spend?
>
> On a side note, I have used Theta sketches in this manner and have even implemented ECLAT for frequent-itemset computation and association-rule mining. An example from another project is below, with Theta sketches used for the count; I do not know how to do the same for a metric like Ad-Spend.
>
> Level  Conjunction             Count
> 2      H10 & MemberGender=F    74
> 2      M15 & MemberGender=F    83
> 2      G31 & MemberGender=F    59
> 2      MemberGender=F & R13    66
>
> *In the example above, H10, M15, etc. are International Disease codes for specific diseases.* https://www.aapc.com/codes/code-search/
>
> Hope this is a clear representation of the problem.
>
> Regards
> Vijay