Hi Alex, You are right. I assumed that quantiles would have an intersection like Theta Sketches but they don't.
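For context, the Theta sketch Intersection referred to here estimates plain set intersection over unique IDs. A minimal, library-free Java sketch of the exact operation a Theta Intersection approximates (the event IDs are hypothetical; this is not the DataSketches API):

```java
import java.util.HashSet;
import java.util.Set;

public class ExactIntersection {
    // Exact analogue of a Theta sketch Intersection: keep only the IDs present in both sets.
    static Set<String> intersect(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> impressions = new HashSet<>(Set.of("e1", "e2", "e3"));
        Set<String> clicks = new HashSet<>(Set.of("e2", "e3", "e4"));
        // A Theta sketch Intersection would return an estimate close to the exact answer, 2.
        System.out.println(intersect(impressions, clicks).size()); // prints 2
    }
}
```

A Theta sketch does this over hashed IDs in bounded memory, which is why the result is an estimate rather than an exact count.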
If they did, keyed on transaction ID, it would have been cool. So many traditional data mining & predictive analytics algorithms could be reimplemented with sketches. Thanks for your insight.
Vijay

On Mon, May 23, 2022, 22:18 Alexander Saydakov <sayda...@yahooinc.com> wrote:
> This seems to be a convoluted way of computing the sum of all values. This is an additive metric, easy to compute exactly, no sketches needed.
>
> On Sun, May 22, 2022 at 5:56 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>> Hi folks (and Lee),
>>
>> I think I have found what I was looking for in quantile sketches, though I am not able to formulate error bounds for it.
>> I should have raised a PR, but I am going to write the code here. The code below estimates the volume of the quantiles sketch based on the example https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html.
>>
>> The assumption is that the values put in the sketch are all greater than 0. It works like this:
>> 1. Get 100 quantiles using getQuantiles.
>> 2. Get histogram percentage numbers via getPMF.
>> 3. The averages of the neighboring quantile boundaries (getMinValue and getMaxValue for the extremities) are combined with getN to estimate the volume.
>>
>> The code is straightforward. I think this is what I was looking for, and honestly I do not care if the max and min values are really extreme compared to the rest of the numbers, as my use case is supposed to remove outliers.
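The estimate described in steps 1-3 is a midpoint rule: volume ≈ N × Σ (mass of bin i) × (midpoint of bin i). A self-contained check of that arithmetic, with exact bin masses and midpoints standing in for the sketch's approximate ones (no sketch library involved):

```java
public class MidpointSumEstimate {
    // volume ≈ N * sum over bins of (mass fraction in bin) * (midpoint of bin)
    static double estimate(double[] pmf, double[] midpoints, long n) {
        double total = 0.0;
        for (int i = 0; i < pmf.length; i++) {
            total += pmf[i] * n * midpoints[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Values {1, 2, 3, 4} split into two bins [1, 2] and [3, 4], each holding half the mass.
        double[] pmf = {0.5, 0.5};
        double[] midpoints = {1.5, 3.5};
        System.out.println(estimate(pmf, midpoints, 4)); // prints 10.0, the exact sum 1+2+3+4
    }
}
```

With a real DoublesSketch the pmf and quantile boundaries are themselves approximate, so the volume estimate inherits the sketch's rank error on top of the binning error.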
>> private static double estimateVolume(DoublesSketch sketch) {
>>   double retVal = 0.0;
>>   int n = 100;
>>
>>   // n evenly spaced quantiles; quantiles[0] is near the min, quantiles[n-1] near the max
>>   double[] quantiles = sketch.getQuantiles(n);
>>
>>   // getPMF returns the fraction of the stream mass falling between successive split points
>>   double[] histogramCounts = sketch.getPMF(quantiles);
>>
>>   // Estimate the volume as N * sum(mass of bin i * midpoint of bin i)
>>   retVal += histogramCounts[0] * sketch.getN() * ((sketch.getMinValue() + quantiles[0]) / 2);
>>   for (int i = 1; i < n - 1; i++) {
>>     retVal += histogramCounts[i] * sketch.getN() * ((quantiles[i - 1] + quantiles[i]) / 2);
>>   }
>>   retVal += histogramCounts[n - 1] * sketch.getN() * ((sketch.getMaxValue() + quantiles[n - 1]) / 2);
>>
>>   return retVal;
>> }
>>
>> Regards
>> Vijay
>>
>> On Mon, May 2, 2022 at 10:10 PM Will Lauer <wla...@yahooinc.com> wrote:
>>> OK, this is interesting. I've got some concerns and questions that I've put inline below.
>>>
>>> Will
>>>
>>> Will Lauer
>>> Senior Principal Architect, Audience & Advertising Reporting
>>> Data Platforms & Systems Engineering
>>> M 508 561 6427
>>> Champaign Office
>>> 1908 S. First St
>>> Champaign, IL 61822
>>>
>>> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>>> Thanks Lee. Please find my answers inline below in blue. I think you will find my use case very interesting. My next endeavor would be to build a decision tree with entropy / Gini impurity measures with sketches. I am amazed at some of the results I have gotten. You may find this quite interesting.
>>>>
>>>> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:
>>>>> Vijay,
>>>>> Sorry about the delay in getting back to you.
>>>>> There is some critical information missing from your description, and that is the domain of what you are sketching.
>>>>> I presume that it is User-IDs; otherwise it doesn't make sense.
>>>>> If this is the case, I think the solution can be achieved in a couple of ways. Going forward with this assumption:
>>>>> Your raw data would look something like this:
>>>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>> Actually it would be an event_id instead of user_id. The domain for the data is a hypothetical online advertising summary.
>>>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>> The problem that we faced in the past was that it was impossible to "join" a click to an impression. While the raw impression data would have a unique event_id, the raw click data would also have its own event_id. The dimensions were the same for clicks and impressions. What we did back in the day was aggregate the data for impressions separately, aggregate for clicks separately, and then join.
>>>> You could assume the same here. Each dimension aggregate tuple {AgeGroup, Gender and Country in our case} would have a theta sketch for impressions and another for clicks.
>>> I don't see what a "theta sketch for impressions and another for clicks" would mean here. Theta sketches are for doing distinct counts, so I could understand a theta sketch for EventId, counting the number of distinct event ids associated with the AgeGroup/Gender/Country combination. But that doesn't seem to make sense for clicks or impressions. If your original data was {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}, it implies to me that clicks and impressions would always be 1 (that is, an event is either a single click or a single impression). When you drop the EventId dimension, the natural aggregation for these (as well as for AdSpend) is summation, giving you the total clicks, total impressions (and total AdSpend) for the Age/Gender/Country combination.
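Will's point that impressions, clicks, and AdSpend aggregate by plain summation can be sketched as follows (field names are hypothetical, mirroring the schema in the thread; no sketch library needed):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdditiveRollup {
    // One raw event row; the field names mirror the thread's hypothetical schema.
    record Event(String ageGroup, String gender, String country,
                 long impressions, long clicks, double adSpend) {}

    // Drop EventId and sum the additive metrics per (AgeGroup, Gender, Country) cell.
    static Map<String, double[]> rollup(Iterable<Event> events) {
        Map<String, double[]> totals = new HashMap<>();
        for (Event e : events) {
            String key = e.ageGroup() + "|" + e.gender() + "|" + e.country();
            double[] t = totals.computeIfAbsent(key, k -> new double[3]);
            t[0] += e.impressions();
            t[1] += e.clicks();
            t[2] += e.adSpend();
        }
        return totals;
    }

    public static void main(String[] args) {
        var rows = List.of(
            new Event("19-25", "F", "IN", 1, 0, 0.10),
            new Event("19-25", "F", "IN", 1, 1, 0.25),
            new Event("26-35", "M", "US", 1, 0, 0.05));
        double[] t = rollup(rows).get("19-25|F|IN");
        // 2 impressions, 1 click, 0.35 ad spend for the 19-25/F/IN cell
        System.out.printf("%.0f %.0f %.2f%n", t[0], t[1], t[2]);
    }
}
```

This is the exact computation; no sketch is required because sums are mergeable without approximation.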
>>> If you were to generate a Theta sketch, it seems like it would be nonsensical, as all impressions or clicks inserted would be the same value and the theta sketch would always return "1".
>>>>> A proposed solution would be to configure a TupleSketch with 3 fields: (Impressions, Clicks, AdSpend).
>>>>> (This requires a little programming. You would need to create classes that implement the interfaces Summary, SummaryFactory and SummarySetOperations. There are examples of how to do this on the website.)
>>>> I will look into the Tuple sketch and the SummaryFactory and SummarySetOperations. I did try reading about them, but I cannot say I understood this completely. What type of sketch would we use for AdSpend? Would it be Quantile?
>>>>> Then you would process your raw data as follows:
>>>>> - Partition your raw data into 213 streams defined by the three dimensions:
>>>>>   - 10 X AgeGroup
>>>>>   - 3 X Gender
>>>>>   - 200 X Country
>>>>> - Each stream would feed a configured TupleSketch as above. (So you need only 213 sketches.)
>>>>> - These sketches are in the form of a 3D hypercube.
>>>>> - Each input tuple would feed 3 sketches, one in each dimension.
>>>> I believe I did something similar. Let me try explaining this with concrete examples.
>>>> 1. This is the raw data <https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing> (please click on the link https://tinyurl.com/3d2r7xjh) for a company in India that provides car drivers for hire.
The data has most columns as categorical variables except the first, which is tripId. The schema looks something like this:
>>>> 1. { *tripId*, source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, *tripOutcome* }
>>>> 2. tripId, the first column, is distinct per row; I put it into theta sketches as shown later.
>>>> 3. tripOutcome, the last column, is a classification variable which has the values "bad" or "good".
>>>> 4. The problem statement is to do data analysis of combinations of dimensions that produce an unfavourable result, that is, where "tripOutcome=bad/(tripOutcome=bad + tripOutcome=good)" is high.
>>>> 2. I was able to populate theta sketches for each dimension and value pair (example "city=Mumbai") and store them as my effective data representation.
>>>> 1. with 10 as the base and hence nominal entries of 2^10, the data is https://tinyurl.com/bkvz4xrh
>>>> 2. with 12 as the base and hence nominal entries of 2^12, the data is https://tinyurl.com/f9mzm8ef
>>>> 3. with 14 as the base and hence nominal entries of 2^14, the data is https://tinyurl.com/2p8kwj3m
>>>> 4. The data looks like this:
>>>> 1. dimension=value, Base64 encoded Theta sketch formed from unique tripId
>>> I'm not sure what you are sketching here. What is the entry being inserted into the sketch? Basically, what are you counting?
Normally, I >>> would think of sketching as taking your original data (tripId, source, estimated_usage_bins, city, zone, zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type, tripOutcome) and picking a column (or columns) that you want to count the distinct values of, and generating a new table that contains the dimensions you care about plus that new sketch. In this case, if you are counting trips, the resulting table would look like {source, estimate_usage_bins, city, zone, zone_demand_popularity, dayType, pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You could then report on a variety of combinations by dropping dimensions and aggregating (merging) the sketches.
>>>> 1. Next I run an Apriori-like Frequent Itemset algorithm [ECLAT, https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17] to generate combinations of "dimension=value" with a minimum support threshold. The input to this implementation takes the files above at point 2, e.g. https://tinyurl.com/2p8kwj3m
>>>> 2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad and tripOutcome=good, and then joined the results. The answers I get are listed below.
>>>> 1.
with input having dimension=value and theta sketch with 2^10 nominal entries, answers can be found here: https://tinyurl.com/yc3a3rde
>>>> 2. with input having dimension=value and theta sketch with 2^12 nominal entries, answers can be found here: https://tinyurl.com/5n7jjr32
>>>> 3. with input having dimension=value and theta sketch with 2^14 nominal entries, answers can be found here: https://tinyurl.com/3wrxp4a7
>>>> 3. The results show that 2^14 (16K) nominal sketches give immaculate results. There are errors (RMSE) in the results from the other sketches.
>>>> 4. Sample of results:
>>>> 1.
>>>> *level of combination, rule for combination, bad count, good count, total count, percentage of bad count*
>>>> 2.
>>>> 4,booking_type=one_way_trip & city=Pune & sla=Scheduled & source=android,26.0,9.0,35.0,74.0
>>>> 3,booking_type=one_way_trip & city=Pune & source=android,26.0,10.0,36.0,72.0
>>>> 5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>>> 4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android & zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>>> 5,booking_type=one_way_trip & city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>>> 3,booking_type=one_way_trip & sla=Scheduled & zone=205,14.0,9.0,23.0,61.0
>>>> I am aware from one of Lee's recent YouTube videos that theta sketches work like sets (bit sets) when they are large enough and the data is small. My use case is fine with a count of 100 being reported as 80 or 120 (20%), but not much more error than that. I could use the Jaccard formula explained by Lee to get the margin of error.
>>>> Sorry for the details, but this should help you judge whether what I am doing is indeed what tuple sketches do on their own. I am using update sketches for merging using conjunction. Code is here: *https://tinyurl.com/4pzfsj6v*
>>>>> A query would choose the desired coordinates of the 3 dimensions.
>>>>> If the query selects more than one coordinate from any one dimension, then you would need to first merge the corresponding coordinate sketches of that dimension together.
>>>>> A missing dimension would mean first merging all coordinate sketches of that dimension together.
>>>>> The final result sketch is an Intersection of the resulting 3 dimensional sketches.
>>>> Yes. The code here should serve as an example: https://tinyurl.com/4pzfsj6v
>>>>> With the resulting Tuple sketch you iterate over all the retained items (Summaries) and sum each of the 3 fields.
>>>>> Then you divide each sum by Theta. The result is the estimate of Impressions, Clicks, and AdSpend for the population of users selected by the query. The size of that population would be obtained by getEstimate().
>>>> My question remains. Which base sketch type should I use for an additive metric like Ad-spend? Would that be a *quantile sketch*?
>>> I still think this question doesn't make sense. What are you trying to understand with the sketch? Are you trying to find out the distribution of adspend for a particular set of dimension values? That sounds like a quantile sketch. Are you trying to find out how many different values of adspend there were? That's a distinct counting sketch. Are you trying to find out the most common adspend values? That would be a frequent items sketch.
>>> But really, none of those actually seem to match what you described above. What you described above sounded like you were simply trying to calculate the total AdSpend, which just requires a sum, not a sketch.
>>>>> Be aware that Intersections, by their nature, can increase error significantly. (See the website.)
>>>>> Try to avoid doing intersections with sketch coordinates that have very small populations, if possible.
>>>> Yes. However, if I use very high nominal entries like 2^20, shouldn't this get mitigated?
>>> The point of using a sketch is estimation. If you can't tolerate any error, you shouldn't be using sketches in the first place. Don't just assume that you SHOULD use high nominal entries. That said, when using theta sketches and doing intersections, we do sometimes use values like 2^20 for nominal entries, but often it is overkill. Remember, in the case of intersections, you get high error when the intersection size is small compared with the overall set sizes. The cases where you see high relative error are the cases where you are looking at small intersections. In most use cases, a relative error of 1,000% or even 10,000% on an intersection size of 10 when comparing sets of a billion is still a reasonable error given how small the _actual_ value is.
>>>>> You can get a sense of how good your results are by utilizing getUpperBound() and getLowerBound() (with respect to the user population). The bigger that spread is, the worse your estimates will be.
>>>> Thanks. This is perfect. I do not need to use the Jaccard to get my estimate.
>>>>> The other option would be to do the cross-product upfront and create 6000 Tuple sketches.
>>>>> In this case the input tuple would feed only one sketch, and the query would go to only one sketch.
>>>> Since I am not using SQL queries and have my own unique use case, my questions are quite different.
>>>>> This can quickly get unwieldy with more dimensions or high-cardinality dimensions.
>>>>> It is more accurate because it avoids the intersections. It is also possible to do a mix of these two approaches, but you would need to think it through carefully.
>>>>> I hope this helps.
>>>>> Lee.
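Lee's query procedure (union the selected coordinate sketches within a dimension, then intersect across dimensions) can be mimicked exactly with HashSets. A toy stand-in for the sketches, with hypothetical user IDs and dimension values:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HypercubeQuery {
    // Union the selected coordinate sets within one dimension
    // (exact analogue of merging coordinate sketches).
    static Set<String> mergeCoordinates(List<Set<String>> selected) {
        Set<String> union = new HashSet<>();
        for (Set<String> s : selected) union.addAll(s);
        return union;
    }

    // Intersect the per-dimension results across dimensions
    // (exact analogue of the final sketch Intersection).
    static Set<String> query(List<Set<String>> perDimension) {
        Set<String> result = new HashSet<>(perDimension.get(0));
        for (int i = 1; i < perDimension.size(); i++) {
            result.retainAll(perDimension.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> age19to25 = Set.of("u1", "u2", "u3");
        Set<String> genderF = Set.of("u2", "u3", "u4");
        Set<String> india = Set.of("u3", "u4", "u5");
        Set<String> usa = Set.of("u2", "u6");
        // Query: AgeGroup=19-25 AND Gender=F AND Country IN (India, USA).
        Set<String> country = mergeCoordinates(List.of(india, usa));
        Set<String> hits = query(List.of(age19to25, genderF, country));
        System.out.println(hits.size()); // prints 2 (u2 and u3)
    }
}
```

With real sketches the unions are cheap and nearly error-free, while the final intersection is where the relative error grows, which is exactly the caveat discussed above.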
>>>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>>>>> Hi,
>>>>>> Just like theta sketches are used for distinct count metrics like impressions and clicks, is there a sketch (perhaps quantile?) that can be used for metrics like ad_spend? If so, what are the error bounds?
>>>>>> There is a big opportunity that I see in storing very little data in sketches (which I use as a set), which makes retrieval of aggregate/analytic data very fast (although it is approximate).
>>>>>> This question is best explained with an example. Say I have a fact table schema as follows:
>>>>>> 1. *Age_Groups* is a dimension with 10 bucketed distinct values {less than 18, 19-25, 26-35, ...}
>>>>>> 2. *Gender* is a dimension with 3 distinct values F, M & Unknown
>>>>>> 3. *Country* is a dimension with 200 possible values
>>>>>> 4. *Impressions* is a metric which is a long count (perhaps with Theta sketch or HLL)
>>>>>> 5. *Clicks* is a metric which is a long count (perhaps with Theta sketch or HLL)
>>>>>> 6. *Ad-Spend* is a metric which is a double which is a sum (*perhaps with quantile sketch??*)
>>>>>> The maximum possible number of entries in this table would be the cross product of the cardinalities of the dimensions, which is 10 (Age_Group) x 3 (Gender) x 200 (Country) = 6000. Now instead of storing 6000 records, I could store only (10 + 3 + 200) * *3 sketches (one each for impressions, clicks and Ad-Spend)* = 639 sketches and accomplish the group-by queries using the set operations that sketches provide.
>>>>>> For the impression metric, one sketch for Gender=F, another sketch for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well.
>>>>>> For the click metric, one sketch for Gender=F, another sketch for Gender=M, yet another for Gender=Unknown, and so on for the other dimensions as well.
>>>>>> The question is: what sketch would I need for ad-spend?
>>>>>> On a side note, I have used Theta sketches in this manner and even implemented ECLAT for frequent itemset computations and association rule mining ... example from another project below with theta sketches used for count; I do not know how to do the same for a metric like Ad-Spend.
>>>>>> Level Conjunction Count
>>>>>> 2 H10 & MemberGender=F 74
>>>>>> 2 M15 & MemberGender=F 83
>>>>>> 2 G31 & MemberGender=F 59
>>>>>> 2 MemberGender=F & R13 66
>>>>>> *In the example above, H10, M15, etc. are ICD (International Classification of Diseases) codes for specific diseases.* https://www.aapc.com/codes/code-search/
>>>>>> Hope this is a clear representation of the problem.
>>>>>> Regards
>>>>>> Vijay