This seems to be a convoluted way of computing the sum of all values. This is an additive metric, easy to compute exactly, no sketches needed.
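For comparison, the exact answer for an additive metric is a one-pass running sum; the toy below (class name and values hypothetical, not from the thread) is all the machinery that case requires:

```java
public class RunningSum {
    // Exact aggregate for an additive metric: one pass, one accumulator,
    // exact regardless of stream length. No sketch needed.
    static double sum(double[] stream) {
        double total = 0.0;
        for (double v : stream) total += v;
        return total;
    }

    public static void main(String[] args) {
        double[] adSpend = {12.5, 3.0, 7.25, 41.0, 0.75}; // hypothetical values
        System.out.println("exact total = " + sum(adSpend)); // prints: exact total = 64.5
    }
}
```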
On Sun, May 22, 2022 at 5:56 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:

> Hi folks (and Lee),
>
> I think I have found what I was looking for in quantile sketches, though
> I am not able to formulate error bounds for it. I should have raised a
> PR, but I am going to write the code here. The code below estimates the
> volume of the quantile sketch, based on the example at
> https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html
>
> The assumption is that the values put in the sketch are all greater than
> 0. It works like this:
>
> 1. Get 100 quantiles using getQuantiles.
> 2. Get the histogram percentage numbers with getPMF.
> 3. Use the averages of adjacent quantile boundaries (getMinValue and
>    getMaxValue at the extremities) together with getN to estimate the
>    volume.
>
> The code is straightforward. I think this is what I was looking for,
> and honestly I do not care if the max and min values are really extreme
> compared to the rest of the numbers, as my use case is supposed to
> remove outliers.
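The three steps above can be sanity-checked without DataSketches at all. The toy below (class name and data are hypothetical) applies the same midpoint rule to exact quantiles taken from sorted data; with evenly spaced quantiles the estimate tracks the true sum closely for well-behaved distributions:

```java
import java.util.Arrays;
import java.util.Random;

public class MidpointTotal {
    // Estimate the sum of a data set from n evenly spaced quantiles by
    // weighting each inter-quantile midpoint with the bucket's item count.
    static double estimateTotal(double[] sorted, int n) {
        long N = sorted.length;
        double[] q = new double[n + 1];
        for (int i = 0; i <= n; i++) {
            q[i] = sorted[(int) Math.min(N - 1, Math.round((double) i * (N - 1) / n))];
        }
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            total += (N / (double) n) * (q[i] + q[i + 1]) / 2.0; // count x midpoint
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] data = new double[100_000];
        double exact = 0.0;
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextDouble() * 100.0;
            exact += data[i];
        }
        Arrays.sort(data);
        double est = estimateTotal(data, 100);
        System.out.printf("exact=%.1f  est=%.1f  relErr=%.5f%n",
                exact, est, Math.abs(est - exact) / exact);
    }
}
```

The sketch-based version replaces the exact quantiles with getQuantiles() output, which is where the (unquantified) rank error enters.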
> private static double estimateVolume(DoublesSketch sketch) {
>   int n = 100;
>   double[] quantiles = sketch.getQuantiles(n); // n evenly spaced split points
>   // getPMF returns n + 1 bucket masses for n split points
>   double[] masses = sketch.getPMF(quantiles);
>
>   // Total = sum over buckets of (mass x stream length x bucket midpoint),
>   // using min/max for the two open-ended buckets.
>   double total = masses[0] * sketch.getN()
>       * ((sketch.getMinValue() + quantiles[0]) / 2);
>   for (int i = 1; i < n; i++) {
>     total += masses[i] * sketch.getN()
>         * ((quantiles[i - 1] + quantiles[i]) / 2);
>   }
>   total += masses[n] * sketch.getN()
>       * ((quantiles[n - 1] + sketch.getMaxValue()) / 2);
>   return total;
> }
>
> Regards
> Vijay
>
> On Mon, May 2, 2022 at 10:10 PM Will Lauer <wla...@yahooinc.com> wrote:
>
>> OK, this is interesting. I've got some concerns and questions that I've
>> put inline below.
>>
>> Will
>>
>> Will Lauer
>> Senior Principal Architect, Audience & Advertising Reporting
>> Data Platforms & Systems Engineering
>> M 508 561 6427
>> Champaign Office, 1908 S. First St, Champaign, IL 61822
>>
>> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>
>>> Thanks Lee. Please find my answers inline below in blue. I think you
>>> will find my use case very interesting. My next endeavor will be to
>>> build a decision tree with entropy / Gini impurity measures on top of
>>> sketches. I am amazed at some of the results I have gotten.
>>>
>>> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:
>>>
>>>> Vijay,
>>>> Sorry about the delay in getting back to you.
>>>>
>>>> There is some critical information missing from your description, and
>>>> that is the domain of what you are sketching. I presume that it is
>>>> User-IDs; otherwise it doesn't make sense. If this is the case, I
>>>> think the solution can be achieved in a couple of ways.
>>>> Going forward with this assumption, your raw data would look
>>>> something like this:
>>>>
>>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>>
>>> Actually it would be an event_id instead of a user_id. The domain for
>>> the data is a hypothetical online advertising summary:
>>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>>> The problem that we faced in the past was that it was impossible to
>>> "join" a click to an impression. While the raw impression data would
>>> have a unique event_id, the raw click data would also have its own
>>> event_id. The dimensions were the same for clicks and impressions.
>>> What we did back in the day was aggregate the data for impressions
>>> separately, aggregate for clicks separately, and then join.
>>>
>>> You could assume the same here. Each dimension aggregate tuple
>>> ({AgeGroup, Gender, Country} in our case) would have a theta sketch
>>> for impressions and another for clicks.
>>
>> I don't see what a "theta sketch for impressions and another for
>> clicks" would mean here. Theta sketches are for doing distinct counts,
>> so I could understand a theta sketch for EventId, counting the number
>> of distinct event ids associated with the AgeGroup/Gender/Country
>> combination, but that doesn't seem to make sense for clicks or
>> impressions. If your original data was {EventId, AgeGroup, Gender,
>> Country, Impressions, Clicks, AdSpend}, it implies to me that
>> Impressions and Clicks would always be 1 (that is, an event is either a
>> single click or a single impression). When you drop the EventId
>> dimension, the natural aggregation for these (as well as for AdSpend)
>> is summation, giving you the total clicks, total impressions (and
>> total AdSpend) for the Age/Gender/Country combination.
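Lee's estimator later in the thread (iterate the Tuple sketch's retained summaries, sum each field, divide by theta) reduces to the scaling step below. The class and numbers are a hypothetical toy, not the DataSketches Tuple API:

```java
public class TupleEstimate {
    // Sum one field over a sketch's retained summaries, then divide by
    // theta to scale the retained sample back up to the population.
    static double estimateField(double[] retainedFieldValues, double theta) {
        double sum = 0.0;
        for (double v : retainedFieldValues) sum += v;
        return sum / theta;
    }

    public static void main(String[] args) {
        // Hypothetical: 4 retained users' AdSpend, sampled at theta = 0.01,
        // i.e. the sketch kept roughly 1% of the distinct users.
        double[] adSpend = {1.25, 0.40, 2.10, 0.75};
        System.out.printf("estimated total AdSpend = %.2f%n",
                estimateField(adSpend, 0.01));
    }
}
```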
>> If you were to generate a Theta sketch, it seems like it would be
>> nonsensical, as all impressions or clicks inserted would be the same
>> value and the theta sketch would always return "1".
>>
>>>> A proposed solution would be to configure a TupleSketch with 3
>>>> fields: (Impressions, Clicks, AdSpend). (This requires a little
>>>> programming. You would need to create classes that implement the
>>>> interfaces Summary, SummaryFactory and SummarySetOperations. There
>>>> are examples of how to do this on the website.)
>>>
>>> I will look into the Tuple sketch and the SummaryFactory and
>>> SummarySetOperations. I did try reading about them, but I cannot say I
>>> understood this completely. What type of sketch would we use for
>>> AdSpend? Would it be Quantile?
>>>
>>>> Then you would process your raw data as follows:
>>>>
>>>> - Partition your raw data into 213 streams defined by the three
>>>>   dimensions:
>>>>   - 10 X AgeGroup
>>>>   - 3 X Gender
>>>>   - 200 X Country
>>>> - Each stream would feed a configured TupleSketch as above (so you
>>>>   need only 213 sketches).
>>>> - These sketches are in the form of a 3D hypercube.
>>>> - Each input tuple would feed 3 sketches, one in each dimension.
>>>
>>> I believe I did something similar. Let me try explaining this with
>>> concrete examples.
>>>
>>> 1. This is the raw data:
>>>    https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing
>>>    (or https://tinyurl.com/3d2r7xjh) for a company in India that
>>>    provides car drivers for hire.
>>>    The data has most columns as categorical variables except the
>>>    first, which is tripId. The schema looks something like this:
>>>
>>>    1. { *tripId*, source, estimated_usage_bins, city, zone,
>>>       zone_demand_popularity, dayType, pickUpHourOfDay, sla,
>>>       booking_type, *tripOutcome* }
>>>    2. tripId, the first column, is distinct per row; I put it into
>>>       theta sketches as shown later.
>>>    3. tripOutcome, the last column, is a classification variable
>>>       which has the values "bad" or "good".
>>>    4. The problem statement is to do data analysis of combinations of
>>>       dimensions that produce an unfavourable result, i.e. where
>>>       tripOutcome=bad / (tripOutcome=bad + tripOutcome=good) is high.
>>>
>>> 2. I was able to populate theta sketches for each dimension and value
>>>    pair (for example "city=Mumbai") and store them as my effective
>>>    data representation.
>>>
>>>    1. With 10 as base, and hence 2^10 nominal entries, the data is
>>>       https://tinyurl.com/bkvz4xrh
>>>    2. With 12 as base, and hence 2^12 nominal entries, the data is
>>>       https://tinyurl.com/f9mzm8ef
>>>    3. With 14 as base, and hence 2^14 nominal entries, the data is
>>>       https://tinyurl.com/2p8kwj3m
>>>    4. The data looks like this:
>>>       dimension=value, Base64-encoded Theta sketch formed from unique
>>>       tripId
>>
>> I'm not sure what you are sketching here. What is the entry being
>> inserted into the sketch? Basically, what are you counting?
>> Normally, I would think of sketching as taking your original data
>> (tripId, source, estimated_usage_bins, city, zone,
>> zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type,
>> tripOutcome), picking a column (or columns) whose distinct values you
>> want to count, and generating a new table that contains the dimensions
>> you care about plus that new sketch. In this case, if you are counting
>> trips, the resulting table would look like {source,
>> estimate_usage_bins, city, zone, zone_demand_popularity, dayType,
>> pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You
>> could then report on a variety of combinations by dropping dimensions
>> and aggregating (merging) the sketches.
>>
>>> 1. Next I run an Apriori-like frequent itemset algorithm [ECLAT,
>>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17]
>>>    to generate combinations of "dimension=value" with a minimum
>>>    support threshold. The input to this implementation is the files
>>>    above at point 2, e.g. https://tinyurl.com/2p8kwj3m
>>> 2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad
>>>    and tripOutcome=good, and then joined the results. The answers I
>>>    get are listed below.
>>>    1. With input having dimension=value and theta sketches with 2^10
>>>       nominal entries, the answers can be found here:
>>>       https://tinyurl.com/yc3a3rde
>>>    2.
With input having dimension=value and theta sketches with 2^12
>>>       nominal entries, the answers can be found here:
>>>       https://tinyurl.com/5n7jjr32
>>>    3. With input having dimension=value and theta sketches with 2^14
>>>       nominal entries, the answers can be found here:
>>>       https://tinyurl.com/3wrxp4a7
>>> 3. The results show that the 2^14 (16K) nominal-entries sketches give
>>>    immaculate results. There are errors (RMSE) in the results from
>>>    the other sketches.
>>> 4. Sample of results (columns: level of combination, rule for the
>>>    combination, bad count, good count, total, percentage bad):
>>>
>>>    4,booking_type=one_way_trip & city=Pune & sla=Scheduled & source=android,26.0,9.0,35.0,74.0
>>>    3,booking_type=one_way_trip & city=Pune & source=android,26.0,10.0,36.0,72.0
>>>    5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>>    4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android & zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>>    5,booking_type=one_way_trip & city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>>    3,booking_type=one_way_trip & sla=Scheduled & zone=205,14.0,9.0,23.0,61.0
>>>
>>> I am aware from one of Lee's recent YouTube videos that theta
>>> sketches work like sets (bit sets) when they are large enough and the
>>> data is small. My use case is fine with a count of 100 being reported
>>> as 80 or 120 (20%) but not much higher.
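For reference, a Theta sketch in estimation mode has a relative standard error of roughly 1/sqrt(k), where k is the number of nominal entries (see the DataSketches accuracy documentation for the exact formula). A quick check of the three configurations used above:

```java
public class ThetaRse {
    // Approximate relative standard error of a Theta sketch in
    // estimation mode: rse ~ 1 / sqrt(k), k = nominal entries = 2^lgK.
    static double rse(int lgK) {
        return 1.0 / Math.sqrt(1 << lgK);
    }

    public static void main(String[] args) {
        for (int lgK : new int[]{10, 12, 14}) {
            System.out.printf("k=2^%d  rse ~ %.3f%%  (+/- %.1f at 1 sd on a count of 100)%n",
                    lgK, 100 * rse(lgK), 100 * rse(lgK));
        }
    }
}
```

At one standard deviation this is about 3.1% for 2^10, 1.6% for 2^12, and 0.8% for 2^14, so the 20% tolerance quoted above is comfortably inside the bounds even for the smallest configuration (intersections, discussed later in the thread, can be much worse).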
>>> I could use the Jaccard formula explained by Lee to get the margin of
>>> error.
>>>
>>> Sorry for the details, but this should help you suggest whether what
>>> I am doing is indeed what tuple sketches do on their own. I am using
>>> update sketches, merging using conjunction. Code is here:
>>> https://tinyurl.com/4pzfsj6v
>>>
>>>> A query would choose the desired coordinates of the 3 dimensions.
>>>> If the query selects more than one coordinate from any one
>>>> dimension, then you would need to first merge the corresponding
>>>> coordinate sketches of that dimension together. A missing dimension
>>>> would mean first merging all coordinate sketches of that dimension
>>>> together. The final result sketch is an Intersection of the
>>>> resulting 3 dimensional sketches.
>>>
>>> Yes, the code here should serve as an example:
>>> https://tinyurl.com/4pzfsj6v
>>>
>>>> With the resulting Tuple sketch you iterate over all the retained
>>>> items (Summaries) and sum each of the 3 fields. Then you divide each
>>>> sum by Theta. The result is the estimate of Impressions, Clicks, and
>>>> AdSpend for the population of users selected by the query. The size
>>>> of that population would be obtained by getEstimate().
>>>
>>> My question remains. Which base sketch type should I use for an
>>> additive metric like Ad-Spend? Would that be a *quantile sketch*?
>>
>> I still think this question doesn't make sense. What are you trying to
>> understand with the sketch? Are you trying to find out the
>> distribution of adspend for a particular set of dimension values?
>> Then that sounds like a quantile sketch. Are you trying to find out
>> how many different values of adspend there were? That's a
>> distinct-counting sketch. Are you trying to find out the most common
>> adspend values? That would be a frequent items sketch.
>>
>> But really, none of those actually seems to match what you described
>> above. What you described above sounded like you were simply trying to
>> calculate the total AdSpend, which just requires a sum, not a sketch.
>>
>>>> Be aware that Intersections, by their nature, can increase error
>>>> significantly (see the website). Try to avoid doing intersections
>>>> with sketch coordinates that have very small populations, if
>>>> possible.
>>>
>>> Yes. However, if I use very high nominal entries like 2^20, shouldn't
>>> this get mitigated?
>>
>> The point of using a sketch is estimation. If you can't tolerate any
>> error, you shouldn't be using sketches in the first place, so don't
>> just assume that you SHOULD use high nominal entries. That said, when
>> using theta sketches and doing intersections, we do sometimes use
>> values like 2^20 for nominal entries, but often it is overkill.
>> Remember, in the case of intersections, you get high error when the
>> intersection size is small compared with the overall set sizes; the
>> cases where you see high relative error are the ones where you are
>> looking at small intersections. In most use cases, a relative error of
>> 1,000% or even 10,000% on an intersection size of 10 when comparing
>> sets of a billion is still a reasonable error given how small the
>> _actual_ value is.
>>
>>>> You can get a sense of how good your results are by utilizing
>>>> getUpperBound() and getLowerBound() (with respect to the user
>>>> population). The bigger that spread is, the worse your estimates
>>>> will be.
>>>
>>> Thanks. This is perfect.
>>> I do not need to use the Jaccard to get my estimate.
>>>
>>>> The other option would be to do the cross-product upfront and create
>>>> 6000 Tuple sketches. In this case each input tuple would feed only
>>>> one sketch, and the query would go to only one sketch.
>>>
>>> Since I am not using SQL queries and have my own unique use case, my
>>> questions are quite different.
>>>
>>>> This can quickly get unwieldy with more dimensions or
>>>> high-cardinality dimensions. It is more accurate because it avoids
>>>> the intersections. It is also possible to do a mix of these two
>>>> approaches, but you would need to think it through carefully.
>>>>
>>>> I hope this helps.
>>>>
>>>> Lee.
>>>>
>>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Just like theta sketches are used for distinct count metrics like
>>>>> impressions and clicks, is there a sketch (perhaps quantile?) that
>>>>> can be used for metrics like ad_spend? If so, what are the error
>>>>> bounds?
>>>>>
>>>>> There is a big opportunity that I see in storing very little data
>>>>> in sketches (which I use as sets), which makes retrieval of
>>>>> aggregate/analytic data very fast (although it is approximate).
>>>>>
>>>>> This question is best explained with an example. Say I have a fact
>>>>> table schema as follows:
>>>>>
>>>>> 1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>>>    {less than 18, 19-25, 26-35, ...}
>>>>> 2. *Gender* is a dimension with 3 distinct values: F, M & Unknown
>>>>> 3. *Country* is a dimension with 200 possible values
>>>>> 4. *Impressions* is a metric which is a long count (perhaps with a
>>>>>    Theta sketch or HLL)
>>>>> 5. *Clicks* is a metric which is a long count (perhaps with a Theta
>>>>>    sketch or HLL)
>>>>> 6.
*Ad-Spend* is a metric which is a double and is summed (*perhaps
>>>>>    with a quantile sketch??*)
>>>>>
>>>>> The maximum possible number of entries in this table would be the
>>>>> cross-product of the cardinalities of the dimensions, which is
>>>>> 10 (Age_Group) x 3 (Gender) x 200 (Country) = 6000. Now instead of
>>>>> storing 6000 records, I could store only (10 + 3 + 200) * 3
>>>>> sketches (*one each for impressions, clicks and Ad-Spend*) = 639
>>>>> sketches and accomplish the group-by queries using the set
>>>>> operations that sketches provide.
>>>>>
>>>>> For the impression metric: one sketch for Gender=F, another sketch
>>>>> for Gender=M, yet another for Gender=Unknown, and so on for the
>>>>> other dimensions as well. Likewise for the click metric. The
>>>>> question is: what sketch would I need for ad-spend?
>>>>>
>>>>> On a side note, I have used Theta sketches in this manner and even
>>>>> implemented ECLAT for frequent itemset computations and association
>>>>> rule mining. An example from another project is below, with theta
>>>>> sketches used for the counts; I do not know how to do the same for
>>>>> a metric like Ad-Spend.
>>>>>
>>>>> Level  Conjunction            Count
>>>>> 2      H10 & MemberGender=F   74
>>>>> 2      M15 & MemberGender=F   83
>>>>> 2      G31 & MemberGender=F   59
>>>>> 2      MemberGender=F & R13   66
>>>>>
>>>>> *In the example above, H10, M15, etc. are International Disease
>>>>> codes for specific diseases:* https://www.aapc.com/codes/code-search/
>>>>>
>>>>> Hope this is a clear representation of the problem.
>>>>>
>>>>> Regards
>>>>> Vijay
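The ECLAT-over-sketches idea running through the thread boils down to this: keep one id set per "dimension=value" pair, and compute the support of a conjunction as the size of the intersection. In the toy below (hypothetical data; exact HashSets stand in for the Theta sketches, which give the same answer approximately):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EclatToy {
    // ECLAT's core step: the support of a conjunction of predicates is
    // the size of the intersection of their id sets.
    static Set<Long> intersect(Set<Long> a, Set<Long> b) {
        Set<Long> out = new HashSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<Long>> index = new HashMap<>();
        // Hypothetical tripId postings per dimension=value pair.
        index.put("city=Pune", new HashSet<>(Arrays.asList(1L, 2L, 3L, 5L, 8L)));
        index.put("sla=Scheduled", new HashSet<>(Arrays.asList(2L, 3L, 5L, 7L)));
        index.put("tripOutcome=bad", new HashSet<>(Arrays.asList(3L, 5L, 7L, 9L)));

        Set<Long> pair = intersect(index.get("city=Pune"), index.get("sla=Scheduled"));
        Set<Long> bad = intersect(pair, index.get("tripOutcome=bad"));
        System.out.println("support(city=Pune & sla=Scheduled) = " + pair.size()); // 3
        System.out.println("bad rate = " + bad.size() + "/" + pair.size());        // 2/3
    }
}
```

With Theta sketches replacing the HashSets, each intersection carries the error behavior Will describes above: fine when the intersection is large relative to the operands, noisy when it is small.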