This seems to be a convoluted way of computing the sum of all values. This is an additive metric, easy to compute exactly, no sketches needed.
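For comparison, the exact answer for an additive metric is a one-pass running sum; the toy below (class name and values hypothetical, not from the thread) is all the machinery that case requires:

```java
public class RunningSum {
    // Exact aggregate for an additive metric: one pass, one accumulator,
    // exact regardless of stream length. No sketch needed.
    static double sum(double[] stream) {
        double total = 0.0;
        for (double v : stream) total += v;
        return total;
    }

    public static void main(String[] args) {
        double[] adSpend = {12.5, 3.0, 7.25, 41.0, 0.75}; // hypothetical values
        System.out.println("exact total = " + sum(adSpend)); // prints: exact total = 64.5
    }
}
```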
On Sun, May 22, 2022 at 5:56 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:

> Hi folks (and Lee),
>
> I think I have found what I was looking for in quantile sketches, though
> I am not able to formulate error bounds for it. I should have raised a
> PR, but I am going to write the code here. The code below estimates the
> volume of the quantile sketch, based on the example at
> https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html
>
> The assumption is that the values put in the sketch are all greater than
> 0. It works like this:
>
> 1. Get 100 quantiles using getQuantiles.
> 2. Get the histogram percentage numbers with getPMF.
> 3. Use the averages of adjacent quantile boundaries (getMinValue and
>    getMaxValue at the extremities) together with getN to estimate the
>    volume.
>
> The code is straightforward. I think this is what I was looking for,
> and honestly I do not care if the max and min values are really extreme
> compared to the rest of the numbers, as my use case is supposed to
> remove outliers.
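The three steps above can be sanity-checked without DataSketches at all. The toy below (class name and data are hypothetical) applies the same midpoint rule to exact quantiles taken from sorted data; with evenly spaced quantiles the estimate tracks the true sum closely for well-behaved distributions:

```java
import java.util.Arrays;
import java.util.Random;

public class MidpointTotal {
    // Estimate the sum of a data set from n evenly spaced quantiles by
    // weighting each inter-quantile midpoint with the bucket's item count.
    static double estimateTotal(double[] sorted, int n) {
        long N = sorted.length;
        double[] q = new double[n + 1];
        for (int i = 0; i <= n; i++) {
            q[i] = sorted[(int) Math.min(N - 1, Math.round((double) i * (N - 1) / n))];
        }
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            total += (N / (double) n) * (q[i] + q[i + 1]) / 2.0; // count x midpoint
        }
        return total;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] data = new double[100_000];
        double exact = 0.0;
        for (int i = 0; i < data.length; i++) {
            data[i] = rnd.nextDouble() * 100.0;
            exact += data[i];
        }
        Arrays.sort(data);
        double est = estimateTotal(data, 100);
        System.out.printf("exact=%.1f  est=%.1f  relErr=%.5f%n",
                exact, est, Math.abs(est - exact) / exact);
    }
}
```

The sketch-based version replaces the exact quantiles with getQuantiles() output, which is where the (unquantified) rank error enters.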
> private static double estimateVolume(DoublesSketch sketch) {
>   int n = 100;
>   double[] quantiles = sketch.getQuantiles(n); // n evenly spaced split points
>   // getPMF returns n + 1 bucket masses for n split points
>   double[] masses = sketch.getPMF(quantiles);
>
>   // Total = sum over buckets of (mass x stream length x bucket midpoint),
>   // using min/max for the two open-ended buckets.
>   double total = masses[0] * sketch.getN()
>       * ((sketch.getMinValue() + quantiles[0]) / 2);
>   for (int i = 1; i < n; i++) {
>     total += masses[i] * sketch.getN()
>         * ((quantiles[i - 1] + quantiles[i]) / 2);
>   }
>   total += masses[n] * sketch.getN()
>       * ((quantiles[n - 1] + sketch.getMaxValue()) / 2);
>   return total;
> }
>
> Regards
> Vijay
>
> On Mon, May 2, 2022 at 10:10 PM Will Lauer <wla...@yahooinc.com> wrote:
>
>> OK, this is interesting. I've got some concerns and questions that I've
>> put inline below.
>>
>> Will
>>
>> Will Lauer
>> Senior Principal Architect, Audience & Advertising Reporting
>> Data Platforms & Systems Engineering
>> M 508 561 6427
>> Champaign Office, 1908 S. First St, Champaign, IL 61822
>>
>> On Mon, May 2, 2022 at 9:19 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>
>>> Thanks Lee. Please find my answers inline below in blue. I think you
>>> will find my use case very interesting. My next endeavor will be to
>>> build a decision tree with entropy / Gini impurity measures on top of
>>> sketches. I am amazed at some of the results I have gotten.
>>>
>>> On Sun, May 1, 2022 at 12:55 AM leerho <lee...@gmail.com> wrote:
>>>
>>>> Vijay,
>>>> Sorry about the delay in getting back to you.
>>>>
>>>> There is some critical information missing from your description, and
>>>> that is the domain of what you are sketching. I presume that it is
>>>> User-IDs; otherwise it doesn't make sense. If this is the case, I
>>>> think the solution can be achieved in a couple of ways.
>>>> Going forward with this assumption, your raw data would look
>>>> something like this:
>>>>
>>>>> {UserID, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>>
>>> Actually it would be an event_id instead of a user_id. The domain for
>>> the data is a hypothetical online advertising summary:
>>> {EventId, AgeGroup, Gender, Country, Impressions, Clicks, AdSpend}
>>>
>>> The problem that we faced in the past was that it was impossible to
>>> "join" a click to an impression. While the raw impression data would
>>> have a unique event_id, the raw click data would also have its own
>>> event_id. The dimensions were the same for clicks and impressions.
>>> What we did back in the day was aggregate the data for impressions
>>> separately, aggregate for clicks separately, and then join.
>>>
>>> You could assume the same here. Each dimension aggregate tuple
>>> ({AgeGroup, Gender, Country} in our case) would have a theta sketch
>>> for impressions and another for clicks.
>>
>> I don't see what a "theta sketch for impressions and another for
>> clicks" would mean here. Theta sketches are for doing distinct counts,
>> so I could understand a theta sketch for EventId, counting the number
>> of distinct event ids associated with the AgeGroup/Gender/Country
>> combination, but that doesn't seem to make sense for clicks or
>> impressions. If your original data was {EventId, AgeGroup, Gender,
>> Country, Impressions, Clicks, AdSpend}, it implies to me that
>> Impressions and Clicks would always be 1 (that is, an event is either a
>> single click or a single impression). When you drop the EventId
>> dimension, the natural aggregation for these (as well as for AdSpend)
>> is summation, giving you the total clicks, total impressions (and
>> total AdSpend) for the Age/Gender/Country combination.
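Lee's estimator later in the thread (iterate the Tuple sketch's retained summaries, sum each field, divide by theta) reduces to the scaling step below. The class and numbers are a hypothetical toy, not the DataSketches Tuple API:

```java
public class TupleEstimate {
    // Sum one field over a sketch's retained summaries, then divide by
    // theta to scale the retained sample back up to the population.
    static double estimateField(double[] retainedFieldValues, double theta) {
        double sum = 0.0;
        for (double v : retainedFieldValues) sum += v;
        return sum / theta;
    }

    public static void main(String[] args) {
        // Hypothetical: 4 retained users' AdSpend, sampled at theta = 0.01,
        // i.e. the sketch kept roughly 1% of the distinct users.
        double[] adSpend = {1.25, 0.40, 2.10, 0.75};
        System.out.printf("estimated total AdSpend = %.2f%n",
                estimateField(adSpend, 0.01));
    }
}
```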
>> If you were to generate a Theta sketch, it seems like it would be
>> nonsensical, as all impressions or clicks inserted would be the same
>> value and the theta sketch would always return "1".
>>
>>>> A proposed solution would be to configure a TupleSketch with 3
>>>> fields: (Impressions, Clicks, AdSpend). (This requires a little
>>>> programming. You would need to create classes that implement the
>>>> interfaces Summary, SummaryFactory and SummarySetOperations. There
>>>> are examples of how to do this on the website.)
>>>
>>> I will look into the Tuple sketch and the SummaryFactory and
>>> SummarySetOperations. I did try reading about them, but I cannot say I
>>> understood this completely. What type of sketch would we use for
>>> AdSpend? Would it be Quantile?
>>>
>>>> Then you would process your raw data as follows:
>>>>
>>>> - Partition your raw data into 213 streams defined by the three
>>>>   dimensions:
>>>>   - 10 X AgeGroup
>>>>   - 3 X Gender
>>>>   - 200 X Country
>>>> - Each stream would feed a configured TupleSketch as above (so you
>>>>   need only 213 sketches).
>>>> - These sketches are in the form of a 3D hypercube.
>>>> - Each input tuple would feed 3 sketches, one in each dimension.
>>>
>>> I believe I did something similar. Let me try explaining this with
>>> concrete examples.
>>>
>>> 1. This is the raw data:
>>>    https://drive.google.com/file/d/1zcwZB9KC7cASmM_8mvbPHPQaIeFJNrVw/view?usp=sharing
>>>    (or https://tinyurl.com/3d2r7xjh) for a company in India that
>>>    provides car drivers for hire.
>>>    The data has most columns as categorical variables except the
>>>    first, which is tripId. The schema looks something like this:
>>>
>>>    1. { *tripId*, source, estimated_usage_bins, city, zone,
>>>       zone_demand_popularity, dayType, pickUpHourOfDay, sla,
>>>       booking_type, *tripOutcome* }
>>>    2. tripId, the first column, is distinct per row; I put it into
>>>       theta sketches as shown later.
>>>    3. tripOutcome, the last column, is a classification variable
>>>       which has the values "bad" or "good".
>>>    4. The problem statement is to do data analysis of combinations of
>>>       dimensions that produce an unfavourable result, i.e. where
>>>       tripOutcome=bad / (tripOutcome=bad + tripOutcome=good) is high.
>>>
>>> 2. I was able to populate theta sketches for each dimension and value
>>>    pair (for example "city=Mumbai") and store them as my effective
>>>    data representation.
>>>
>>>    1. With 10 as base, and hence 2^10 nominal entries, the data is
>>>       https://tinyurl.com/bkvz4xrh
>>>    2. With 12 as base, and hence 2^12 nominal entries, the data is
>>>       https://tinyurl.com/f9mzm8ef
>>>    3. With 14 as base, and hence 2^14 nominal entries, the data is
>>>       https://tinyurl.com/2p8kwj3m
>>>    4. The data looks like this:
>>>       dimension=value, Base64-encoded Theta sketch formed from unique
>>>       tripId
>>
>> I'm not sure what you are sketching here. What is the entry being
>> inserted into the sketch? Basically, what are you counting?
>> Normally, I would think of sketching as taking your original data
>> (tripId, source, estimated_usage_bins, city, zone,
>> zone_demand_popularity, dayType, pickUpHourOfDay, sla, booking_type,
>> tripOutcome), picking a column (or columns) whose distinct values you
>> want to count, and generating a new table that contains the dimensions
>> you care about plus that new sketch. In this case, if you are counting
>> trips, the resulting table would look like {source,
>> estimate_usage_bins, city, zone, zone_demand_popularity, dayType,
>> pickupHourOfDay, sla, booking_type, tripOutcome, sketch(tripId)}. You
>> could then report on a variety of combinations by dropping dimensions
>> and aggregating (merging) the sketches.
>>
>>> 1. Next I run an Apriori-like frequent itemset algorithm [ECLAT,
>>>    https://towardsdatascience.com/the-eclat-algorithm-8ae3276d2d17]
>>>    to generate combinations of "dimension=value" with a minimum
>>>    support threshold. The input to this implementation is the files
>>>    above at point 2, e.g. https://tinyurl.com/2p8kwj3m
>>> 2. I produced frequent itemsets for both sides, i.e. tripOutcome=bad
>>>    and tripOutcome=good, and then joined the results. The answers I
>>>    get are listed below.
>>>    1. With input having dimension=value and theta sketches with 2^10
>>>       nominal entries, the answers can be found here:
>>>       https://tinyurl.com/yc3a3rde
>>>    2.
With input having dimension=value and theta sketches with 2^12
>>>       nominal entries, the answers can be found here:
>>>       https://tinyurl.com/5n7jjr32
>>>    3. With input having dimension=value and theta sketches with 2^14
>>>       nominal entries, the answers can be found here:
>>>       https://tinyurl.com/3wrxp4a7
>>> 3. The results show that the 2^14 (16K) nominal-entries sketches give
>>>    immaculate results. There are errors (RMSE) in the results from
>>>    the other sketches.
>>> 4. Sample of results (columns: level of combination, rule for the
>>>    combination, bad count, good count, total, percentage bad):
>>>
>>>    4,booking_type=one_way_trip & city=Pune & sla=Scheduled & source=android,26.0,9.0,35.0,74.0
>>>    3,booking_type=one_way_trip & city=Pune & source=android,26.0,10.0,36.0,72.0
>>>    5,city=Delhi NCR & pickUpHourOfDay=VERYLATE & sla=Scheduled & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,17.0,9.0,26.0,65.0
>>>    4,city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=android & zone_demand_popularity=POPULARITY_INDEX_5,15.0,9.0,24.0,63.0
>>>    5,booking_type=one_way_trip & city=Delhi NCR & pickUpHourOfDay=VERYLATE & source=ios & zone_demand_popularity=POPULARITY_INDEX_4,16.0,10.0,26.0,62.0
>>>    3,booking_type=one_way_trip & sla=Scheduled & zone=205,14.0,9.0,23.0,61.0
>>>
>>> I am aware from one of Lee's recent YouTube videos that theta
>>> sketches work like sets (bit sets) when they are large enough and the
>>> data is small. My use case is fine with a count of 100 being reported
>>> as 80 or 120 (20%) but not much higher.
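For reference, a Theta sketch in estimation mode has a relative standard error of roughly 1/sqrt(k), where k is the number of nominal entries (see the DataSketches accuracy documentation for the exact formula). A quick check of the three configurations used above:

```java
public class ThetaRse {
    // Approximate relative standard error of a Theta sketch in
    // estimation mode: rse ~ 1 / sqrt(k), k = nominal entries = 2^lgK.
    static double rse(int lgK) {
        return 1.0 / Math.sqrt(1 << lgK);
    }

    public static void main(String[] args) {
        for (int lgK : new int[]{10, 12, 14}) {
            System.out.printf("k=2^%d  rse ~ %.3f%%  (+/- %.1f at 1 sd on a count of 100)%n",
                    lgK, 100 * rse(lgK), 100 * rse(lgK));
        }
    }
}
```

At one standard deviation this is about 3.1% for 2^10, 1.6% for 2^12, and 0.8% for 2^14, so the 20% tolerance quoted above is comfortably inside the bounds even for the smallest configuration (intersections, discussed later in the thread, can be much worse).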
>>> I could use the Jaccard formula explained by Lee to get the margin of
>>> error.
>>>
>>> Sorry for the details, but this should help you suggest whether what
>>> I am doing is indeed what tuple sketches do on their own. I am using
>>> update sketches, merging using conjunction. Code is here:
>>> https://tinyurl.com/4pzfsj6v
>>>
>>>> A query would choose the desired coordinates of the 3 dimensions.
>>>> If the query selects more than one coordinate from any one
>>>> dimension, then you would need to first merge the corresponding
>>>> coordinate sketches of that dimension together. A missing dimension
>>>> would mean first merging all coordinate sketches of that dimension
>>>> together. The final result sketch is an Intersection of the
>>>> resulting 3 dimensional sketches.
>>>
>>> Yes, the code here should serve as an example:
>>> https://tinyurl.com/4pzfsj6v
>>>
>>>> With the resulting Tuple sketch you iterate over all the retained
>>>> items (Summaries) and sum each of the 3 fields. Then you divide each
>>>> sum by Theta. The result is the estimate of Impressions, Clicks, and
>>>> AdSpend for the population of users selected by the query. The size
>>>> of that population would be obtained by getEstimate().
>>>
>>> My question remains. Which base sketch type should I use for an
>>> additive metric like Ad-Spend? Would that be a *quantile sketch*?
>>
>> I still think this question doesn't make sense. What are you trying to
>> understand with the sketch? Are you trying to find out the
>> distribution of adspend for a particular set of dimension values?
>> Then that sounds like a quantile sketch. Are you trying to find out
>> how many different values of adspend there were? That's a
>> distinct-counting sketch. Are you trying to find out the most common
>> adspend values? That would be a frequent items sketch.
>>
>> But really, none of those actually seems to match what you described
>> above. What you described above sounded like you were simply trying to
>> calculate the total AdSpend, which just requires a sum, not a sketch.
>>
>>>> Be aware that Intersections, by their nature, can increase error
>>>> significantly (see the website). Try to avoid doing intersections
>>>> with sketch coordinates that have very small populations, if
>>>> possible.
>>>
>>> Yes. However, if I use very high nominal entries like 2^20, shouldn't
>>> this get mitigated?
>>
>> The point of using a sketch is estimation. If you can't tolerate any
>> error, you shouldn't be using sketches in the first place, so don't
>> just assume that you SHOULD use high nominal entries. That said, when
>> using theta sketches and doing intersections, we do sometimes use
>> values like 2^20 for nominal entries, but often it is overkill.
>> Remember, in the case of intersections, you get high error when the
>> intersection size is small compared with the overall set sizes; the
>> cases where you see high relative error are the ones where you are
>> looking at small intersections. In most use cases, a relative error of
>> 1,000% or even 10,000% on an intersection size of 10 when comparing
>> sets of a billion is still a reasonable error given how small the
>> _actual_ value is.
>>
>>>> You can get a sense of how good your results are by utilizing
>>>> getUpperBound() and getLowerBound() (with respect to the user
>>>> population). The bigger that spread is, the worse your estimates
>>>> will be.
>>>
>>> Thanks. This is perfect.
>>> I do not need to use the Jaccard to get my estimate.
>>>
>>>> The other option would be to do the cross-product upfront and create
>>>> 6000 Tuple sketches. In this case each input tuple would feed only
>>>> one sketch, and the query would go to only one sketch.
>>>
>>> Since I am not using SQL queries and have my own unique use case, my
>>> questions are quite different.
>>>
>>>> This can quickly get unwieldy with more dimensions or
>>>> high-cardinality dimensions. It is more accurate because it avoids
>>>> the intersections. It is also possible to do a mix of these two
>>>> approaches, but you would need to think it through carefully.
>>>>
>>>> I hope this helps.
>>>>
>>>> Lee.
>>>>
>>>> On Thu, Apr 28, 2022 at 1:35 AM vijay rajan <vijay.sankar.ra...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> Just like theta sketches are used for distinct count metrics like
>>>>> impressions and clicks, is there a sketch (perhaps quantile?) that
>>>>> can be used for metrics like ad_spend? If so, what are the error
>>>>> bounds?
>>>>>
>>>>> There is a big opportunity that I see in storing very little data
>>>>> in sketches (which I use as sets), which makes retrieval of
>>>>> aggregate/analytic data very fast (although it is approximate).
>>>>>
>>>>> This question is best explained with an example. Say I have a fact
>>>>> table schema as follows:
>>>>>
>>>>> 1. *Age_Groups* is a dimension with 10 bucketed distinct values
>>>>>    {less than 18, 19-25, 26-35, ...}
>>>>> 2. *Gender* is a dimension with 3 distinct values: F, M & Unknown
>>>>> 3. *Country* is a dimension with 200 possible values
>>>>> 4. *Impressions* is a metric which is a long count (perhaps with a
>>>>>    Theta sketch or HLL)
>>>>> 5. *Clicks* is a metric which is a long count (perhaps with a Theta
>>>>>    sketch or HLL)
>>>>> 6.
*Ad-Spend* is a metric which is a double and is summed (*perhaps
>>>>>    with a quantile sketch??*)
>>>>>
>>>>> The maximum possible number of entries in this table would be the
>>>>> cross-product of the cardinalities of the dimensions, which is
>>>>> 10 (Age_Group) x 3 (Gender) x 200 (Country) = 6000. Now instead of
>>>>> storing 6000 records, I could store only (10 + 3 + 200) * 3
>>>>> sketches (*one each for impressions, clicks and Ad-Spend*) = 639
>>>>> sketches and accomplish the group-by queries using the set
>>>>> operations that sketches provide.
>>>>>
>>>>> For the impression metric: one sketch for Gender=F, another sketch
>>>>> for Gender=M, yet another for Gender=Unknown, and so on for the
>>>>> other dimensions as well. Likewise for the click metric. The
>>>>> question is: what sketch would I need for ad-spend?
>>>>>
>>>>> On a side note, I have used Theta sketches in this manner and even
>>>>> implemented ECLAT for frequent itemset computations and association
>>>>> rule mining. An example from another project is below, with theta
>>>>> sketches used for the counts; I do not know how to do the same for
>>>>> a metric like Ad-Spend.
>>>>>
>>>>> Level  Conjunction            Count
>>>>> 2      H10 & MemberGender=F   74
>>>>> 2      M15 & MemberGender=F   83
>>>>> 2      G31 & MemberGender=F   59
>>>>> 2      MemberGender=F & R13   66
>>>>>
>>>>> *In the example above, H10, M15, etc. are International Disease
>>>>> codes for specific diseases:* https://www.aapc.com/codes/code-search/
>>>>>
>>>>> Hope this is a clear representation of the problem.
>>>>>
>>>>> Regards
>>>>> Vijay
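The ECLAT-over-sketches idea running through the thread boils down to this: keep one id set per "dimension=value" pair, and compute the support of a conjunction as the size of the intersection. In the toy below (hypothetical data; exact HashSets stand in for the Theta sketches, which give the same answer approximately):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EclatToy {
    // ECLAT's core step: the support of a conjunction of predicates is
    // the size of the intersection of their id sets.
    static Set<Long> intersect(Set<Long> a, Set<Long> b) {
        Set<Long> out = new HashSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<Long>> index = new HashMap<>();
        // Hypothetical tripId postings per dimension=value pair.
        index.put("city=Pune", new HashSet<>(Arrays.asList(1L, 2L, 3L, 5L, 8L)));
        index.put("sla=Scheduled", new HashSet<>(Arrays.asList(2L, 3L, 5L, 7L)));
        index.put("tripOutcome=bad", new HashSet<>(Arrays.asList(3L, 5L, 7L, 9L)));

        Set<Long> pair = intersect(index.get("city=Pune"), index.get("sla=Scheduled"));
        Set<Long> bad = intersect(pair, index.get("tripOutcome=bad"));
        System.out.println("support(city=Pune & sla=Scheduled) = " + pair.size()); // 3
        System.out.println("bad rate = " + bad.size() + "/" + pair.size());        // 2/3
    }
}
```

With Theta sketches replacing the HashSets, each intersection carries the error behavior Will describes above: fine when the intersection is large relative to the operands, noisy when it is small.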