Perhaps it would be helpful to look at how we measure the error of the actual implementation, to make sure it behaves as the theory predicts. Some information, including a plot, is available here:
http://datasketches.apache.org/docs/Quantiles/KLLSketch.html
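The empirically fitted rank-error formula quoted later in this thread (from the Java KLL source) is easy to reproduce and tabulate on its own. A minimal standalone sketch — the class name is mine, but the constants are taken verbatim from the code comment quoted below in Lee's reply:

```java
// Standalone reproduction of the empirically fitted normalized rank
// error formula from the Java KLL source (constants quoted later in
// this thread). The class name is illustrative, not the library's.
public class KllRankError {

    // pmf == true uses the constants fitted for getPMF/getCDF queries;
    // pmf == false is for single-rank queries.
    public static double getNormalizedRankError(final int k, final boolean pmf) {
        return pmf
            ? 2.446 / Math.pow(k, 0.9433)
            : 2.296 / Math.pow(k, 0.9723);
    }

    public static void main(String[] args) {
        // Tabulate the 99%-confidence normalized rank error for a few k.
        for (int k = 8; k <= 1024; k *= 2) {
            System.out.printf("k=%4d  single-rank: %.4f  pmf: %.4f%n",
                k, getNormalizedRankError(k, false), getNormalizedRankError(k, true));
        }
    }
}
```

Note the near-linear decay in k: doubling k roughly halves the rank error, which is the practical knob behind the accuracy/space trade-off discussed below.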
The characterization code, some measurements, and Matlab scripts to plot the results are available here:
https://github.com/apache/incubator-datasketches-characterization

On Tue, Jun 23, 2020 at 11:30 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:

> Hi Gourav,
>
> Q1. From a user's perspective, can you tell a priori whether the
> constructed sketch falls in the 1% category (bad case), or maybe when the
> user asks a get_quantile query give some sort of indication around
> confidence or error bound for that rank query. As a user relying on this
> algorithm, how can I be sure that I am not in the 1% case? If not, then
> since we don't have any bounds on rank in the bad case, will users be
> skeptical of using it?
>
> Before I answer this question directly, perhaps it will be helpful if I
> give more information about what is known about the error distribution of
> the KLL sketch.
>
> For any fixed universe item y, the estimated rank \hat{R}(y) of y derived
> from the KLL sketch will have essentially Gaussian error.
>
> More precisely, if R(y) is the true rank of y and \hat{R}(y) is the
> estimated rank of y as determined by the sketch, then Err(y) := \hat{R}(y)
> - R(y) is essentially a 0-mean Gaussian with bounded variance.
>
> (More precisely still, Err(y) is a weighted sum of Binomial random
> variables, and this sum is understood to behave very much like a 0-mean
> Gaussian.)
>
> The KLL paper shows that the standard deviation of Err(y) is low---at
> most roughly epsilon * n, where n is the stream length. This is low enough
> that one can conclude that, with probability at least 1-delta over the
> randomness used within the generation of the sketch, |Err(y)| is at most
> epsilon * n.
>
> The KLL paper itself states theorems of the form "With probability at
> least 1-delta, |Err(y)| is at most epsilon * n". This kind of statement
> does not directly refer to Gaussianity of the error distribution for the
> estimated rank of y.
> But even though the statement in quotes doesn't assert Gaussianity,
> Gaussianity indeed holds and was actually critical to proving the
> statement in quotes.
>
> To be clear, if y and y' are two different items, then the estimates
> \hat{R}(y) and \hat{R}(y') returned by a KLL sketch are *not* independent.
> Both estimates are roughly Gaussian-distributed individually, but these
> two Gaussians are correlated.
>
> Finally (and related to the above comment about estimates for different y
> and y' being correlated), the KLL paper concludes that Err(y) is likely to
> be low (at most epsilon*n) for all items y simultaneously via what is
> called an "epsilon-net argument". Essentially this argument says that, so
> long as the sketch returns accurate estimates for ~1/epsilon items of
> "evenly spaced ranks" (i.e., ranks epsilon*n, 2*epsilon*n, 3*epsilon*n,
> ..., n), then the sketch will return accurate estimates for all possible
> queries simultaneously. By what's known as a union bound, the probability
> that the sketch is inaccurate for any of these 1/epsilon items is at most
> a 1/epsilon factor larger than the probability it is inaccurate on any
> single fixed query. The space usage of the KLL sketch, namely
> O((1/epsilon) * sqrt{log(1/delta)}), grows sufficiently slowly with the
> parameter delta that decreasing delta by a factor of epsilon (to account
> for the aforementioned probability increase that comes from the union
> bound) has a negligible effect on the algorithm's space usage, at least so
> long as delta is less than epsilon.
>
> Now, regarding your question:
>
> Let's imagine the user just asks a single query, say, "give me the
> estimated rank \hat{R}(y) of some specific item y that I care about".
>
> Then the sketch returns the estimate \hat{R}(y), which as I've explained
> is essentially a Gaussian random variable with expected value equal to the
> true rank R(y), and standard deviation at most epsilon*n.
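Justin's description of Err(y) as a weighted sum of centered Binomial random variables can be illustrated with a toy Monte Carlo simulation. This is emphatically *not* the real KLL sketch — it is a simplified model in which each compaction at level h contributes an independent ±(2^h)/2 coin-flip error to the rank estimate, and the number of levels and compactions per level are made-up assumptions — but it exhibits the zero-mean, bounded-variance behavior described above:

```java
import java.util.Random;

// Toy model (NOT the real KLL sketch): the rank error is modeled as a
// weighted sum of fair +/-1 coin flips, one per compaction, where a
// compaction at level h carries weight 2^h. This mimics the "weighted
// sum of centered Binomials" that Justin describes; the level count
// and compactions-per-level are arbitrary illustrative choices.
public class ToyKllError {

    // Draw one realization of the modeled rank error.
    public static double sampleError(Random rnd, int levels, int compactionsPerLevel) {
        double err = 0.0;
        for (int h = 0; h < levels; h++) {
            double weight = 1 << h;   // items at level h have weight 2^h
            for (int c = 0; c < compactionsPerLevel; c++) {
                err += (rnd.nextBoolean() ? 1 : -1) * weight / 2.0;
            }
        }
        return err;
    }

    // Returns {empirical mean, empirical standard deviation} over many trials.
    public static double[] meanAndStd(long seed, int trials) {
        Random rnd = new Random(seed);
        double sum = 0, sumSq = 0;
        for (int t = 0; t < trials; t++) {
            double e = sampleError(rnd, 8, 16);
            sum += e;
            sumSq += e * e;
        }
        double mean = sum / trials;
        double var = sumSq / trials - mean * mean;
        return new double[] { mean, Math.sqrt(var) };
    }

    public static void main(String[] args) {
        double[] ms = meanAndStd(42L, 100_000);
        System.out.printf("empirical mean = %.3f, std = %.3f%n", ms[0], ms[1]);
    }
}
```

By the central limit theorem this weighted sum is approximately Gaussian, which is exactly the structure the "essentially Gaussian error" claim rests on.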
> If the estimate happens to wind up way out in the tail of that random
> variable, the error will be large, and there's no way for the user to
> recognize that situation. That is, the user can completely understand the
> *distribution* of the error in the estimate, but won't have any way of
> knowing, for any given draw from that distribution, what the error
> actually is.
>
> That said, there are standard techniques to increase confidence in the
> estimates, though these come at a price. For example, the user could
> simultaneously compute several (say 5 for concreteness) different
> independent copies of the sketch, and answer any query by taking the
> median of the 5 estimates returned by the 5 independent sketches. It is
> known that the probability the median of the estimates is "bad" is
> significantly less than the probability any individual estimate is "bad".
>
> On Tue, Jun 23, 2020 at 1:20 AM leerho <lee...@gmail.com> wrote:
>
>> Gourav,
>>
>> I will defer the answer to Q1 to either Justin or Edo, but the rest I
>> can answer.
>>
>> Q2: 99% was somewhat arbitrary and a compromise. It took a huge amount
>> of carefully constructed characterization studies to establish the 99%
>> confidence level, and refining it to a much finer level was left as a
>> future task.
>>
>> Q3: The minimum width set to 8 is also a speed/space compromise. The
>> paper suggests a minimum of 2, but with only 2 values per level the
>> process resembles random sampling, which the paper also discusses.
>> However, installing a sampler at the tail was going to add quite a bit
>> of complexity to the code, especially for merging. And frankly, at the
>> time, we were not confident that merging with the sampler at the tail
>> would be as robust. Again, this was left to a future algorithm that
>> could be proven to work in practical situations. Also, a long tail of
>> compactors of size 2 is computationally very wasteful and slow.
>> By widening the tail to 8 we were able to recover nearly all the speed
>> of the older quantiles algorithm with most of the benefit of the size
>> reduction of KLL. This is another example of an engineering design
>> decision where we were particularly concerned about speed performance.
>> This was not arbitrary; we did extensive characterization studies of
>> various widths (2, 4, 6, 8, 10, 12), and 8 seemed to be the happy
>> medium between speed, space, and simplicity of implementation.
>>
>> Q4. This is an easy question, but clearly not obvious. The whole idea
>> of lazy promotion of the compactors is to keep as much data as possible
>> at the low levels, as they have the most accurate representation of the
>> input stream. If you focus on promoting the higher levels first, you
>> end up degrading your accuracy faster.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Mon, Jun 22, 2020 at 9:43 PM Gourav Kumar <gourav1...@gmail.com>
>> wrote:
>>
>>> Thanks Lee,
>>>
>>> This clears a lot of things. Although, I have a few more questions on
>>> this.
>>>
>>> Q1. From a user's perspective, can you tell a priori whether the
>>> constructed sketch falls in the 1% category (bad case), or maybe when
>>> the user asks a get_quantile query give some sort of indication around
>>> confidence or error bound for that rank query. As a user relying on
>>> this algorithm, how can I be sure that I am not in the 1% case? If
>>> not, then since we don't have any bounds on rank in the bad case, will
>>> users be skeptical of using it?
>>>
>>> Q2. How did you arrive at the confidence value of 99%? The paper
>>> mentions some formula for delta, but it's not clear to me if it will
>>> translate to 99%.
>>>
>>> Q3. Why is min width's (m_) default value set to 8? For compaction to
>>> be meaningful, in my opinion, having a min width value greater than 1
>>> would have been sufficient. The other reason which comes to my mind is
>>> faster convergence and a smaller number of compactions.
>>> In that case why 8, and why not any value larger than that?
>>>
>>> Q4. Also, in compaction, you always pick the first level which stores
>>> more elements than its capacity. Would guarantees be affected, or will
>>> it have any effect on the speed of the algorithm, if you change this
>>> strategy to, let's say, picking the level which overshoots its
>>> capacity by the largest amount? In my opinion, this strategy would
>>> free up more space and should lead to a smaller number of compactions,
>>> and hence lower empirical error.
>>>
>>> Thank You,
>>> Gourav
>>>
>>> On Tue, 23 Jun 2020 at 07:56, Alexander Saydakov
>>> <sayda...@verizonmedia.com> wrote:
>>>
>>>> Adding the original poster just in case he is not subscribed to the
>>>> list.
>>>>
>>>> On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
>>>>
>>>>> I see a typo: what I called the Omega relation is actually Omicron
>>>>> (big "Oh") :)
>>>>>
>>>>> On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler
>>>>> <justin.r.tha...@gmail.com> wrote:
>>>>>
>>>>>> Your response looks great to me.
>>>>>>
>>>>>> I suppose Gourav will respond if he is still confused on any
>>>>>> issues. And hopefully Edo will read this over at some point and
>>>>>> double-check both of our work :-)
>>>>>>
>>>>>> On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Gourav, welcome to this forum!
>>>>>>>
>>>>>>> I want to make sure you have access to and have read the code
>>>>>>> documentation for the KLL sketch, in addition to the papers.
>>>>>>> Although the code documentation exists for both Java and C++, it
>>>>>>> is a little easier to access the Javadocs, as they are accessible
>>>>>>> from the website. There is a long section about the KLL sketch
>>>>>>> <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html>
>>>>>>> and its implementation that may answer some of your questions.
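The lazy-promotion policy from Lee's Q4 answer above (always compact the *lowest* level that exceeds its capacity, so that data lingers at the accurate low levels) can be sketched with a toy level-size model. This is a deliberate simplification of the real compaction logic — the capacities and the promote-half step are illustrative only:

```java
import java.util.Arrays;

// Toy model of KLL-style lazy compaction order (NOT the real code):
// levels hold item counts; a compaction at level h removes its items
// and promotes half of them (with doubled weight) to level h+1. The
// policy illustrated is the one from Lee's Q4 answer: always compact
// the LOWEST level that exceeds its capacity.
public class LazyCompaction {

    // Compact the lowest over-capacity level, if any. Returns the level
    // index compacted, or -1 if nothing was over capacity.
    public static int compactOnce(int[] counts, int[] capacities) {
        for (int h = 0; h < counts.length; h++) {
            if (counts[h] > capacities[h]) {
                int promoted = counts[h] / 2;          // half survive, weight doubles
                counts[h] = counts[h] - 2 * promoted;  // 0 or 1 leftover item
                if (h + 1 < counts.length) {
                    counts[h + 1] += promoted;
                }
                return h;
            }
        }
        return -1;
    }

    // Total weighted item count (level h items have weight 2^h); the toy
    // compaction above preserves this exactly when the promoted items
    // land in an existing higher level.
    public static long weightedCount(int[] counts) {
        long total = 0;
        for (int h = 0; h < counts.length; h++) {
            total += (long) counts[h] << h;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = { 12, 9, 3, 0 };
        int[] capacities = { 8, 8, 8, 8 };
        System.out.println("before: " + Arrays.toString(counts));
        int level = compactOnce(counts, capacities);
        System.out.println("compacted level " + level + " -> " + Arrays.toString(counts));
    }
}
```

In this toy, level 1 is also over capacity, but the policy still compacts level 0 first — exactly the "keep data at the low levels as long as possible" behavior Lee describes.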
>>>>>>> It is important to emphasize that the error of the quantile
>>>>>>> sketch is only relevant to the rank and not the value. It is not
>>>>>>> clear from your question which you are referring to.
>>>>>>>
>>>>>>> 1. I asked Justin Thaler, one of our scientists, how to better
>>>>>>> explain the error "guarantee" of the sketch. Hopefully, he can
>>>>>>> respond here if what I write is not totally correct :)
>>>>>>>
>>>>>>> When you feed a stream of values into a KLL (or Quantiles) sketch,
>>>>>>> the result is actually a set of values (kept internally by the
>>>>>>> sketch) that have been "sampled" from the stream using a random
>>>>>>> process. Let's call this set a "*result set*". When we issue a
>>>>>>> query against the sketch, it analyzes this *result set* and
>>>>>>> returns a value (or a rank, depending on the query).
>>>>>>>
>>>>>>> The guarantee goes something like this:
>>>>>>>
>>>>>>> With 99% probability, the *result set* of a sketch will be a set
>>>>>>> such that a *getQuantile(rank)* query from that *result set* will
>>>>>>> produce an *estimated quantile (or value)* such that the *true
>>>>>>> rank of the true value* will be within +/- the absolute fractional
>>>>>>> rank error returned by the function getNormalizedRankError(k,
>>>>>>> pmf). If the *result set* is a good one, i.e., one of the good
>>>>>>> 99%-of-the-time sketches, then the guarantee is that *all possible
>>>>>>> queries* will fall within the +/- rank error interval defined
>>>>>>> above.
>>>>>>>
>>>>>>> Because of the random sampling process used by the sketch, there
>>>>>>> is a 1% chance that the *result set* is bad. We don't know how
>>>>>>> bad: it could mess up only one query, or it could mess up all
>>>>>>> queries from that bad *result set*.
>>>>>>>
>>>>>>> This applies to both of your Q1 scenarios, given the caveat above,
>>>>>>> and is actually a stronger guarantee.
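Lee's notion of a good "result set" — one for which *all* queries land within the ±epsilon rank interval — can be illustrated with a toy sketch that keeps a uniform random sample of the stream (plain random sampling, far cruder than KLL). For a good sample, the maximum rank error over a whole grid of queries stays small; the sample size, grid size, and threshold below are arbitrary illustrative choices:

```java
import java.util.Arrays;
import java.util.Random;

// Toy "result set": a uniform random sample of the stream (NOT a KLL
// sketch). The rank of a query item is estimated by scaling its rank
// within the sample. We then check the maximum rank error over a grid
// of queries, mirroring the "all possible queries" guarantee.
public class ToyResultSet {

    // Estimate the rank of q (count of stream items <= q) from a sorted sample.
    public static double estimateRank(double[] sortedSample, double q, long n) {
        int lo = 0, hi = sortedSample.length;   // binary search: first index > q
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedSample[mid] <= q) lo = mid + 1; else hi = mid;
        }
        return (double) lo / sortedSample.length * n;
    }

    // Max |estimated - true| rank error over a grid of query points, for
    // a stream consisting of the values 0..n-1 (so the true rank of q is
    // floor(q) + 1).
    public static double maxGridError(long seed, int n, int sampleSize, int gridPoints) {
        Random rnd = new Random(seed);
        double[] sample = new double[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = rnd.nextInt(n);   // uniform sample (with replacement)
        }
        Arrays.sort(sample);
        double maxErr = 0;
        for (int g = 1; g <= gridPoints; g++) {
            double q = (double) n * g / gridPoints - 1;
            double trueRank = Math.floor(q) + 1;
            maxErr = Math.max(maxErr, Math.abs(estimateRank(sample, q, n) - trueRank));
        }
        return maxErr;
    }

    public static void main(String[] args) {
        double maxErr = maxGridError(7L, 100_000, 4096, 100);
        System.out.printf("max rank error over grid = %.0f (n = 100000)%n", maxErr);
    }
}
```

The rare "bad result set" corresponds to an unlucky draw of the sample where this maximum spikes — which, as Lee says, the user cannot detect after the fact.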
>>>>>>> There is no guarantee on the accuracy of values; however, you can
>>>>>>> get a sense of the range of values along the value axis that
>>>>>>> corresponds to the confidence interval range along the rank axis.
>>>>>>> For example: suppose your rank error is 1% and getQuantile(0.5)
>>>>>>> returns some value. Now query getQuantile(0.5 - 0.01) and
>>>>>>> getQuantile(0.5 + 0.01), and that will give the corresponding
>>>>>>> range of values.
>>>>>>>
>>>>>>> Q2. Your Q2 is confusing, because I think you are conflating the
>>>>>>> accuracy (1/epsilon) with the confidence (1/delta). The Omega
>>>>>>> equation is O(sqrt(log(1/delta))/epsilon).
>>>>>>> There is actually a code comment in the Java version that states:
>>>>>>>
>>>>>>>> // constants were derived as the best fit to 99 percentile
>>>>>>>> // empirically measured max error in thousands of trials
>>>>>>>> public static double getNormalizedRankError(final int k,
>>>>>>>>     final boolean pmf) {
>>>>>>>>   return pmf
>>>>>>>>       ? 2.446 / pow(k, 0.9433)
>>>>>>>>       : 2.296 / pow(k, 0.9723);
>>>>>>>> }
>>>>>>>
>>>>>>> The hidden constant values behind the equation above are not
>>>>>>> discussed in the paper. So we ran many thousands of trials to
>>>>>>> empirically determine what those constants are, and the result of
>>>>>>> that study is the equations given.
>>>>>>>
>>>>>>> Q3. Space is initially limited to 3K, but for very long streams
>>>>>>> the space will grow very slowly ... something along the lines of
>>>>>>> *log_base8(N/k)*. We need to put growth tables on the web site
>>>>>>> like we did for the older QuantilesSketch; we just haven't gotten
>>>>>>> around to it.
>>>>>>>
>>>>>>> You are correct that to quickly answer forward queries (like Q(r))
>>>>>>> we build a temporary, internal aux table. If you have multiple
>>>>>>> queries from the same sketch, this speeds up the query process
>>>>>>> considerably.
>>>>>>> For all the other queries (R(q), CDF, PDF), this aux table is not
>>>>>>> needed. Also, when the sketch is serialized, this aux table is not
>>>>>>> needed, as it can be reconstructed at any time. You are correct
>>>>>>> that we could document this temporary size increase better.
>>>>>>> However, the speed gain is considerable, and most of our users
>>>>>>> prefer to have the speed gain at query time and don't worry so
>>>>>>> much about the temporary allocation of the aux table. This is
>>>>>>> clearly a speed/space optimization we made as part of the
>>>>>>> engineering of the design.
>>>>>>>
>>>>>>> I hope this helps!
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Lee.
>>>>>>>
>>>>>>> On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar
>>>>>>> <gourav1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>> Hope you are well in these COVID times.
>>>>>>>>
>>>>>>>> I was going through the implementation of the KLL algorithm based
>>>>>>>> on the paper "Streaming Quantiles Algorithms with Small Space and
>>>>>>>> Update Time" and wanted to understand it more, to understand
>>>>>>>> applications in databases (cpp version on apache github). I had a
>>>>>>>> few questions related to the paper which I couldn't get my head
>>>>>>>> around; I would appreciate some help on these:
>>>>>>>>
>>>>>>>> Q1. What is the meaning of confidence in the KLL algorithm?
>>>>>>>> Does it mean that once a sketch has been constructed, we start
>>>>>>>> asking quantile queries, and confidence will tell how many of
>>>>>>>> these queries will fall under the normalized error bound?
>>>>>>>> OR
>>>>>>>> If I construct the sketch n times on any data and ask the same
>>>>>>>> query q each time, confidence will tell in how many cases out of
>>>>>>>> these n times q will fall under the normalized error bounds?
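Gourav's two readings of "confidence" can be separated empirically with a toy randomized estimator (again plain random sampling, not KLL). Justin's answer above makes clear the 99% statement is of the second kind — over the randomness used in constructing the sketch, for a fixed query. The stream size, sample size, and error bound below are illustrative assumptions:

```java
import java.util.Random;

// Toy experiment for the second reading of "confidence": build the
// randomized summary many times, ask the SAME rank query each time,
// and measure the fraction of constructions whose error stays within
// the bound. (Plain random sampling stands in for KLL here.)
public class ConfidenceExperiment {

    // Fraction of independent constructions for which the estimated
    // rank of a fixed query lands within +/- bound of the true rank.
    public static double fractionWithinBound(long seed, int builds, int n,
                                             int sampleSize, double bound) {
        Random rnd = new Random(seed);
        int query = n / 2;              // fixed query item; true rank = n/2 + 1
        double trueRank = query + 1;
        int good = 0;
        for (int b = 0; b < builds; b++) {
            int atOrBelow = 0;          // sampled items <= query
            for (int i = 0; i < sampleSize; i++) {
                if (rnd.nextInt(n) <= query) atOrBelow++;
            }
            double estRank = (double) atOrBelow / sampleSize * n;
            if (Math.abs(estRank - trueRank) <= bound) good++;
        }
        return (double) good / builds;
    }

    public static void main(String[] args) {
        int n = 100_000, m = 1024;
        // roughly 3 standard deviations of the sampling error: n * 3 * 0.5 / sqrt(m)
        double bound = n * 3 * 0.5 / Math.sqrt(m);
        double frac = fractionWithinBound(123L, 2000, n, m, bound);
        System.out.printf("fraction of builds within bound: %.3f%n", frac);
    }
}
```

The fraction of builds that stay within the bound tracks the confidence level; for any single build, the user still cannot tell whether it is one of the unlucky ones — which is exactly the point Justin makes above.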
>>>>>>>> I tested the apache version with the deterministic bits that you
>>>>>>>> have under the KLL_VALIDATION flag. There, for many (different)
>>>>>>>> queries, I got values beyond the error bounds, and the number of
>>>>>>>> such queries was way beyond the 99% confidence. The deterministic
>>>>>>>> bits followed a pattern (010101...), and the random bit
>>>>>>>> generation could also at some point generate such a pattern. What
>>>>>>>> do you do in such a case, where the error bounds and confidence
>>>>>>>> both fail to hold?
>>>>>>>>
>>>>>>>> Q2. Though the paper mentions a relation between the error bound
>>>>>>>> and the buffer size k, as O(sqrt(log e) / e), what will be the
>>>>>>>> actual relation between these two, with respect to the constants
>>>>>>>> as well, when we try to get a practical application of the
>>>>>>>> algorithm? In your implementation you have set the normalized
>>>>>>>> error equal to 2.296/k^0.9723. I want to understand how you got
>>>>>>>> to this value.
>>>>>>>>
>>>>>>>> Q3. Space taken by the sketch is bounded by 3K, but the algorithm
>>>>>>>> will take more memory for storing weights to answer quantiles
>>>>>>>> (after sorting across levels at the end of sketch creation;
>>>>>>>> otherwise the bounds might not hold). Why are these not included
>>>>>>>> in the memory taken by the sketch? A practical application of the
>>>>>>>> paper will need this memory, and additional tmp storage space as
>>>>>>>> well during CDF construction, to answer queries.
>>>>>>>>
>>>>>>>> Appreciate your time.
>>>>>>>> Thank You,
>>>>>>>> Gourav Kumar
>>>
>>> --
>>> Regards,
>>> Gourav
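Justin's suggestion earlier in the thread — run several independent sketches and answer with the median of their estimates — is also easy to check in a toy setting. Here each "sketch estimate" is modeled as a Gaussian around the true rank (an assumption justified by the error-distribution discussion above); the 1.5-sigma threshold defining a "bad" estimate is arbitrary:

```java
import java.util.Arrays;
import java.util.Random;

// Toy check of the median-of-independent-sketches trick: each estimate
// is modeled as Gaussian around the true rank, per the thread's error
// discussion. An estimate is "bad" if it misses by more than 1.5 sigma.
// The median of 5 independent estimates should be bad far less often
// than a single estimate.
public class MedianTrick {

    // Returns {single-estimate failure rate, median-of-5 failure rate}.
    public static double[] failureRates(long seed, int trials, double sigma) {
        Random rnd = new Random(seed);
        double threshold = 1.5 * sigma;
        int singleBad = 0, medianBad = 0;
        for (int t = 0; t < trials; t++) {
            double[] est = new double[5];
            for (int i = 0; i < 5; i++) {
                est[i] = rnd.nextGaussian() * sigma;   // error of sketch i
            }
            if (Math.abs(est[0]) > threshold) singleBad++;
            Arrays.sort(est);
            if (Math.abs(est[2]) > threshold) medianBad++;  // median of 5
        }
        return new double[] { (double) singleBad / trials,
                              (double) medianBad / trials };
    }

    public static void main(String[] args) {
        double[] rates = failureRates(99L, 100_000, 1.0);
        System.out.printf("single bad: %.3f, median-of-5 bad: %.3f%n",
            rates[0], rates[1]);
    }
}
```

The price, as Justin notes, is 5x the space and update cost — the median must miss on at least 3 of the 5 independent estimates, which is why its failure probability drops so sharply.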