Perhaps it would be helpful to look at how we measure the error of the
actual implementation, to make sure it behaves as we expect according to
the theory. Some information, including a plot, is available here:
http://datasketches.apache.org/docs/Quantiles/KLLSketch.html

The characterization code, some measurements, and the Matlab scripts used
to plot the results are available here:

https://github.com/apache/incubator-datasketches-characterization

On Tue, Jun 23, 2020 at 11:30 AM Justin Thaler <justin.r.tha...@gmail.com>
wrote:

> Hi Gourav,
>
> > Q1. From a user's perspective, can you tell a priori whether the sketch
> constructed falls in the 1% category (the bad case),
>       or, when the user asks a get_quantile query, give some sort of
> indication of the confidence or error bound for that rank query?
>        As a user, when I am relying on this algorithm, how can I be sure
> that I am not in the 1% case?
>        If not, then since we don't have any bounds on rank in the bad case,
> won't users be skeptical about using it?
>
> Before I answer this question directly, perhaps it will be helpful if I
> give more information about what is known about the error distribution of
> the KLL sketch.
>
> For any fixed universe item y, the estimated rank \hat{R}(y) of y derived
> from the KLL sketch will have essentially Gaussian error.
>
> More precisely, if R(y) is the true rank of y and \hat{R}(y) is the
> estimated rank of y as determined by the sketch, then Err(y) := \hat{R}(y)
> - R(y) is essentially a 0-mean Gaussian with bounded variance.
>
> (More precisely still, Err(y) is a weighted sum of Binomial random
> variables, and this sum is understood to behave very much like a 0-mean
> Gaussian).
>
> The KLL paper shows that the standard deviation of Err(y) is low---at most
> roughly epsilon * n, where n is stream length. This is low enough that one
> can conclude that with probability at least 1-delta over the randomness
> used within the generation of the sketch, |Err(y)| is at most epsilon * n.
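>
> In symbols (this is just a restatement of the above, not a new claim;
> sigma denotes the standard deviation of Err(y)):
>
>   Err(y) := \hat{R}(y) - R(y)      (essentially a 0-mean Gaussian)
>   sigma <= (roughly) epsilon * n
>   Pr[ |Err(y)| > epsilon * n ] <= delta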
>
> The KLL paper itself states theorems of the form "With probability at
> least 1-delta, |Err(y)| is at most epsilon * n". This kind of statement
> does not directly refer to Gaussianity of the error distribution for the
> estimated rank of y. But even though the statement in quotes doesn't assert
> Gaussianity, Gaussianity indeed holds and was actually critical to proving
> the statement in quotes.
>
> To be clear, if y and y' are two different items, then the estimates
> \hat{R}(y) and \hat{R}(y') returned by a KLL sketch are *not* independent.
> Both estimates are roughly Gaussian-distributed individually, but these two
> Gaussians are correlated.
>
> Finally (and related to the above comment about estimates for different y
> and y' being correlated), the KLL paper concludes that Err(y) is likely to
> be low (at most epsilon*n) for all items y simultaneously via what is
> called an "epsilon-net argument". Essentially, this argument says that so
> long as the sketch returns accurate estimates for ~1/epsilon items of
> "evenly spaced ranks" (i.e., ranks epsilon*n, 2*epsilon*n, 3*epsilon*n,
> ..., n), the sketch will return accurate estimates for all possible
> queries simultaneously. By what's known as a union bound, the probability
> that the sketch is inaccurate for any of these 1/epsilon items is at most a
> 1/epsilon factor larger than the probability it is inaccurate on any single
> fixed query. The space usage of the KLL sketch, namely O((1/epsilon) *
> sqrt{log(1/delta)}), grows sufficiently slowly with the parameter delta
> that decreasing delta by a factor of epsilon (to account for the
> aforementioned probability increase that comes from the union bound) has a
> negligible effect on the algorithm's space usage, at least so long as delta
> is less than epsilon.
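>
> As a worked version of that accounting: if a single fixed query fails
> with probability at most delta', then by the union bound
>
>   Pr[ any of the ~1/epsilon evenly spaced queries fails ]
>       <= (1/epsilon) * delta',
>
> so choosing delta' = epsilon * delta makes the simultaneous failure
> probability at most delta, while the space bound only changes from
> O((1/epsilon) * sqrt{log(1/delta)}) to
> O((1/epsilon) * sqrt{log(1/(epsilon * delta))}), which is a negligible
> increase as long as delta is less than epsilon.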
>
> Now, regarding your question:
>
> Let's imagine the user just asks a single query, say, "give me the
> estimated rank \hat{R}(y) of some specific item y that I care about".
>
> Then the sketch returns the estimate \hat{R}(y), which as I've explained
> is essentially a Gaussian random variable with expected value equal to the
> true rank R(y), and standard deviation at most epsilon*n.
>
> If the estimate happens to wind up way out in the tail of that random
> variable, the error will be large, and there's no way for the user to
> recognize that situation. That is, the user can completely understand the
> *distribution* of the error in the estimate, but won't have any way of
> knowing for any draw from that distribution what the error actually is.
>
> That said, there are standard techniques to increase confidence in the
> estimates, though these come at a price. For example, the user could
> simultaneously compute several (say 5 for concreteness) different
> independent copies of the sketch, and answer any query by taking the median
> of the 5 estimates returned by the 5 independent sketches. It is known that
> the probability the median of the estimates is "bad" is significantly less
> than the probability any individual estimate is "bad".
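>
> A minimal sketch of that trick in Java (plain code, deliberately not tied
> to any particular sketch API; the estimates are assumed to come from
> independently built sketches):
>
> import java.util.Arrays;
>
> public final class MedianTrick {
>     // Given several independent estimates of the same rank, return
>     // their median. By a standard Chernoff-bound argument, the median
>     // is "bad" with probability much smaller than any single estimate,
>     // at the price of building and updating every sketch.
>     public static double medianOfEstimates(final double[] estimates) {
>         final double[] sorted = estimates.clone();
>         Arrays.sort(sorted);
>         final int n = sorted.length;
>         return (n % 2 == 1)
>             ? sorted[n / 2]
>             : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
>     }
> }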
>
> On Tue, Jun 23, 2020 at 1:20 AM leerho <lee...@gmail.com> wrote:
>
>> Gourav,
>>
>> I will defer the answer to Q1 to either Justin or Edo, but the rest I can
>> answer.
>>
>> Q2: 99% was somewhat arbitrary and a compromise.  It took a huge amount
>> of carefully constructed characterization studies to establish the 99%
>> confidence level, and refining it to a much finer level was left as a
>> future task.
>>
>> Q3: The minimum width of 8 is also a speed / space compromise.  The
>> paper suggests a minimum of 2, but with only 2 values per level the
>> process resembles random sampling, which the paper also discusses.
>> However, installing a sampler at the tail was going to add quite a bit
>> of complexity to the code, especially for merging.  And frankly, at the
>> time, we were not confident that merging with the sampler at the tail
>> would be as robust.  Again, this was left to a future algorithm that
>> could be proven to work in practical situations.  Also, a long tail of
>> compactors of size 2 is computationally very wasteful and slow.  By
>> widening the tail to 8 we were able to recover nearly all the speed of
>> the older quantiles algorithm with most of the benefit of the size
>> reduction of KLL.  This is another example of an engineering design
>> decision where we were particularly concerned about speed performance.
>> It was not arbitrary: we did extensive characterization studies of
>> various widths (2, 4, 6, 8, 10, 12), and 8 seemed to be the happy
>> medium between speed, space, and simplicity of implementation.
>>
>> Q4.  This is an easy question, but the answer is clearly not obvious.
>> The whole idea of lazy promotion of the compactors is to keep as much
>> data as possible at the low levels, as they have the most accurate
>> representation of the input stream.  If you focus on promoting the
>> higher levels first, you end up degrading your accuracy faster.
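>>
>> As a minimal illustration (with hypothetical field names, not the
>> actual library code), the lazy strategy just compacts the lowest
>> overfull level:
>>
>> // Return the lowest level whose item count exceeds its capacity,
>> // or -1 if none does.  Compacting the lowest overfull level first
>> // keeps the most accurate (low-level) items around the longest.
>> static int levelToCompact(final int[] counts, final int[] capacities) {
>>     for (int lvl = 0; lvl < counts.length; lvl++) {
>>         if (counts[lvl] > capacities[lvl]) {
>>             return lvl;
>>         }
>>     }
>>     return -1;
>> }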
>>
>> Cheers,
>>
>> Lee.
>>
>> On Mon, Jun 22, 2020 at 9:43 PM Gourav Kumar <gourav1...@gmail.com>
>> wrote:
>>
>>> Thanks Lee,
>>>
>>> This clears a lot of things.
>>> Although, I have few more questions on this.
>>>
>>> Q1. From a user's perspective, can you tell a priori whether the sketch
>>> constructed falls in the 1% category (the bad case),
>>>       or, when the user asks a get_quantile query, give some sort
>>> of indication of the confidence or error bound for that rank query?
>>>        As a user, when I am relying on this algorithm, how can I be sure
>>> that I am not in the 1% case?
>>>        If not, then since we don't have any bounds on rank in the bad case,
>>> won't users be skeptical about using it?
>>>
>>> Q2. How did you arrive at the confidence value of 99%?
>>>       The paper mentions some formula for delta, but it's not clear to
>>> me if it will translate to 99%.
>>>
>>> Q3. Why is the min width's (m_) default value set to 8?
>>>        For compaction to be meaningful, in my opinion having a min width
>>> value greater than 1 would have been sufficient.
>>>        The other reason which comes to my mind is faster convergence and
>>> a smaller number of compactions.
>>>        In that case, why 8 and not some even larger value?
>>>
>>> Q4. Also, in compaction you always pick the first level that stores
>>> more elements than its capacity.
>>>        Would the guarantees be affected, or would it have any effect on
>>> the speed of the algorithm, if you changed this strategy to,
>>>         let's say, picking the level which overshoots its capacity by the
>>> largest amount?
>>>         In my opinion, this strategy would free up more space and should
>>> lead to a smaller number of compactions, and hence decrease the
>>> empirically measured error.
>>>
>>> Thank You,
>>> Gourav
>>>
>>> On Tue, 23 Jun 2020 at 07:56, Alexander Saydakov <
>>> sayda...@verizonmedia.com> wrote:
>>>
>>>> Adding the original poster just in case he is not subscribed to the list
>>>>
>>>> On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
>>>>
>>>>> I see a typo: What I called the Omega relation is actually Omicron
>>>>> (big "Oh")  :)
>>>>>
>>>>>
>>>>> On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <
>>>>> justin.r.tha...@gmail.com> wrote:
>>>>>
>>>>>> Your response looks great to me.
>>>>>>
>>>>>> I suppose Gourav will respond if he is still confused on any issues.
>>>>>> And hopefully Edo will read this over at some point and double-check both
>>>>>> of our work :-)
>>>>>>
>>>>>> On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Gourav, welcome to this forum!
>>>>>>>
>>>>>>> I want to make sure you have access to and have read the code
>>>>>>> documentation for the KLL sketch in addition to the papers.  Although
>>>>>>> the code documentation exists for both Java and C++, it is a little
>>>>>>> easier to reach the Javadocs, as they are linked from the website.
>>>>>>> There is a long section about the KLL sketch
>>>>>>> <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html>
>>>>>>> and its implementation that may answer some of your questions.
>>>>>>>
>>>>>>> It is important to emphasize that the error of the quantile sketch
>>>>>>> is only relevant to the rank and not the value.  It is not clear from
>>>>>>> your question which one you are referring to.
>>>>>>>
>>>>>>> 1. I asked Justin Thaler, one of our scientists, how to better
>>>>>>> explain the error "guarantee" on the sketch.  Hopefully, he can
>>>>>>> respond here if what I write is not totally correct :)
>>>>>>>
>>>>>>> When you feed a stream of values into a KLL (or Quantiles) sketch,
>>>>>>> the result is actually a set of values (kept internally by the
>>>>>>> sketch) that have been "sampled" from the stream using a random
>>>>>>> process.  Let's call this set a "*result set*".  When we issue a
>>>>>>> query against the sketch, it analyzes this *result set* and returns
>>>>>>> a value (or a rank, depending on the query).
>>>>>>>
>>>>>>> The guarantee goes something like this:
>>>>>>>
>>>>>>> With 99% probability, the *result set* of a sketch will be a set
>>>>>>> such that a *getQuantile(rank)* query against that *result set* will
>>>>>>> produce an *estimated quantile (or value)* whose *true rank* will be
>>>>>>> within +/- the absolute fractional rank error returned by the
>>>>>>> function getNormalizedRankError(k, pmf) of the queried rank.  If the
>>>>>>> *result set* is a good one, i.e., one of the good 99%-of-the-time
>>>>>>> sketches, then the guarantee is that *all possible queries* will
>>>>>>> fall within the +/- rank error interval defined above.
>>>>>>>
>>>>>>> Because of the random sampling process used by the sketch, there is
>>>>>>> a 1% chance that the *result set* is bad.  We don't know how bad:
>>>>>>> it could mess up only one query, or it could mess up all queries
>>>>>>> from that bad *result set*.
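>>>>>>>
>>>>>>> In symbols (just restating the above, with eps denoting the value
>>>>>>> returned by getNormalizedRankError(k, pmf) and ranks normalized to
>>>>>>> [0, 1]):
>>>>>>>
>>>>>>>   Pr[ for all queries: |estimated rank - true rank| <= eps ] >= 0.99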
>>>>>>>
>>>>>>> This applies to both of your Q1 scenarios, given the caveat above,
>>>>>>> and is actually a stronger guarantee.
>>>>>>>
>>>>>>> There is no guarantee on the accuracy of values; however, you can
>>>>>>> get a sense of the range of values along the value axis that
>>>>>>> corresponds to the confidence interval along the rank axis.  For
>>>>>>> example: suppose your rank error is 1% and getQuantile(0.5) returns
>>>>>>> a value.  Now query getQuantile(0.5 - 0.01) and
>>>>>>> getQuantile(0.5 + 0.01), and that will
>>>>>>> give the corresponding range of values.
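>>>>>>>
>>>>>>> A minimal sketch of that recipe (assuming a sketch handle with the
>>>>>>> getQuantile(double rank) method described above; 0.01 stands in for
>>>>>>> whatever getNormalizedRankError(k, pmf) returned):
>>>>>>>
>>>>>>>   // Point estimate for the median, plus the value range that
>>>>>>>   // corresponds to the +/- rank-error confidence interval.
>>>>>>>   final double rankError = 0.01;  // example value
>>>>>>>   final double median = sketch.getQuantile(0.5);
>>>>>>>   final double lower  = sketch.getQuantile(0.5 - rankError);
>>>>>>>   final double upper  = sketch.getQuantile(0.5 + rankError);
>>>>>>>   // [lower, upper] is the value range bracketing the true median.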
>>>>>>>
>>>>>>> Q2. Your Q2 is confusing, because I think you are conflating the
>>>>>>> accuracy (1/epsilon) with the confidence (1/delta).  The Omega equation 
>>>>>>> is
>>>>>>> O(sqrt(log(1/delta))/epsilon).
>>>>>>> There is actually a code comment in the Java version that states:
>>>>>>>
>>>>>>>>   // constants were derived as the best fit to 99 percentile
>>>>>>>>   // empirically measured max error in thousands of trials
>>>>>>>>   public static double getNormalizedRankError(final int k,
>>>>>>>>       final boolean pmf) {
>>>>>>>>     return pmf
>>>>>>>>         ? 2.446 / pow(k, 0.9433)
>>>>>>>>         : 2.296 / pow(k, 0.9723);
>>>>>>>>   }
>>>>>>>
>>>>>>>
>>>>>>> The hidden constant values behind the Omega equation above are not
>>>>>>> discussed in the paper.  So we ran many thousands of trials to
>>>>>>> empirically determine what those constants are, and the result of
>>>>>>> that study is the equations given above.
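>>>>>>>
>>>>>>> If you want to go the other way (pick k for a target single-rank
>>>>>>> error eps), you can invert the non-pmf formula above.  This helper
>>>>>>> is just an illustration derived from those constants, not library
>>>>>>> code:
>>>>>>>
>>>>>>>   // Solve 2.296 / k^0.9723 = eps for k and round up.
>>>>>>>   static int kForTargetRankError(final double eps) {
>>>>>>>     return (int) Math.ceil(Math.pow(2.296 / eps, 1.0 / 0.9723));
>>>>>>>   }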
>>>>>>>
>>>>>>> Q3.  Space is initially limited to 3K, but for very long streams
>>>>>>> the space will grow very slowly ... something along the lines of
>>>>>>> *log_base8(N/k)*.  We need to put growth tables on the web site
>>>>>>> like we did for the older QuantilesSketch; we just haven't gotten
>>>>>>> around to it.
>>>>>>>
>>>>>>> You are correct that to quickly answer forward queries (like the
>>>>>>> Q(r)) we build a temporary, internal aux table.  If you issue
>>>>>>> multiple queries against the same sketch, this speeds up the query
>>>>>>> process considerably.  For all the other queries (R(q), CDF, PDF)
>>>>>>> this aux table is not needed.  Also, when the sketch is serialized,
>>>>>>> this aux table is not needed, as it can be reconstructed at any
>>>>>>> time.  You are correct that we could document this temporary size
>>>>>>> increase better.  However, the speed gain is considerable, and most
>>>>>>> of our users prefer to have the speed gain at query time and don't
>>>>>>> worry so much about the temporary allocation of the aux table.  This
>>>>>>> is clearly a speed/space optimization we made as part of the
>>>>>>> engineering of the design.
>>>>>>>
>>>>>>> I hope this helps!
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Lee.
>>>>>>>
>>>>>>> On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>> Hope you are well in these COVID times.
>>>>>>>>
>>>>>>>> I was going through the implementation of the KLL algorithm based on
>>>>>>>> the paper "Streaming Quantiles Algorithms with Small Space and Update
>>>>>>>> Time" and wanted to understand it better, with an eye toward
>>>>>>>> applications in databases. (I am using the cpp version on apache
>>>>>>>> github.)
>>>>>>>> I have a few questions related to the paper that I couldn't get my
>>>>>>>> head around; I would appreciate some help with these:
>>>>>>>>
>>>>>>>> Q1. What is the meaning of confidence in the KLL algorithm?
>>>>>>>>        Does it mean that once a sketch has been constructed and we
>>>>>>>> start asking quantile queries, confidence tells how many of these
>>>>>>>> queries will fall within the normalized error bound?
>>>>>>>>         OR
>>>>>>>>        If I construct the sketch n times on any data and ask the
>>>>>>>> same query q each time, confidence tells in how many cases out of
>>>>>>>> these n times q will fall within the normalized error bounds?
>>>>>>>>
>>>>>>>>        I tested the apache version with the deterministic bits that
>>>>>>>> you have under the KLL_VALIDATION flag.
>>>>>>>>        There, for many (different) queries, I got values beyond the
>>>>>>>> error bounds, and the number of such queries was way beyond a
>>>>>>>> confidence of 99%.
>>>>>>>>        The deterministic bits followed a pattern (010101..); the
>>>>>>>> random bit generation could also at some time generate such a
>>>>>>>> pattern.
>>>>>>>>        What do you do in such a case, where the error bounds and
>>>>>>>> confidence both will not hold?
>>>>>>>>
>>>>>>>> Q2. Though the paper mentions a relation between the error bound
>>>>>>>> and the buffer size k, as O(sqrt(log(1/e)) / e),
>>>>>>>>         what will be the actual relation between these two, with
>>>>>>>> the constants as well, when we try to apply the algorithm in
>>>>>>>> practice?
>>>>>>>>         In your implementation you have set the normalized error to
>>>>>>>> be equal to 2.296/k^0.9723.
>>>>>>>>         I want to understand how you got to this value.
>>>>>>>>
>>>>>>>> Q3. The space taken by the sketch is bounded by 3K, but the
>>>>>>>> algorithm will take more memory for storing weights to answer
>>>>>>>> quantile queries (after sorting across levels at the end of sketch
>>>>>>>> creation, otherwise the bounds might not hold). Why is this not
>>>>>>>> included in the memory taken by the sketch?
>>>>>>>> A practical application of the paper will need this memory, plus
>>>>>>>> additional temporary storage during CDF construction, to answer
>>>>>>>> queries.
>>>>>>>>
>>>>>>>> Appreciate your time.
>>>>>>>> Thank You,
>>>>>>>> Gourav Kumar
>>>>>>>>
>>>>>>>
>>>
>>> --
>>> Regards,
>>> Gourav
>>>
>>
