Perhaps it would be helpful to look at how we measure the error of the actual implementation, to make sure it behaves as the theory predicts. Some information, including a plot, is available here:
http://datasketches.apache.org/docs/Quantiles/KLLSketch.html
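The empirically fitted rank-error formula quoted later in this thread (from the Java KLL source) is easy to reproduce and tabulate on its own. A minimal standalone sketch — the class name is mine, but the constants are taken verbatim from the code comment quoted below in Lee's reply:

```java
// Standalone reproduction of the empirically fitted normalized rank
// error formula from the Java KLL source (constants quoted later in
// this thread). The class name is illustrative, not the library's.
public class KllRankError {

    // pmf == true uses the constants fitted for getPMF/getCDF queries;
    // pmf == false is for single-rank queries.
    public static double getNormalizedRankError(final int k, final boolean pmf) {
        return pmf
            ? 2.446 / Math.pow(k, 0.9433)
            : 2.296 / Math.pow(k, 0.9723);
    }

    public static void main(String[] args) {
        // Tabulate the 99%-confidence normalized rank error for a few k.
        for (int k = 8; k <= 1024; k *= 2) {
            System.out.printf("k=%4d  single-rank: %.4f  pmf: %.4f%n",
                k, getNormalizedRankError(k, false), getNormalizedRankError(k, true));
        }
    }
}
```

Note the near-linear decay in k: doubling k roughly halves the rank error, which is the practical knob behind the accuracy/space trade-off discussed below.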
The characterization code, some measurements, and Matlab scripts to plot the results are available here:
https://github.com/apache/incubator-datasketches-characterization

On Tue, Jun 23, 2020 at 11:30 AM Justin Thaler <justin.r.tha...@gmail.com> wrote:

> Hi Gourav,
>
> Q1. From a user's perspective, can you tell a priori whether the
> constructed sketch falls in the 1% category (bad case), or maybe when the
> user asks a get_quantile query give some sort of indication around
> confidence or error bound for that rank query. As a user relying on this
> algorithm, how can I be sure that I am not in the 1% case? If not, then
> since we don't have any bounds on rank in the bad case, will users be
> skeptical of using it?
>
> Before I answer this question directly, perhaps it will be helpful if I
> give more information about what is known about the error distribution of
> the KLL sketch.
>
> For any fixed universe item y, the estimated rank \hat{R}(y) of y derived
> from the KLL sketch will have essentially Gaussian error.
>
> More precisely, if R(y) is the true rank of y and \hat{R}(y) is the
> estimated rank of y as determined by the sketch, then Err(y) := \hat{R}(y)
> - R(y) is essentially a 0-mean Gaussian with bounded variance.
>
> (More precisely still, Err(y) is a weighted sum of Binomial random
> variables, and this sum is understood to behave very much like a 0-mean
> Gaussian.)
>
> The KLL paper shows that the standard deviation of Err(y) is low---at
> most roughly epsilon * n, where n is the stream length. This is low enough
> that one can conclude that, with probability at least 1-delta over the
> randomness used within the generation of the sketch, |Err(y)| is at most
> epsilon * n.
>
> The KLL paper itself states theorems of the form "With probability at
> least 1-delta, |Err(y)| is at most epsilon * n". This kind of statement
> does not directly refer to Gaussianity of the error distribution for the
> estimated rank of y.
> But even though the statement in quotes doesn't assert Gaussianity,
> Gaussianity indeed holds and was actually critical to proving the
> statement in quotes.
>
> To be clear, if y and y' are two different items, then the estimates
> \hat{R}(y) and \hat{R}(y') returned by a KLL sketch are *not* independent.
> Both estimates are roughly Gaussian-distributed individually, but these
> two Gaussians are correlated.
>
> Finally (and related to the above comment about estimates for different y
> and y' being correlated), the KLL paper concludes that Err(y) is likely to
> be low (at most epsilon*n) for all items y simultaneously via what is
> called an "epsilon-net argument". Essentially this argument says that, so
> long as the sketch returns accurate estimates for ~1/epsilon items of
> "evenly spaced ranks" (i.e., ranks epsilon*n, 2*epsilon*n, 3*epsilon*n,
> ..., n), then the sketch will return accurate estimates for all possible
> queries simultaneously. By what's known as a union bound, the probability
> that the sketch is inaccurate for any of these 1/epsilon items is at most
> a 1/epsilon factor larger than the probability it is inaccurate on any
> single fixed query. The space usage of the KLL sketch, namely
> O((1/epsilon) * sqrt{log(1/delta)}), grows sufficiently slowly with the
> parameter delta that decreasing delta by a factor of epsilon (to account
> for the aforementioned probability increase that comes from the union
> bound) has a negligible effect on the algorithm's space usage, at least so
> long as delta is less than epsilon.
>
> Now, regarding your question:
>
> Let's imagine the user just asks a single query, say, "give me the
> estimated rank \hat{R}(y) of some specific item y that I care about".
>
> Then the sketch returns the estimate \hat{R}(y), which as I've explained
> is essentially a Gaussian random variable with expected value equal to the
> true rank R(y), and standard deviation at most epsilon*n.
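Justin's description of Err(y) as a weighted sum of centered Binomial random variables can be illustrated with a toy Monte Carlo simulation. This is emphatically *not* the real KLL sketch — it is a simplified model in which each compaction at level h contributes an independent ±(2^h)/2 coin-flip error to the rank estimate, and the number of levels and compactions per level are made-up assumptions — but it exhibits the zero-mean, bounded-variance behavior described above:

```java
import java.util.Random;

// Toy model (NOT the real KLL sketch): the rank error is modeled as a
// weighted sum of fair +/-1 coin flips, one per compaction, where a
// compaction at level h carries weight 2^h. This mimics the "weighted
// sum of centered Binomials" that Justin describes; the level count
// and compactions-per-level are arbitrary illustrative choices.
public class ToyKllError {

    // Draw one realization of the modeled rank error.
    public static double sampleError(Random rnd, int levels, int compactionsPerLevel) {
        double err = 0.0;
        for (int h = 0; h < levels; h++) {
            double weight = 1 << h;   // items at level h have weight 2^h
            for (int c = 0; c < compactionsPerLevel; c++) {
                err += (rnd.nextBoolean() ? 1 : -1) * weight / 2.0;
            }
        }
        return err;
    }

    // Returns {empirical mean, empirical standard deviation} over many trials.
    public static double[] meanAndStd(long seed, int trials) {
        Random rnd = new Random(seed);
        double sum = 0, sumSq = 0;
        for (int t = 0; t < trials; t++) {
            double e = sampleError(rnd, 8, 16);
            sum += e;
            sumSq += e * e;
        }
        double mean = sum / trials;
        double var = sumSq / trials - mean * mean;
        return new double[] { mean, Math.sqrt(var) };
    }

    public static void main(String[] args) {
        double[] ms = meanAndStd(42L, 100_000);
        System.out.printf("empirical mean = %.3f, std = %.3f%n", ms[0], ms[1]);
    }
}
```

By the central limit theorem this weighted sum is approximately Gaussian, which is exactly the structure the "essentially Gaussian error" claim rests on.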
> If the estimate happens to wind up way out in the tail of that random
> variable, the error will be large, and there's no way for the user to
> recognize that situation. That is, the user can completely understand the
> *distribution* of the error in the estimate, but won't have any way of
> knowing, for any given draw from that distribution, what the error
> actually is.
>
> That said, there are standard techniques to increase confidence in the
> estimates, though these come at a price. For example, the user could
> simultaneously compute several (say 5 for concreteness) different
> independent copies of the sketch, and answer any query by taking the
> median of the 5 estimates returned by the 5 independent sketches. It is
> known that the probability the median of the estimates is "bad" is
> significantly less than the probability any individual estimate is "bad".
>
> On Tue, Jun 23, 2020 at 1:20 AM leerho <lee...@gmail.com> wrote:
>
>> Gourav,
>>
>> I will defer the answer to Q1 to either Justin or Edo, but the rest I
>> can answer.
>>
>> Q2: 99% was somewhat arbitrary and a compromise. It took a huge amount
>> of carefully constructed characterization studies to establish the 99%
>> confidence level, and refining it to a much finer level was left as a
>> future task.
>>
>> Q3: The minimum width set to 8 is also a speed/space compromise. The
>> paper suggests a minimum of 2, but with only 2 values per level the
>> process resembles random sampling, which the paper also discusses.
>> However, installing a sampler at the tail was going to add quite a bit
>> of complexity to the code, especially for merging. And frankly, at the
>> time, we were not confident that merging with the sampler at the tail
>> would be as robust. Again, this was left to a future algorithm that
>> could be proven to work in practical situations. Also, a long tail of
>> compactors of size 2 is computationally very wasteful and slow.
>> By widening the tail to 8 we were able to recover nearly all the speed
>> of the older quantiles algorithm with most of the benefit of the size
>> reduction of KLL. This is another example of an engineering design
>> decision where we were particularly concerned about speed performance.
>> This was not arbitrary; we did extensive characterization studies of
>> various widths (2, 4, 6, 8, 10, 12), and 8 seemed to be the happy
>> medium between speed, space, and simplicity of implementation.
>>
>> Q4. This is an easy question, but clearly not obvious. The whole idea
>> of lazy promotion of the compactors is to keep as much data as possible
>> at the low levels, as they have the most accurate representation of the
>> input stream. If you focus on promoting the higher levels first, you
>> end up degrading your accuracy faster.
>>
>> Cheers,
>>
>> Lee.
>>
>> On Mon, Jun 22, 2020 at 9:43 PM Gourav Kumar <gourav1...@gmail.com>
>> wrote:
>>
>>> Thanks Lee,
>>>
>>> This clears a lot of things. Although, I have a few more questions on
>>> this.
>>>
>>> Q1. From a user's perspective, can you tell a priori whether the
>>> constructed sketch falls in the 1% category (bad case), or maybe when
>>> the user asks a get_quantile query give some sort of indication around
>>> confidence or error bound for that rank query. As a user relying on
>>> this algorithm, how can I be sure that I am not in the 1% case? If
>>> not, then since we don't have any bounds on rank in the bad case, will
>>> users be skeptical of using it?
>>>
>>> Q2. How did you arrive at the confidence value of 99%? The paper
>>> mentions some formula for delta, but it's not clear to me if it will
>>> translate to 99%.
>>>
>>> Q3. Why is min width's (m_) default value set to 8? For compaction to
>>> be meaningful, in my opinion, having a min width value greater than 1
>>> would have been sufficient. The other reason which comes to my mind is
>>> faster convergence and a smaller number of compactions.
>>> In that case why 8, and why not any value larger than that?
>>>
>>> Q4. Also, in compaction, you always pick the first level which stores
>>> more elements than its capacity. Would guarantees be affected, or will
>>> it have any effect on the speed of the algorithm, if you change this
>>> strategy to, let's say, picking the level which overshoots its
>>> capacity by the largest amount? In my opinion, this strategy would
>>> free up more space and should lead to a smaller number of compactions,
>>> and hence lower empirical error.
>>>
>>> Thank You,
>>> Gourav
>>>
>>> On Tue, 23 Jun 2020 at 07:56, Alexander Saydakov
>>> <sayda...@verizonmedia.com> wrote:
>>>
>>>> Adding the original poster just in case he is not subscribed to the
>>>> list.
>>>>
>>>> On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
>>>>
>>>>> I see a typo: what I called the Omega relation is actually Omicron
>>>>> (big "Oh") :)
>>>>>
>>>>> On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler
>>>>> <justin.r.tha...@gmail.com> wrote:
>>>>>
>>>>>> Your response looks great to me.
>>>>>>
>>>>>> I suppose Gourav will respond if he is still confused on any
>>>>>> issues. And hopefully Edo will read this over at some point and
>>>>>> double-check both of our work :-)
>>>>>>
>>>>>> On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Gourav, welcome to this forum!
>>>>>>>
>>>>>>> I want to make sure you have access to and have read the code
>>>>>>> documentation for the KLL sketch, in addition to the papers.
>>>>>>> Although the code documentation exists for both Java and C++, it
>>>>>>> is a little easier to access the Javadocs, as they are accessible
>>>>>>> from the website. There is a long section about the KLL sketch
>>>>>>> <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html>
>>>>>>> and its implementation that may answer some of your questions.
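The lazy-promotion policy from Lee's Q4 answer above (always compact the *lowest* level that exceeds its capacity, so that data lingers at the accurate low levels) can be sketched with a toy level-size model. This is a deliberate simplification of the real compaction logic — the capacities and the promote-half step are illustrative only:

```java
import java.util.Arrays;

// Toy model of KLL-style lazy compaction order (NOT the real code):
// levels hold item counts; a compaction at level h removes its items
// and promotes half of them (with doubled weight) to level h+1. The
// policy illustrated is the one from Lee's Q4 answer: always compact
// the LOWEST level that exceeds its capacity.
public class LazyCompaction {

    // Compact the lowest over-capacity level, if any. Returns the level
    // index compacted, or -1 if nothing was over capacity.
    public static int compactOnce(int[] counts, int[] capacities) {
        for (int h = 0; h < counts.length; h++) {
            if (counts[h] > capacities[h]) {
                int promoted = counts[h] / 2;          // half survive, weight doubles
                counts[h] = counts[h] - 2 * promoted;  // 0 or 1 leftover item
                if (h + 1 < counts.length) {
                    counts[h + 1] += promoted;
                }
                return h;
            }
        }
        return -1;
    }

    // Total weighted item count (level h items have weight 2^h); the toy
    // compaction above preserves this exactly when the promoted items
    // land in an existing higher level.
    public static long weightedCount(int[] counts) {
        long total = 0;
        for (int h = 0; h < counts.length; h++) {
            total += (long) counts[h] << h;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = { 12, 9, 3, 0 };
        int[] capacities = { 8, 8, 8, 8 };
        System.out.println("before: " + Arrays.toString(counts));
        int level = compactOnce(counts, capacities);
        System.out.println("compacted level " + level + " -> " + Arrays.toString(counts));
    }
}
```

In this toy, level 1 is also over capacity, but the policy still compacts level 0 first — exactly the "keep data at the low levels as long as possible" behavior Lee describes.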
>>>>>>> It is important to emphasize that the error of the quantile
>>>>>>> sketch is only relevant to the rank and not the value. It is not
>>>>>>> clear from your question which you are referring to.
>>>>>>>
>>>>>>> 1. I asked Justin Thaler, one of our scientists, how to better
>>>>>>> explain the error "guarantee" of the sketch. Hopefully, he can
>>>>>>> respond here if what I write is not totally correct :)
>>>>>>>
>>>>>>> When you feed a stream of values into a KLL (or Quantiles) sketch,
>>>>>>> the result is actually a set of values (kept internally by the
>>>>>>> sketch) that have been "sampled" from the stream using a random
>>>>>>> process. Let's call this set a "*result set*". When we issue a
>>>>>>> query against the sketch, it analyzes this *result set* and
>>>>>>> returns a value (or a rank, depending on the query).
>>>>>>>
>>>>>>> The guarantee goes something like this:
>>>>>>>
>>>>>>> With 99% probability, the *result set* of a sketch will be a set
>>>>>>> such that a *getQuantile(rank)* query from that *result set* will
>>>>>>> produce an *estimated quantile (or value)* such that the *true
>>>>>>> rank of the true value* will be within +/- the absolute fractional
>>>>>>> rank error returned by the function getNormalizedRankError(k,
>>>>>>> pmf). If the *result set* is a good one, i.e., one of the good
>>>>>>> 99%-of-the-time sketches, then the guarantee is that *all possible
>>>>>>> queries* will fall within the +/- rank error interval defined
>>>>>>> above.
>>>>>>>
>>>>>>> Because of the random sampling process used by the sketch, there
>>>>>>> is a 1% chance that the *result set* is bad. We don't know how
>>>>>>> bad: it could mess up only one query, or it could mess up all
>>>>>>> queries from that bad *result set*.
>>>>>>>
>>>>>>> This applies to both of your Q1 scenarios, given the caveat above,
>>>>>>> and is actually a stronger guarantee.
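Lee's notion of a good "result set" — one for which *all* queries land within the ±epsilon rank interval — can be illustrated with a toy sketch that keeps a uniform random sample of the stream (plain random sampling, far cruder than KLL). For a good sample, the maximum rank error over a whole grid of queries stays small; the sample size, grid size, and threshold below are arbitrary illustrative choices:

```java
import java.util.Arrays;
import java.util.Random;

// Toy "result set": a uniform random sample of the stream (NOT a KLL
// sketch). The rank of a query item is estimated by scaling its rank
// within the sample. We then check the maximum rank error over a grid
// of queries, mirroring the "all possible queries" guarantee.
public class ToyResultSet {

    // Estimate the rank of q (count of stream items <= q) from a sorted sample.
    public static double estimateRank(double[] sortedSample, double q, long n) {
        int lo = 0, hi = sortedSample.length;   // binary search: first index > q
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedSample[mid] <= q) lo = mid + 1; else hi = mid;
        }
        return (double) lo / sortedSample.length * n;
    }

    // Max |estimated - true| rank error over a grid of query points, for
    // a stream consisting of the values 0..n-1 (so the true rank of q is
    // floor(q) + 1).
    public static double maxGridError(long seed, int n, int sampleSize, int gridPoints) {
        Random rnd = new Random(seed);
        double[] sample = new double[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = rnd.nextInt(n);   // uniform sample (with replacement)
        }
        Arrays.sort(sample);
        double maxErr = 0;
        for (int g = 1; g <= gridPoints; g++) {
            double q = (double) n * g / gridPoints - 1;
            double trueRank = Math.floor(q) + 1;
            maxErr = Math.max(maxErr, Math.abs(estimateRank(sample, q, n) - trueRank));
        }
        return maxErr;
    }

    public static void main(String[] args) {
        double maxErr = maxGridError(7L, 100_000, 4096, 100);
        System.out.printf("max rank error over grid = %.0f (n = 100000)%n", maxErr);
    }
}
```

The rare "bad result set" corresponds to an unlucky draw of the sample where this maximum spikes — which, as Lee says, the user cannot detect after the fact.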
>>>>>>> There is no guarantee on the accuracy of values; however, you can
>>>>>>> get a sense of the range of values along the value axis that
>>>>>>> corresponds to the confidence interval range along the rank axis.
>>>>>>> For example: suppose your rank error is 1% and getQuantile(0.5)
>>>>>>> returns some value. Now query getQuantile(0.5 - 0.01) and
>>>>>>> getQuantile(0.5 + 0.01), and that will give the corresponding
>>>>>>> range of values.
>>>>>>>
>>>>>>> Q2. Your Q2 is confusing, because I think you are conflating the
>>>>>>> accuracy (1/epsilon) with the confidence (1/delta). The Omega
>>>>>>> equation is O(sqrt(log(1/delta))/epsilon).
>>>>>>> There is actually a code comment in the Java version that states:
>>>>>>>
>>>>>>>> // constants were derived as the best fit to 99 percentile
>>>>>>>> // empirically measured max error in thousands of trials
>>>>>>>> public static double getNormalizedRankError(final int k,
>>>>>>>>     final boolean pmf) {
>>>>>>>>   return pmf
>>>>>>>>       ? 2.446 / pow(k, 0.9433)
>>>>>>>>       : 2.296 / pow(k, 0.9723);
>>>>>>>> }
>>>>>>>
>>>>>>> The hidden constant values behind the equation above are not
>>>>>>> discussed in the paper. So we ran many thousands of trials to
>>>>>>> empirically determine what those constants are, and the result of
>>>>>>> that study is the equations given.
>>>>>>>
>>>>>>> Q3. Space is initially limited to 3K, but for very long streams
>>>>>>> the space will grow very slowly ... something along the lines of
>>>>>>> *log_base8(N/k)*. We need to put growth tables on the web site
>>>>>>> like we did for the older QuantilesSketch; we just haven't gotten
>>>>>>> around to it.
>>>>>>>
>>>>>>> You are correct that to quickly answer forward queries (like Q(r))
>>>>>>> we build a temporary, internal aux table. If you have multiple
>>>>>>> queries from the same sketch, this speeds up the query process
>>>>>>> considerably.
>>>>>>> For all the other queries (R(q), CDF, PDF), this aux table is not
>>>>>>> needed. Also, when the sketch is serialized, this aux table is not
>>>>>>> needed, as it can be reconstructed at any time. You are correct
>>>>>>> that we could document this temporary size increase better.
>>>>>>> However, the speed gain is considerable, and most of our users
>>>>>>> prefer to have the speed gain at query time and don't worry so
>>>>>>> much about the temporary allocation of the aux table. This is
>>>>>>> clearly a speed/space optimization we made as part of the
>>>>>>> engineering of the design.
>>>>>>>
>>>>>>> I hope this helps!
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Lee.
>>>>>>>
>>>>>>> On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar
>>>>>>> <gourav1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>> Hope you are well in these COVID times.
>>>>>>>>
>>>>>>>> I was going through the implementation of the KLL algorithm based
>>>>>>>> on the paper "Streaming Quantiles Algorithms with Small Space and
>>>>>>>> Update Time" and wanted to understand it more, to understand
>>>>>>>> applications in databases (cpp version on apache github). I had a
>>>>>>>> few questions related to the paper which I couldn't get my head
>>>>>>>> around; I would appreciate some help on these:
>>>>>>>>
>>>>>>>> Q1. What is the meaning of confidence in the KLL algorithm?
>>>>>>>> Does it mean that once a sketch has been constructed, we start
>>>>>>>> asking quantile queries, and confidence will tell how many of
>>>>>>>> these queries will fall under the normalized error bound?
>>>>>>>> OR
>>>>>>>> If I construct the sketch n times on any data and ask the same
>>>>>>>> query q each time, confidence will tell in how many cases out of
>>>>>>>> these n times q will fall under the normalized error bounds?
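Gourav's two readings of "confidence" can be separated empirically with a toy randomized estimator (again plain random sampling, not KLL). Justin's answer above makes clear the 99% statement is of the second kind — over the randomness used in constructing the sketch, for a fixed query. The stream size, sample size, and error bound below are illustrative assumptions:

```java
import java.util.Random;

// Toy experiment for the second reading of "confidence": build the
// randomized summary many times, ask the SAME rank query each time,
// and measure the fraction of constructions whose error stays within
// the bound. (Plain random sampling stands in for KLL here.)
public class ConfidenceExperiment {

    // Fraction of independent constructions for which the estimated
    // rank of a fixed query lands within +/- bound of the true rank.
    public static double fractionWithinBound(long seed, int builds, int n,
                                             int sampleSize, double bound) {
        Random rnd = new Random(seed);
        int query = n / 2;              // fixed query item; true rank = n/2 + 1
        double trueRank = query + 1;
        int good = 0;
        for (int b = 0; b < builds; b++) {
            int atOrBelow = 0;          // sampled items <= query
            for (int i = 0; i < sampleSize; i++) {
                if (rnd.nextInt(n) <= query) atOrBelow++;
            }
            double estRank = (double) atOrBelow / sampleSize * n;
            if (Math.abs(estRank - trueRank) <= bound) good++;
        }
        return (double) good / builds;
    }

    public static void main(String[] args) {
        int n = 100_000, m = 1024;
        // roughly 3 standard deviations of the sampling error: n * 3 * 0.5 / sqrt(m)
        double bound = n * 3 * 0.5 / Math.sqrt(m);
        double frac = fractionWithinBound(123L, 2000, n, m, bound);
        System.out.printf("fraction of builds within bound: %.3f%n", frac);
    }
}
```

The fraction of builds that stay within the bound tracks the confidence level; for any single build, the user still cannot tell whether it is one of the unlucky ones — which is exactly the point Justin makes above.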
>>>>>>>> I tested the apache version with the deterministic bits that you
>>>>>>>> have under the KLL_VALIDATION flag. There, for many (different)
>>>>>>>> queries, I got values beyond the error bounds, and the number of
>>>>>>>> such queries was way beyond the 99% confidence. The deterministic
>>>>>>>> bits followed a pattern (010101...), and the random bit
>>>>>>>> generation could also at some point generate such a pattern. What
>>>>>>>> do you do in such a case, where the error bounds and confidence
>>>>>>>> both fail to hold?
>>>>>>>>
>>>>>>>> Q2. Though the paper mentions a relation between the error bound
>>>>>>>> and the buffer size k, as O(sqrt(log e) / e), what will be the
>>>>>>>> actual relation between these two, with respect to the constants
>>>>>>>> as well, when we try to get a practical application of the
>>>>>>>> algorithm? In your implementation you have set the normalized
>>>>>>>> error equal to 2.296/k^0.9723. I want to understand how you got
>>>>>>>> to this value.
>>>>>>>>
>>>>>>>> Q3. Space taken by the sketch is bounded by 3K, but the algorithm
>>>>>>>> will take more memory for storing weights to answer quantiles
>>>>>>>> (after sorting across levels at the end of sketch creation;
>>>>>>>> otherwise the bounds might not hold). Why are these not included
>>>>>>>> in the memory taken by the sketch? A practical application of the
>>>>>>>> paper will need this memory, and additional tmp storage space as
>>>>>>>> well during CDF construction, to answer queries.
>>>>>>>>
>>>>>>>> Appreciate your time.
>>>>>>>> Thank You,
>>>>>>>> Gourav Kumar
>>>
>>> --
>>> Regards,
>>> Gourav
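Justin's suggestion earlier in the thread — run several independent sketches and answer with the median of their estimates — is also easy to check in a toy setting. Here each "sketch estimate" is modeled as a Gaussian around the true rank (an assumption justified by the error-distribution discussion above); the 1.5-sigma threshold defining a "bad" estimate is arbitrary:

```java
import java.util.Arrays;
import java.util.Random;

// Toy check of the median-of-independent-sketches trick: each estimate
// is modeled as Gaussian around the true rank, per the thread's error
// discussion. An estimate is "bad" if it misses by more than 1.5 sigma.
// The median of 5 independent estimates should be bad far less often
// than a single estimate.
public class MedianTrick {

    // Returns {single-estimate failure rate, median-of-5 failure rate}.
    public static double[] failureRates(long seed, int trials, double sigma) {
        Random rnd = new Random(seed);
        double threshold = 1.5 * sigma;
        int singleBad = 0, medianBad = 0;
        for (int t = 0; t < trials; t++) {
            double[] est = new double[5];
            for (int i = 0; i < 5; i++) {
                est[i] = rnd.nextGaussian() * sigma;   // error of sketch i
            }
            if (Math.abs(est[0]) > threshold) singleBad++;
            Arrays.sort(est);
            if (Math.abs(est[2]) > threshold) medianBad++;  // median of 5
        }
        return new double[] { (double) singleBad / trials,
                              (double) medianBad / trials };
    }

    public static void main(String[] args) {
        double[] rates = failureRates(99L, 100_000, 1.0);
        System.out.printf("single bad: %.3f, median-of-5 bad: %.3f%n",
            rates[0], rates[1]);
    }
}
```

The price, as Justin notes, is 5x the space and update cost — the median must miss on at least 3 of the 5 independent estimates, which is why its failure probability drops so sharply.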