Hello Gourav, welcome to this forum! I want to make sure you have access to and have read the code documentation for the KLL sketch in addition to the papers. Although the code documentation exists for both Java and C++, it is a little easier to access the Javadocs as they are accessible from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.
It is important to emphasize that the error of the quantile sketch applies only to the rank, not to the value. It is not clear which of the two your question refers to.

Q1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully he can respond here if what I write is not totally correct :)

When you feed a stream of values into a KLL (or Quantiles) sketch, the result is actually a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set the "*result set*". When we issue a query against the sketch, it analyzes this *result set* and returns a value (or a rank, depending on the query).

The guarantee goes something like this: with 99% probability, the *result set* of a sketch will be a set such that a *getQuantile(rank)* query against that *result set* will produce an *estimated quantile (or value)* such that the *true rank of the true value* will be within +/- the absolute fractional rank error returned by the function getNormalizedRankError(k, pmf). If the *result set* is a good one, i.e., one of the good 99%-of-the-time sketches, then the guarantee is that *all possible queries* will fall within the +/- rank error interval defined above. Because of the random sampling process used by the sketch, there is a 1% chance that the *result set* is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries against that bad *result set*. This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee.

There is no guarantee on the accuracy of values; however, you can get a sense of the range of values along the value axis that corresponds to the confidence interval along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns some value.
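To make that value-range trick concrete, here is a small self-contained Java illustration. It uses an exact sorted array in place of the sketch's internal result set, so the quantile function below is exact rather than an approximation, and the class and method names (RankBracket, quantile) are made up for the illustration only:

```java
public class RankBracket {
    // Exact quantile from a sorted array: the value whose rank is nearest r.
    static double quantile(double[] sorted, double r) {
        int i = (int) Math.round(r * (sorted.length - 1));
        return sorted[i];
    }

    public static void main(String[] args) {
        // Stand-in for the sketch's result set: 0, 1, ..., 999 (already sorted).
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        double eps = 0.01;                      // +/- 1% rank error
        double v   = quantile(data, 0.5);       // point estimate at rank 0.5
        double lo  = quantile(data, 0.5 - eps); // query at rank - eps
        double hi  = quantile(data, 0.5 + eps); // query at rank + eps
        System.out.println(lo + " <= " + v + " <= " + hi);
    }
}
```

With a real KLL sketch the same two extra queries at rank +/- eps give you the value interval that corresponds to the rank-error interval.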
Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01); the two results give the corresponding range of values.

Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The asymptotic bound is O(sqrt(log(1/delta))/epsilon). There is a code comment in the Java version that states:

  // constants were derived as the best fit to 99 percentile empirically measured
  // max error in thousands of trials
  public static double getNormalizedRankError(final int k, final boolean pmf) {
    return pmf
        ? 2.446 / pow(k, 0.9433)
        : 2.296 / pow(k, 0.9723);
  }

The hidden constants behind the asymptotic bound above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the pair of equations given.

Q3. Space is initially limited to 3K, but for very long streams the space will grow very slowly ... something along the lines of *log_base8(N/k)*. We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.

You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you issue multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries, R(q), CDF, and PDF, this aux table is not needed. It is also not needed when the sketch is serialized, since it can be reconstructed at any time.

You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer the speed gain at query time and don't worry so much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.

I hope this helps!

Cheers, Lee.
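P.S. If you want to plug in numbers, the formula from that code comment is easy to evaluate on its own. The snippet below is a self-contained copy of the two empirical equations (java.lang.Math only, no sketch library needed); for the default k = 200 it comes out to roughly 1.33% for single-quantile queries and roughly 1.65% for the pmf/cdf case:

```java
public class KllRankError {
    // Empirical constants, quoted from the Java implementation's code comment:
    // best fit to the 99th-percentile measured max error over thousands of trials.
    static double getNormalizedRankError(final int k, final boolean pmf) {
        return pmf
            ? 2.446 / Math.pow(k, 0.9433)
            : 2.296 / Math.pow(k, 0.9723);
    }

    public static void main(String[] args) {
        // Default KLL k = 200.
        System.out.printf("k=200, single quantile: %.4f%n",
            getNormalizedRankError(200, false)); // ~0.0133
        System.out.printf("k=200, pmf/cdf:         %.4f%n",
            getNormalizedRankError(200, true));  // ~0.0165
    }
}
```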
On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote: > Hi All, > Hope you are well in these COVID times. > > I was going through implementation of the KLL algorithm based on the paper > "Streaming Quantiles Algorithms with Small Space and Update Time" and > wanted to understand it more to understand applications in databases. > (cpp version on apache github.) > I had a few questions related to the paper on which I couldn't get my head > around, I would appreciate if I could get some help on these: > > Q1. What is the meaning of confidence in the KLL algorithm? > Does it mean that once a sketch has been constructed, we start > asking quantile queries. Confidence will tell how many of these queries > will fall under the normalized error bound. > OR > If I construct the sketch n times on any data and ask the same > query q each time, confidence will tell in how many cases out of these n > times q will fall under the normalized error bounds. > > I tested the apache version with deterministic bits, that you have > under KLL_VALIDATION flag. > There for many queries(different ones) I got values beyond error > bounds, and the number of such queries was way beyond confidence of 99%. > The deterministic bits followed a pattern (010101..), the random > bit generation could also at some time generate such a pattern. > What do you do in such a case, where error bounds and confidence > both will not hold? > > Q2. Though the paper mentions a relation between error bound and buffer > size k, as O(sqrt(log e) / e)) > What will be the actual relation between these two wrt constants > as well, when we try to get a practical application of the algorithm. > In your implementation you have added normalized error to be equal > to 2.296/k^0.9723. > I want to understand how you got to this value. > > Q3. 
Space taken by sketch is bounded by 3K, but the algorithm will take > more memory for storing weights to answer quantiles (after sorting across > levels at the end of sketch creation, otherwise bounds might not hold). Why > are these not included in memory taken by sketch? > As a practical application of the paper will need these memories and an > additional tmp storage space as well during cdf construction to answer > queries. > > Appreciate your time. > Thank You, > Gourav Kumar >