I see a typo: What I called the Omega relation is actually Omicron (big "Oh") :)
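Written out, the relation in question (quoted from my reply below) is

    k = O( sqrt(log(1/delta)) / epsilon )

that is, an upper bound (big "Oh"), not a lower bound (big Omega).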
On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <justin.r.tha...@gmail.com> wrote:

Your response looks great to me.

I suppose Gourav will respond if he is still confused on any issues. And hopefully Edo will read this over at some point and double-check both of our work :-)

On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:

Hello Gourav, welcome to this forum!

I want to make sure you have access to, and have read, the code documentation for the KLL sketch in addition to the papers. The code documentation exists for both Java and C++, but the Javadocs are a little easier to get to because they are linked from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.

It is important to emphasize that the error of the quantile sketch applies only to the rank, not to the value. It is not clear from your question which of the two you are referring to.

Q1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully, he can respond here if what I write is not totally correct :)

When you feed a stream of values into a KLL (or Quantiles) sketch, the result is actually a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set a "result set". When we issue a query against the sketch, it analyzes this result set and returns a value (or a rank, depending on the query).

The guarantee goes something like this:

With 99% probability, the result set of the sketch will be such that a getQuantile(rank) query against it returns an estimated quantile (or value) whose true rank is within +/- the absolute fractional rank error, returned by the function getNormalizedRankError(k, pmf), of the requested rank. Moreover, if the result set is a good one, i.e., one of the good 99%-of-the-time sketches, then all possible queries will fall within that +/- rank-error interval.

Because of the random sampling process used by the sketch, there is a 1% chance that the result set is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries from that bad result set.

This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee: the 99% confidence covers all queries against the sketch at once, not each query separately.

There is no guarantee on the accuracy of values. However, you can get a sense of the range of values along the value axis that corresponds to the confidence interval along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns some value. Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01); those two results bracket the corresponding range of values.

Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The Omega equation is k = O(sqrt(log(1/delta)) / epsilon). There is actually a code comment in the Java version that states:

    // constants were derived as the best fit to 99 percentile empirically measured max error
    // in thousands of trials
    public static double getNormalizedRankError(final int k, final boolean pmf) {
      return pmf
          ? 2.446 / pow(k, 0.9433)
          : 2.296 / pow(k, 0.9723);
    }
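To make this concrete, here is a minimal sketch of bracketing a median estimate in Java. It assumes the org.apache.datasketches.kll.KllFloatsSketch API roughly as of mid-2020; depending on your release you may need KllFloatsSketch.newHeapInstance(k) instead of the constructor, so treat it as illustrative rather than copy-paste ready:

    import org.apache.datasketches.kll.KllFloatsSketch;

    public class KllRankErrorExample {
      public static void main(final String[] args) {
        final int k = 200; // the library default; larger k means smaller rank error
        final KllFloatsSketch sketch = new KllFloatsSketch(k);

        // Feed a stream of values; 1..1,000,000 here just for illustration.
        for (int i = 1; i <= 1_000_000; i++) { sketch.update(i); }

        // Rank error for single-quantile queries (pmf = false):
        // 2.296 / 200^0.9723, or roughly 1.3%.
        final double eps = KllFloatsSketch.getNormalizedRankError(k, false);

        // Estimated median, plus the values at ranks 0.5 +/- eps. With 99%
        // probability the value whose true rank is 0.5 lies in [lo, hi].
        final float median = sketch.getQuantile(0.5);
        final float lo = sketch.getQuantile(0.5 - eps);
        final float hi = sketch.getQuantile(0.5 + eps);

        System.out.printf("median ~ %.0f, rank error +/- %.4f, value interval [%.0f, %.0f]%n",
            median, eps, lo, hi);
      }
    }

Note that [lo, hi] is an interval of values derived from the rank-error interval; the sketch makes no direct claim about value error.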
The hidden constant values behind the Omega equation above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the pair of equations given.

Q3. Space is initially limited to about 3k, but for very long streams the space will grow very slowly ... something along the lines of log_base8(N/k). We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.
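As a rough, back-of-the-envelope illustration of that growth law (this helper is just my approximation of the log_base8(N/k) statement above, not a library API):

    // Rough illustration only: the retained-item overhead beyond the initial
    // ~3k items grows roughly like log base 8 of (N/k).
    static double approxExtraLevels(final long n, final int k) {
      return Math.log((double) n / k) / Math.log(8.0);
    }

    // For k = 200: N = 10^6 -> ~4.1, N = 10^9 -> ~7.4, N = 10^12 -> ~10.7.
    // So even a trillion-item stream adds only on the order of ten levels.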
You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you issue multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries (R(q), CDF, PDF) this aux table is not needed. Also, the aux table is not serialized with the sketch, since it can be reconstructed at any time. You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer the speed gain at query time and don't worry much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.

I hope this helps!

Cheers,

Lee.

On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote:

Hi All,
Hope you are well in these COVID times.

I was going through the implementation of the KLL algorithm (the C++ version on the Apache GitHub), based on the paper "Streaming Quantiles Algorithms with Small Space and Update Time", and wanted to understand it better with a view to applications in databases. I have a few questions about the paper that I couldn't get my head around; I would appreciate some help with these:

Q1. What is the meaning of confidence in the KLL algorithm?
Does it mean that, once a sketch has been constructed and we start asking quantile queries, confidence tells how many of those queries will fall within the normalized error bound?
OR
If I construct the sketch n times on any data and ask the same query q each time, does confidence tell in how many of those n cases q will fall within the normalized error bound?

I tested the Apache version with the deterministic bits that you have under the KLL_VALIDATION flag. There, for many (different) queries I got values beyond the error bounds, and the number of such queries was well beyond a confidence of 99%. The deterministic bits follow a pattern (010101...), and the random bit generation could at some point produce such a pattern too. What do you do in such a case, where neither the error bounds nor the confidence hold?

Q2. Though the paper gives the relation between the error bound and the buffer size k as O(sqrt(log e) / e), what is the actual relation between the two, constants included, for a practical application of the algorithm? In your implementation you set the normalized error to 2.296/k^0.9723; I want to understand how you arrived at this value.

Q3. The space taken by the sketch is bounded by 3k, but the algorithm needs more memory for storing weights to answer quantiles (after sorting across levels at the end of sketch creation; otherwise the bounds might not hold). Why is this memory not included in the space taken by the sketch? A practical application of the paper will need this memory, plus additional temporary storage during CDF construction to answer queries.

Appreciate your time.
Thank You,
Gourav Kumar