Adding the original poster, just in case he is not subscribed to the list.

On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
I see a typo: what I called the Omega relation is actually Omicron (big "Oh") :)

On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <justin.r.tha...@gmail.com> wrote:

Your response looks great to me.

I suppose Gourav will respond if he is still confused about any issues. And hopefully Edo will read this over at some point and double-check both of our work :-)

On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:

Hello Gourav, welcome to this forum!

I want to make sure you have access to, and have read, the code documentation for the KLL sketch in addition to the papers. Although code documentation exists for both Java and C++, the Javadocs are a little easier to reach because they are accessible from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.

It is important to emphasize that the error of the quantile sketch applies only to the rank, not the value. It is not clear from your question which of the two you are referring to.

Q1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully he can respond here if what I write is not totally correct :)

When you feed a stream of values into a KLL (or Quantiles) sketch, the result is a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set a "result set". When we issue a query against the sketch, it analyzes this result set and returns a value (or a rank, depending on the query).
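The "sampled by a random process" step can be pictured with a toy, single-level compaction (my own illustrative sketch, not the actual DataSketches code): when a buffer fills, sort it, keep the odd- or even-indexed half chosen by a coin flip, and double the weight of the survivors. The retained items play the role of the "result set" described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy illustration only (NOT the real KLL implementation): one compaction
// step keeps a random half of a sorted buffer and doubles the survivors'
// weights, so the total represented weight is preserved.
public class ToyCompactor {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Integer> buffer = new ArrayList<>();
        for (int i = 0; i < 16; i++) buffer.add(rnd.nextInt(1000));

        Collections.sort(buffer);
        int offset = rnd.nextBoolean() ? 1 : 0;   // coin flip: keep odds or evens
        List<Integer> survivors = new ArrayList<>();
        for (int i = offset; i < buffer.size(); i += 2) {
            survivors.add(buffer.get(i));          // each survivor now has weight 2
        }
        // Half the items survive; represented weight is unchanged (8 * 2 = 16).
        System.out.println(survivors.size() + " survivors, weight " + survivors.size() * 2);
    }
}
```

The randomness of the coin flip is exactly what makes the guarantee probabilistic: a run of unlucky flips can produce a bad result set.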
The guarantee goes something like this:

With 99% probability, the result set of a sketch will be a set such that a getQuantile(rank) query against that result set will produce an estimated quantile (or value) such that the true rank of the true value will be within +/- the absolute fractional rank error returned by the function getNormalizedRankError(k, pmf). If the result set is a good one, i.e., one of the good 99%-of-the-time sketches, then the guarantee is that all possible queries will fall within the +/- rank-error interval defined above.

Because of the random sampling process used by the sketch, there is a 1% chance that the result set is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries from that bad result set.

This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee.

There is no guarantee on the accuracy of values; however, you can get a sense of the range of values along the value axis that corresponds to the confidence-interval range along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns a value. Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01), and that will give the corresponding range of values.

Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The big-O bound is O(sqrt(log(1/delta))/epsilon). There is actually a code comment in the Java version that states:

    // constants were derived as the best fit to 99 percentile empirically
    // measured max error in thousands of trials
    public static double getNormalizedRankError(final int k, final boolean pmf) {
      return pmf
          ? 2.446 / pow(k, 0.9433)
          : 2.296 / pow(k, 0.9723);
    }

The constant values hidden behind the big-O bound above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the equations given.

Q3. Space is initially limited to 3K, but for very long streams the space will grow very slowly ... something along the lines of log_base8(N/k). We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.

You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you issue multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries (R(q), CDF, PDF) this aux table is not needed. Also, when the sketch is serialized, this aux table is not needed, as it can be reconstructed at any time. You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer to have the speed gain at query time and don't worry so much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.

I hope this helps!

Cheers,

Lee.

On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote:

Hi All,
Hope you are well in these COVID times.

I was going through the implementation of the KLL algorithm based on the paper "Streaming Quantiles Algorithms with Small Space and Update Time" and wanted to understand it more deeply, with an eye toward applications in databases. (cpp version on apache github.)
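The fitted formulas in the code comment quoted above can be evaluated standalone. A quick numeric check (only the two constants are copied from that comment; the rest is my own scaffolding) shows that the library's default k = 200 gives roughly 1.33% rank error, and how one would bracket a median query by +/- that error as Lee describes:

```java
// Standalone check of the empirically fitted rank-error formulas quoted
// above; this is NOT the library code, just the same arithmetic.
public class RankErrorCheck {
    static double normalizedRankError(int k, boolean pmf) {
        return pmf ? 2.446 / Math.pow(k, 0.9433)
                   : 2.296 / Math.pow(k, 0.9723);
    }

    public static void main(String[] args) {
        int k = 200;  // the default k in the Java library
        double eps = normalizedRankError(k, false);
        // About 0.0133, i.e. roughly 1.33% rank error at the default k.
        System.out.printf("eps = %.4f%n", eps);
        // Value-axis confidence range for a median query: also query the
        // ranks 0.5 - eps and 0.5 + eps, per Lee's example above.
        System.out.printf("bracket = [%.4f, %.4f]%n", 0.5 - eps, 0.5 + eps);
    }
}
```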
I had a few questions related to the paper that I couldn't get my head around; I would appreciate some help on these:

Q1. What is the meaning of confidence in the KLL algorithm?
Does it mean that once a sketch has been constructed and we start asking quantile queries, confidence tells how many of these queries will fall within the normalized error bound?
OR
If I construct the sketch n times on any data and ask the same query q each time, confidence tells in how many cases out of these n times q will fall within the normalized error bounds?

I tested the apache version with the deterministic bits that you have under the KLL_VALIDATION flag. There, for many queries (different ones), I got values beyond the error bounds, and the number of such queries was way beyond a confidence of 99%. The deterministic bits followed a pattern (010101...); the random bit generation could also at some point generate such a pattern. What do you do in such a case, where error bounds and confidence both will not hold?

Q2. Though the paper mentions a relation between the error bound and the buffer size k, as O(sqrt(log e) / e), what will be the actual relation between these two, including the constants, in a practical application of the algorithm? In your implementation you have set the normalized error equal to 2.296/k^0.9723. I want to understand how you got to this value.

Q3. The space taken by the sketch is bounded by 3K, but the algorithm will take more memory for storing weights to answer quantiles (after sorting across levels at the end of sketch creation, otherwise the bounds might not hold). Why is this not included in the memory taken by the sketch? A practical application of the paper will need this memory, and additional temporary storage as well during CDF construction to answer queries.
Appreciate your time.
Thank you,
Gourav Kumar
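To get a rough feel for the log_base8(N/k) growth Lee mentions in his Q3 answer, here is a back-of-envelope estimate (my own, not the library's exact sizing tables): even a billion-item stream adds only a handful of levels beyond the base buffer.

```java
// Back-of-envelope illustration of slow sketch growth: the number of
// extra levels grows roughly like log base 8 of (N / k).
public class GrowthEstimate {
    public static void main(String[] args) {
        int k = 200;  // default sketch parameter
        for (long n : new long[] {1_000L, 1_000_000L, 1_000_000_000L}) {
            double levels = Math.log((double) n / k) / Math.log(8);
            System.out.printf("N = %d -> ~%.1f extra levels%n", n, levels);
        }
    }
}
```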