Adding the original poster, just in case he is not subscribed to the list.

On Mon, Jun 22, 2020 at 7:18 PM leerho <lee...@gmail.com> wrote:
I see a typo: what I called the Omega relation is actually Omicron (big "Oh") :)

On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <justin.r.tha...@gmail.com> wrote:

Your response looks great to me.

I suppose Gourav will respond if he is still confused about any issues. And hopefully Edo will read this over at some point and double-check both of our work :-)

On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:

Hello Gourav, welcome to this forum!

I want to make sure you have access to, and have read, the code documentation for the KLL sketch in addition to the papers. Although code documentation exists for both Java and C++, the Javadocs are a little easier to reach because they are accessible from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.

It is important to emphasize that the error of the quantile sketch applies only to the rank, not the value. It is not clear from your question which of the two you are referring to.

Q1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully he can respond here if what I write is not totally correct :)

When you feed a stream of values into a KLL (or Quantiles) sketch, the result is a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set a "result set". When we issue a query against the sketch, it analyzes this result set and returns a value (or a rank, depending on the query).
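The "sampled by a random process" step can be pictured with a toy, single-level compaction (my own illustrative sketch, not the actual DataSketches code): when a buffer fills, sort it, keep the odd- or even-indexed half chosen by a coin flip, and double the weight of the survivors. The retained items play the role of the "result set" described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy illustration only (NOT the real KLL implementation): one compaction
// step keeps a random half of a sorted buffer and doubles the survivors'
// weights, so the total represented weight is preserved.
public class ToyCompactor {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Integer> buffer = new ArrayList<>();
        for (int i = 0; i < 16; i++) buffer.add(rnd.nextInt(1000));

        Collections.sort(buffer);
        int offset = rnd.nextBoolean() ? 1 : 0;   // coin flip: keep odds or evens
        List<Integer> survivors = new ArrayList<>();
        for (int i = offset; i < buffer.size(); i += 2) {
            survivors.add(buffer.get(i));          // each survivor now has weight 2
        }
        // Half the items survive; represented weight is unchanged (8 * 2 = 16).
        System.out.println(survivors.size() + " survivors, weight " + survivors.size() * 2);
    }
}
```

The randomness of the coin flip is exactly what makes the guarantee probabilistic: a run of unlucky flips can produce a bad result set.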
The guarantee goes something like this:

With 99% probability, the result set of a sketch will be a set such that a getQuantile(rank) query against that result set will produce an estimated quantile (or value) such that the true rank of the true value will be within +/- the absolute fractional rank error returned by the function getNormalizedRankError(k, pmf). If the result set is a good one, i.e., one of the good 99%-of-the-time sketches, then the guarantee is that all possible queries will fall within the +/- rank-error interval defined above.

Because of the random sampling process used by the sketch, there is a 1% chance that the result set is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries from that bad result set.

This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee.

There is no guarantee on the accuracy of values; however, you can get a sense of the range of values along the value axis that corresponds to the confidence-interval range along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns a value. Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01), and that will give the corresponding range of values.

Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The big-O bound is O(sqrt(log(1/delta))/epsilon). There is actually a code comment in the Java version that states:

    // constants were derived as the best fit to 99 percentile empirically
    // measured max error in thousands of trials
    public static double getNormalizedRankError(final int k, final boolean pmf) {
      return pmf
          ? 2.446 / pow(k, 0.9433)
          : 2.296 / pow(k, 0.9723);
    }

The constant values hidden behind the big-O bound above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the equations given.

Q3. Space is initially limited to 3K, but for very long streams the space will grow very slowly ... something along the lines of log_base8(N/k). We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.

You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you issue multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries (R(q), CDF, PDF) this aux table is not needed. Also, when the sketch is serialized, this aux table is not needed, as it can be reconstructed at any time. You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer to have the speed gain at query time and don't worry so much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.

I hope this helps!

Cheers,

Lee.

On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote:

Hi All,
Hope you are well in these COVID times.

I was going through the implementation of the KLL algorithm based on the paper "Streaming Quantiles Algorithms with Small Space and Update Time" and wanted to understand it more deeply, with an eye toward applications in databases. (cpp version on apache github.)
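The fitted formulas in the code comment quoted above can be evaluated standalone. A quick numeric check (only the two constants are copied from that comment; the rest is my own scaffolding) shows that the library's default k = 200 gives roughly 1.33% rank error, and how one would bracket a median query by +/- that error as Lee describes:

```java
// Standalone check of the empirically fitted rank-error formulas quoted
// above; this is NOT the library code, just the same arithmetic.
public class RankErrorCheck {
    static double normalizedRankError(int k, boolean pmf) {
        return pmf ? 2.446 / Math.pow(k, 0.9433)
                   : 2.296 / Math.pow(k, 0.9723);
    }

    public static void main(String[] args) {
        int k = 200;  // the default k in the Java library
        double eps = normalizedRankError(k, false);
        // About 0.0133, i.e. roughly 1.33% rank error at the default k.
        System.out.printf("eps = %.4f%n", eps);
        // Value-axis confidence range for a median query: also query the
        // ranks 0.5 - eps and 0.5 + eps, per Lee's example above.
        System.out.printf("bracket = [%.4f, %.4f]%n", 0.5 - eps, 0.5 + eps);
    }
}
```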
I had a few questions related to the paper that I couldn't get my head around; I would appreciate some help on these:

Q1. What is the meaning of confidence in the KLL algorithm?
Does it mean that once a sketch has been constructed and we start asking quantile queries, confidence tells how many of these queries will fall within the normalized error bound?
OR
If I construct the sketch n times on any data and ask the same query q each time, confidence tells in how many cases out of these n times q will fall within the normalized error bounds?

I tested the apache version with the deterministic bits that you have under the KLL_VALIDATION flag. There, for many queries (different ones), I got values beyond the error bounds, and the number of such queries was way beyond a confidence of 99%. The deterministic bits followed a pattern (010101...); the random bit generation could also at some point generate such a pattern. What do you do in such a case, where error bounds and confidence both will not hold?

Q2. Though the paper mentions a relation between the error bound and the buffer size k, as O(sqrt(log e) / e), what will be the actual relation between these two, including the constants, in a practical application of the algorithm? In your implementation you have set the normalized error equal to 2.296/k^0.9723. I want to understand how you got to this value.

Q3. The space taken by the sketch is bounded by 3K, but the algorithm will take more memory for storing weights to answer quantiles (after sorting across levels at the end of sketch creation, otherwise the bounds might not hold). Why is this not included in the memory taken by the sketch? A practical application of the paper will need this memory, and additional temporary storage as well during CDF construction to answer queries.
Appreciate your time.
Thank you,
Gourav Kumar
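To get a rough feel for the log_base8(N/k) growth Lee mentions in his Q3 answer, here is a back-of-envelope estimate (my own, not the library's exact sizing tables): even a billion-item stream adds only a handful of levels beyond the base buffer.

```java
// Back-of-envelope illustration of slow sketch growth: the number of
// extra levels grows roughly like log base 8 of (N / k).
public class GrowthEstimate {
    public static void main(String[] args) {
        int k = 200;  // default sketch parameter
        for (long n : new long[] {1_000L, 1_000_000L, 1_000_000_000L}) {
            double levels = Math.log((double) n / k) / Math.log(8);
            System.out.printf("N = %d -> ~%.1f extra levels%n", n, levels);
        }
    }
}
```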