I see a typo: What I called the Omega relation is actually Omicron (big "Oh") :)
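Written out, the relation in question (quoted from my reply below) is

    k = O( sqrt(log(1/delta)) / epsilon )

that is, an upper bound (big "Oh"), not a lower bound (big Omega).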
On Mon, Jun 22, 2020 at 6:22 PM Justin Thaler <justin.r.tha...@gmail.com> wrote:

Your response looks great to me.

I suppose Gourav will respond if he is still confused on any issues. And hopefully Edo will read this over at some point and double-check both of our work :-)

On Mon, Jun 22, 2020 at 9:14 PM leerho <lee...@gmail.com> wrote:

Hello Gourav, welcome to this forum!

I want to make sure you have access to, and have read, the code documentation for the KLL sketch in addition to the papers. The code documentation exists for both Java and C++, but the Javadocs are a little easier to get to because they are linked from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.

It is important to emphasize that the error of the quantile sketch applies only to the rank, not to the value. It is not clear from your question which of the two you are referring to.

Q1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully, he can respond here if what I write is not totally correct :)

When you feed a stream of values into a KLL (or Quantiles) sketch, the result is actually a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set a "result set". When we issue a query against the sketch, it analyzes this result set and returns a value (or a rank, depending on the query).

The guarantee goes something like this:

With 99% probability, the result set of the sketch will be such that a getQuantile(rank) query against it returns an estimated quantile (or value) whose true rank is within +/- the absolute fractional rank error, returned by the function getNormalizedRankError(k, pmf), of the requested rank. Moreover, if the result set is a good one, i.e., one of the good 99%-of-the-time sketches, then all possible queries will fall within that +/- rank-error interval.

Because of the random sampling process used by the sketch, there is a 1% chance that the result set is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries from that bad result set.

This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee: the 99% confidence covers all queries against the sketch at once, not each query separately.

There is no guarantee on the accuracy of values. However, you can get a sense of the range of values along the value axis that corresponds to the confidence interval along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns some value. Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01); those two results bracket the corresponding range of values.

Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The Omega equation is k = O(sqrt(log(1/delta)) / epsilon). There is actually a code comment in the Java version that states:

    // constants were derived as the best fit to 99 percentile empirically measured max error
    // in thousands of trials
    public static double getNormalizedRankError(final int k, final boolean pmf) {
      return pmf
          ? 2.446 / pow(k, 0.9433)
          : 2.296 / pow(k, 0.9723);
    }
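To make this concrete, here is a minimal sketch of bracketing a median estimate in Java. It assumes the org.apache.datasketches.kll.KllFloatsSketch API roughly as of mid-2020; depending on your release you may need KllFloatsSketch.newHeapInstance(k) instead of the constructor, so treat it as illustrative rather than copy-paste ready:

    import org.apache.datasketches.kll.KllFloatsSketch;

    public class KllRankErrorExample {
      public static void main(final String[] args) {
        final int k = 200; // the library default; larger k means smaller rank error
        final KllFloatsSketch sketch = new KllFloatsSketch(k);

        // Feed a stream of values; 1..1,000,000 here just for illustration.
        for (int i = 1; i <= 1_000_000; i++) { sketch.update(i); }

        // Rank error for single-quantile queries (pmf = false):
        // 2.296 / 200^0.9723, or roughly 1.3%.
        final double eps = KllFloatsSketch.getNormalizedRankError(k, false);

        // Estimated median, plus the values at ranks 0.5 +/- eps. With 99%
        // probability the value whose true rank is 0.5 lies in [lo, hi].
        final float median = sketch.getQuantile(0.5);
        final float lo = sketch.getQuantile(0.5 - eps);
        final float hi = sketch.getQuantile(0.5 + eps);

        System.out.printf("median ~ %.0f, rank error +/- %.4f, value interval [%.0f, %.0f]%n",
            median, eps, lo, hi);
      }
    }

Note that [lo, hi] is an interval of values derived from the rank-error interval; the sketch makes no direct claim about value error.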
The hidden constant values behind the Omega equation above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the pair of equations given.

Q3. Space is initially limited to about 3k, but for very long streams the space will grow very slowly ... something along the lines of log_base8(N/k). We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.
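As a rough, back-of-the-envelope illustration of that growth law (this helper is just my approximation of the log_base8(N/k) statement above, not a library API):

    // Rough illustration only: the retained-item overhead beyond the initial
    // ~3k items grows roughly like log base 8 of (N/k).
    static double approxExtraLevels(final long n, final int k) {
      return Math.log((double) n / k) / Math.log(8.0);
    }

    // For k = 200: N = 10^6 -> ~4.1, N = 10^9 -> ~7.4, N = 10^12 -> ~10.7.
    // So even a trillion-item stream adds only on the order of ten levels.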
You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you issue multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries (R(q), CDF, PDF) this aux table is not needed. Also, the aux table is not serialized with the sketch, since it can be reconstructed at any time. You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer the speed gain at query time and don't worry much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.

I hope this helps!

Cheers,

Lee.

On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote:

Hi All,
Hope you are well in these COVID times.

I was going through the implementation of the KLL algorithm (the C++ version on the Apache GitHub), based on the paper "Streaming Quantiles Algorithms with Small Space and Update Time", and wanted to understand it better with a view to applications in databases. I have a few questions about the paper that I couldn't get my head around; I would appreciate some help with these:

Q1. What is the meaning of confidence in the KLL algorithm?
Does it mean that, once a sketch has been constructed and we start asking quantile queries, confidence tells how many of those queries will fall within the normalized error bound?
OR
If I construct the sketch n times on any data and ask the same query q each time, does confidence tell in how many of those n cases q will fall within the normalized error bound?

I tested the Apache version with the deterministic bits that you have under the KLL_VALIDATION flag. There, for many (different) queries I got values beyond the error bounds, and the number of such queries was well beyond a confidence of 99%. The deterministic bits follow a pattern (010101...), and the random bit generation could at some point produce such a pattern too. What do you do in such a case, where neither the error bounds nor the confidence hold?

Q2. Though the paper gives the relation between the error bound and the buffer size k as O(sqrt(log e) / e), what is the actual relation between the two, constants included, for a practical application of the algorithm? In your implementation you set the normalized error to 2.296/k^0.9723; I want to understand how you arrived at this value.

Q3. The space taken by the sketch is bounded by 3k, but the algorithm needs more memory for storing weights to answer quantiles (after sorting across levels at the end of sketch creation; otherwise the bounds might not hold). Why is this memory not included in the space taken by the sketch? A practical application of the paper will need this memory, plus additional temporary storage during CDF construction to answer queries.

Appreciate your time.
Thank You,
Gourav Kumar