Hello Gourav, welcome to this forum! I want to make sure you have access to and have read the code documentation for the KLL sketch in addition to the papers. Although the code documentation exists for both Java and C++, it is a little easier to access the Javadocs as they are accessible from the website. There is a long section about the KLL sketch <https://datasketches.apache.org/api/java/snapshot/apidocs/index.html> and its implementation that may answer some of your questions.
It is important to emphasize that the error of the quantile sketch applies only to the rank, not to the value. It is not clear which of the two your question refers to.

Q1. I asked Justin Thaler, one of our scientists, how to better explain the error "guarantee" of the sketch. Hopefully he can respond here if what I write is not totally correct :)

When you feed a stream of values into a KLL (or Quantiles) sketch, the result is actually a set of values (kept internally by the sketch) that have been "sampled" from the stream by a random process. Let's call this set the "*result set*". When we issue a query against the sketch, it analyzes this *result set* and returns a value (or a rank, depending on the query).

The guarantee goes something like this: with 99% probability, the *result set* of a sketch will be a set such that a *getQuantile(rank)* query against that *result set* will produce an *estimated quantile (or value)* such that the *true rank of the true value* will be within +/- the absolute fractional rank error returned by the function getNormalizedRankError(k, pmf). If the *result set* is a good one, i.e., one of the good 99%-of-the-time sketches, then the guarantee is that *all possible queries* will fall within the +/- rank error interval defined above. Because of the random sampling process used by the sketch, there is a 1% chance that the *result set* is bad. We don't know how bad: it could mess up only one query, or it could mess up all queries against that bad *result set*. This applies to both of your Q1 scenarios, given the caveat above, and is actually a stronger guarantee.

There is no guarantee on the accuracy of values; however, you can get a sense of the range of values along the value axis that corresponds to the confidence interval along the rank axis. For example: suppose your rank error is 1% and getQuantile(0.5) returns some value.
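To make that value-range trick concrete, here is a small self-contained Java illustration. It uses an exact sorted array in place of the sketch's internal result set, so the quantile function below is exact rather than an approximation, and the class and method names (RankBracket, quantile) are made up for the illustration only:

```java
public class RankBracket {
    // Exact quantile from a sorted array: the value whose rank is nearest r.
    static double quantile(double[] sorted, double r) {
        int i = (int) Math.round(r * (sorted.length - 1));
        return sorted[i];
    }

    public static void main(String[] args) {
        // Stand-in for the sketch's result set: 0, 1, ..., 999 (already sorted).
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) data[i] = i;

        double eps = 0.01;                      // +/- 1% rank error
        double v   = quantile(data, 0.5);       // point estimate at rank 0.5
        double lo  = quantile(data, 0.5 - eps); // query at rank - eps
        double hi  = quantile(data, 0.5 + eps); // query at rank + eps
        System.out.println(lo + " <= " + v + " <= " + hi);
    }
}
```

With a real KLL sketch the same two extra queries at rank +/- eps give you the value interval that corresponds to the rank-error interval.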
Now query getQuantile(0.5 - 0.01) and getQuantile(0.5 + 0.01); the two results give the corresponding range of values.

Q2. Your Q2 is confusing, because I think you are conflating the accuracy (1/epsilon) with the confidence (1/delta). The asymptotic bound is O(sqrt(log(1/delta))/epsilon). There is a code comment in the Java version that states:

  // constants were derived as the best fit to 99 percentile empirically measured
  // max error in thousands of trials
  public static double getNormalizedRankError(final int k, final boolean pmf) {
    return pmf
        ? 2.446 / pow(k, 0.9433)
        : 2.296 / pow(k, 0.9723);
  }

The hidden constants behind the asymptotic bound above are not discussed in the paper. So we ran many thousands of trials to empirically determine what those constants are, and the result of that study is the pair of equations given.

Q3. Space is initially limited to 3K, but for very long streams the space will grow very slowly ... something along the lines of *log_base8(N/k)*. We need to put growth tables on the web site like we did for the older QuantilesSketch; we just haven't gotten around to it.

You are correct that to quickly answer forward queries (like Q(r)) we build a temporary, internal aux table. If you issue multiple queries against the same sketch, this speeds up the query process considerably. For all the other queries, R(q), CDF, and PDF, this aux table is not needed. It is also not needed when the sketch is serialized, since it can be reconstructed at any time.

You are correct that we could document this temporary size increase better. However, the speed gain is considerable, and most of our users prefer the speed gain at query time and don't worry so much about the temporary allocation of the aux table. This is clearly a speed/space optimization we made as part of the engineering of the design.

I hope this helps!

Cheers, Lee.
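P.S. If you want to plug in numbers, the formula from that code comment is easy to evaluate on its own. The snippet below is a self-contained copy of the two empirical equations (java.lang.Math only, no sketch library needed); for the default k = 200 it comes out to roughly 1.33% for single-quantile queries and roughly 1.65% for the pmf/cdf case:

```java
public class KllRankError {
    // Empirical constants, quoted from the Java implementation's code comment:
    // best fit to the 99th-percentile measured max error over thousands of trials.
    static double getNormalizedRankError(final int k, final boolean pmf) {
        return pmf
            ? 2.446 / Math.pow(k, 0.9433)
            : 2.296 / Math.pow(k, 0.9723);
    }

    public static void main(String[] args) {
        // Default KLL k = 200.
        System.out.printf("k=200, single quantile: %.4f%n",
            getNormalizedRankError(200, false)); // ~0.0133
        System.out.printf("k=200, pmf/cdf:         %.4f%n",
            getNormalizedRankError(200, true));  // ~0.0165
    }
}
```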
On Sun, Jun 21, 2020 at 11:13 AM Gourav Kumar <gourav1...@gmail.com> wrote: > Hi All, > Hope you are well in these COVID times. > > I was going through implementation of the KLL algorithm based on the paper > "Streaming Quantiles Algorithms with Small Space and Update Time" and > wanted to understand it more to understand applications in databases. > (cpp version on apache github.) > I had a few questions related to the paper on which I couldn't get my head > around, I would appreciate if I could get some help on these: > > Q1. What is the meaning of confidence in the KLL algorithm? > Does it mean that once a sketch has been constructed, we start > asking quantile queries. Confidence will tell how many of these queries > will fall under the normalized error bound. > OR > If I construct the sketch n times on any data and ask the same > query q each time, confidence will tell in how many cases out of these n > times q will fall under the normalized error bounds. > > I tested the apache version with deterministic bits, that you have > under KLL_VALIDATION flag. > There for many queries(different ones) I got values beyond error > bounds, and the number of such queries was way beyond confidence of 99%. > The deterministic bits followed a pattern (010101..), the random > bit generation could also at some time generate such a pattern. > What do you do in such a case, where error bounds and confidence > both will not hold? > > Q2. Though the paper mentions a relation between error bound and buffer > size k, as O(sqrt(log e) / e)) > What will be the actual relation between these two wrt constants > as well, when we try to get a practical application of the algorithm. > In your implementation you have added normalized error to be equal > to 2.296/k^0.9723. > I want to understand how you got to this value. > > Q3. 
Space taken by sketch is bounded by 3K, but the algorithm will take > more memory for storing weights to answer quantiles (after sorting across > levels at the end of sketch creation, otherwise bounds might not hold). Why > are these not included in memory taken by sketch? > As a practical application of the paper will need these memories and an > additional tmp storage space as well during cdf construction to answer > queries. > > Appreciate your time. > Thank You, > Gourav Kumar >