Sorry for waking a dead thread; I found this in my drafts folder while cleaning it out, and wanted to share it so anyone interested can reuse these thoughts.
On Thu, 26 May 2022 at 03:19, Bruce Momjian <br...@momjian.us> wrote:
> I read this email today and participated in an unconference session on
> this topic today too. Let me explain what I learned.

My takeaways from this thread and that unconference session (other notes
from the session: [0]):

- Lossy count-distinct sketches require at least 100 "buckets" to get an
  RSE below 10%, and 1000 buckets for an RSE below 2%. For the most
  popular of these sketches, the RSE is within a constant factor of
  1 / sqrt(m) for m buckets, which is believed to be the theoretical
  limit for count-distinct sketches built on m buckets. RSE is the
  relative standard error, i.e. the accuracy of the estimate, so with
  the popular sketches doubling the accuracy requires 4x as many buckets
  (e.g. halving the error from 10% to 5% means going from 100 to 400
  buckets; see the small example calculation below my signature).
  A "bucket" is a distinct counting value, e.g. the log-counters in
  (H)LL, or the bits in HyperBitBit. Most sketches use anywhere from a
  few bits to several bytes per bucket: HLL uses 5 and 6 bits per bucket
  for 32-bit and 64-bit hashes, respectively.

- If we implement sketches, it would be preferable for them to support
  the common set operations [join, intersect, imply] while retaining
  their properties, so that we don't lose the increased accuracy when
  combining them. HLL does not support intersect or imply operations
  directly, which makes it a bad choice as an estimator of the number of
  rows returned.

- Bitmaps would be a good basis for an initial sketch-based statistics
  implementation. Assuming the implementation compresses these bitmaps,
  the average size should not exceed 900 bytes (plus administrative
  overhead) per MCV entry: 3 bytes per sampled row for its index in the
  bitmap, times 300 rows on average per MCV entry. Rows would be
  identified by their index in the sampled list of rows. (A rough,
  illustrative sketch of this idea is attached below my signature.)

- It is not clear we need to be able to combine statistics from multiple
  runs of ANALYZE. We considered that it is rare for people to analyze
  only a subset of columns, or to otherwise need to combine statistics
  from distinct table samples of the same table.

- Accurately combining statistics from two different runs of ANALYZE
  requires stable sampling and lossless tuple identification. The
  above-mentioned bitmap-based statistics would not allow this, because
  the index of a sampled row will likely shift between runs, even when
  that row appears in both samples.

Summary: we shouldn't use HLL; (compressed) bitmaps will work fine for
an initial implementation of combined sketch-based MCV estimates.

Kind regards,

Matthias van de Meent

[0] https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Improving_Statistics_Accuracy
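
PS. For anyone who wants a feel for the 1/sqrt(m) relation above, here
is a small, illustrative C program (not PostgreSQL code; the 1.04
constant is the usual HLL approximation, and other sketch families have
different constants):

/*
 * Illustrative only: relative standard error (RSE) of an HLL-style
 * count-distinct sketch as a function of the number of buckets m,
 * assuming the common RSE ~= 1.04 / sqrt(m) approximation.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    int     buckets[] = {100, 400, 1000, 4096};

    for (int i = 0; i < 4; i++)
    {
        double  rse = 1.04 / sqrt((double) buckets[i]);

        printf("m = %4d buckets -> RSE ~= %.1f%%\n",
               buckets[i], rse * 100.0);
    }
    return 0;
}

Under these assumptions, going from 100 to 400 buckets halves the error,
matching the "doubling accuracy costs 4x the buckets" rule of thumb.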
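
PPS. And a minimal, hypothetical sketch of the bitmap intersection idea
(illustrative names only, not an actual patch): each MCV entry keeps a
bitmap over the sampled rows, and the fraction of sampled rows matching
two entries at once falls out of AND-ing the bitmaps and counting bits.
A real implementation would presumably store these compressed (e.g. as
lists of 3-byte row indexes, per the size estimate above) rather than as
raw bitmaps; this uses the GCC/Clang __builtin_popcountll builtin.

/*
 * Hypothetical sketch: one bitmap per MCV entry, bit i set when sampled
 * row i matched that entry.  Intersecting two bitmaps estimates how
 * often both entries hold at once within the sample (~30000 rows with
 * the default statistics target).
 */
#include <stdint.h>
#include <stdio.h>

#define SAMPLE_ROWS  30000
#define BITMAP_WORDS ((SAMPLE_ROWS + 63) / 64)

typedef struct McvBitmap
{
    uint64_t    words[BITMAP_WORDS];
} McvBitmap;

static void
mcv_bitmap_set(McvBitmap *bm, int row)
{
    bm->words[row / 64] |= UINT64_C(1) << (row % 64);
}

/* count sampled rows matching both MCV entries */
static int
mcv_bitmap_intersect_count(const McvBitmap *a, const McvBitmap *b)
{
    int     count = 0;

    for (int i = 0; i < BITMAP_WORDS; i++)
        count += __builtin_popcountll(a->words[i] & b->words[i]);
    return count;
}

int
main(void)
{
    McvBitmap   col_a_mcv = {{0}};
    McvBitmap   col_b_mcv = {{0}};

    /* toy data: rows 0..299 matched column a's MCV entry ... */
    for (int row = 0; row < 300; row++)
        mcv_bitmap_set(&col_a_mcv, row);
    /* ... and every other row in 0..599 matched column b's MCV entry */
    for (int row = 0; row < 600; row += 2)
        mcv_bitmap_set(&col_b_mcv, row);

    int     both = mcv_bitmap_intersect_count(&col_a_mcv, &col_b_mcv);

    printf("rows matching both: %d of %d sampled (selectivity %.4f)\n",
           both, SAMPLE_ROWS, (double) both / SAMPLE_ROWS);
    return 0;
}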