Sorry for waking a dead thread; I found this in my drafts folder while cleaning it out, and wanted to share it so anyone interested can reuse these thoughts.
On Thu, 26 May 2022 at 03:19, Bruce Momjian <br...@momjian.us> wrote:
> I read this email today and participated in an unconference session on
> this topic today too. Let me explain what I learned.

My takeaways from this thread and that unconference session (other notes
from the session: [0]):

- Lossy count-distinct sketches require at least 100 "buckets" to get an
  RSE below 10%, and 1000 buckets for an RSE below 2%. For the most
  popular of these sketches, the RSE is within a constant factor of
  1 / sqrt(m) for m buckets, which is believed to be the theoretical
  limit for count-distinct sketches built on m buckets. RSE is the
  relative standard error, i.e. the accuracy of the estimate, so with
  the popular sketches doubling the accuracy requires 4x as many buckets
  (e.g. halving the error from 10% to 5% means going from 100 to 400
  buckets; see the small example calculation below my signature).
  A "bucket" is a distinct counting value, e.g. the log-counters in
  (H)LL, or the bits in HyperBitBit. Most sketches use anywhere from a
  few bits to several bytes per bucket: HLL uses 5 and 6 bits per bucket
  for 32-bit and 64-bit hashes, respectively.

- If we implement sketches, it would be preferable for them to support
  the common set operations [join, intersect, imply] while retaining
  their properties, so that we don't lose the increased accuracy when
  combining them. HLL does not support intersect or imply operations
  directly, which makes it a bad choice as an estimator of the number of
  rows returned.

- Bitmaps would be a good basis for an initial sketch-based statistics
  implementation. Assuming the implementation compresses these bitmaps,
  the average size should not exceed 900 bytes (plus administrative
  overhead) per MCV entry: 3 bytes per sampled row for its index in the
  bitmap, times 300 rows on average per MCV entry. Rows would be
  identified by their index in the sampled list of rows. (A rough,
  illustrative sketch of this idea is attached below my signature.)

- It is not clear we need to be able to combine statistics from multiple
  runs of ANALYZE. We considered that it is rare for people to analyze
  only a subset of columns, or to otherwise need to combine statistics
  from distinct table samples of the same table.

- Accurately combining statistics from two different runs of ANALYZE
  requires stable sampling and lossless tuple identification. The
  above-mentioned bitmap-based statistics would not allow this, because
  the index of a sampled row will likely shift between runs, even when
  that row appears in both samples.

Summary: we shouldn't use HLL; (compressed) bitmaps will work fine for
an initial implementation of combined sketch-based MCV estimates.

Kind regards,

Matthias van de Meent

[0] https://wiki.postgresql.org/wiki/PgCon_2022_Developer_Unconference#Improving_Statistics_Accuracy
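
PS. For anyone who wants a feel for the 1/sqrt(m) relation above, here
is a small, illustrative C program (not PostgreSQL code; the 1.04
constant is the usual HLL approximation, and other sketch families have
different constants):

/*
 * Illustrative only: relative standard error (RSE) of an HLL-style
 * count-distinct sketch as a function of the number of buckets m,
 * assuming the common RSE ~= 1.04 / sqrt(m) approximation.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    int     buckets[] = {100, 400, 1000, 4096};

    for (int i = 0; i < 4; i++)
    {
        double  rse = 1.04 / sqrt((double) buckets[i]);

        printf("m = %4d buckets -> RSE ~= %.1f%%\n",
               buckets[i], rse * 100.0);
    }
    return 0;
}

Under these assumptions, going from 100 to 400 buckets halves the error,
matching the "doubling accuracy costs 4x the buckets" rule of thumb.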
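
PPS. And a minimal, hypothetical sketch of the bitmap intersection idea
(illustrative names only, not an actual patch): each MCV entry keeps a
bitmap over the sampled rows, and the fraction of sampled rows matching
two entries at once falls out of AND-ing the bitmaps and counting bits.
A real implementation would presumably store these compressed (e.g. as
lists of 3-byte row indexes, per the size estimate above) rather than as
raw bitmaps; this uses the GCC/Clang __builtin_popcountll builtin.

/*
 * Hypothetical sketch: one bitmap per MCV entry, bit i set when sampled
 * row i matched that entry.  Intersecting two bitmaps estimates how
 * often both entries hold at once within the sample (~30000 rows with
 * the default statistics target).
 */
#include <stdint.h>
#include <stdio.h>

#define SAMPLE_ROWS  30000
#define BITMAP_WORDS ((SAMPLE_ROWS + 63) / 64)

typedef struct McvBitmap
{
    uint64_t    words[BITMAP_WORDS];
} McvBitmap;

static void
mcv_bitmap_set(McvBitmap *bm, int row)
{
    bm->words[row / 64] |= UINT64_C(1) << (row % 64);
}

/* count sampled rows matching both MCV entries */
static int
mcv_bitmap_intersect_count(const McvBitmap *a, const McvBitmap *b)
{
    int     count = 0;

    for (int i = 0; i < BITMAP_WORDS; i++)
        count += __builtin_popcountll(a->words[i] & b->words[i]);
    return count;
}

int
main(void)
{
    McvBitmap   col_a_mcv = {{0}};
    McvBitmap   col_b_mcv = {{0}};

    /* toy data: rows 0..299 matched column a's MCV entry ... */
    for (int row = 0; row < 300; row++)
        mcv_bitmap_set(&col_a_mcv, row);
    /* ... and every other row in 0..599 matched column b's MCV entry */
    for (int row = 0; row < 600; row += 2)
        mcv_bitmap_set(&col_b_mcv, row);

    int     both = mcv_bitmap_intersect_count(&col_a_mcv, &col_b_mcv);

    printf("rows matching both: %d of %d sampled (selectivity %.4f)\n",
           both, SAMPLE_ROWS, (double) both / SAMPLE_ROWS);
    return 0;
}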