Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes: > I will be pleased if somebody (first of all Robert) can comment me > "strange" results of distinct values estimation.
Estimating the number of distinct values from a small sample is a hard problem; every algorithm is going to blow it in some cases. > In my case there are no null values: > totalrows = 48980014 > samplerows = 30000 > ndistinct = 29135 > nmultiple = 800 You seem to be using the default statistics target. Possibly raising that would give better answers for this column. > May be instead of sampling-based estimation use streaming > based estimation, for example HyperLogLog algorithm already present in > Postgres? Maybe, but I'd be really surprised if HLL were any sort of magic bullet. I think it's intended to make an ndistinct estimate with just a small amount of state preserved as rows are scanned, which is an admirable goal but not one that ANALYZE particularly needs. Adopting such a constraint when we don't need it does not sound like a recipe for getting a better final answer. regards, tom lane