Hi,
On 06/20/2015 03:01 AM, Jeff Janes wrote:
Hmmm, that's probably true. OTOH correlated columns are not all that
uncommon (e.g. a table storing time-series data etc.), and this blowup
is quite bad ...
True, but we don't know how big an issue the density-skew problem
might be (since the current algorithm isn't sensitive to it). It might
be just as big a problem. Tom mentioned some ancient history in the
above-mentioned thread that made me think the density skew was enough
of a problem to motivate the current system.
I don't think we really need to assume the density to be constant;
maybe we can verify that while collecting the sample? I mean, we're
already reading the pages, so we can track the density and either do
the correction or not.
Maybe. I don't know how that would work.
I was thinking about something like
(1) compute the average number of tuples per page
(2) generate a random sample of 'samplesize' blocks
(3) for each block, check the actual number of tuples on the page and
perform a correction (e.g. if there's only 1/2 the average number of
tuples, toss a coin and decide whether to really sample the block)
(4) repeat until you have a sample of the requested size
The first weak spot is that this requires knowledge of reltuples,
which is something we currently estimate based on the sample, so it
would have to use the old values, I guess.
Also, I'm not sure this works that well with blocks that have higher
density - you'd like to choose them with a higher probability, but if
you've already selected them there's no point in repeating that.
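Just to make the idea concrete, here is a rough Python sketch of steps
(1)-(4); the names are invented for illustration and it assumes the old
reltuples/relpages estimate is available up front, so it's not how the
analyze code is actually structured:

    import random

    def density_corrected_block_sample(nblocks, tuples_on_block, samplesize,
                                       avg_tuples_per_block, rng=random):
        # avg_tuples_per_block is step (1), e.g. old reltuples / relpages;
        # tuples_on_block(blk) returns the actual tuple count of a block,
        # which in practice means actually reading the page
        samplesize = min(samplesize, nblocks)
        sample = set()
        while len(sample) < samplesize:
            # step (2): pick a candidate block uniformly at random
            blk = rng.randrange(nblocks)
            if blk in sample:
                continue
            # step (3): toss a coin; a block with half the average density
            # is kept only half the time, while denser-than-average blocks
            # are simply always kept, which is the weak spot mentioned above
            accept = min(1.0, tuples_on_block(blk) / avg_tuples_per_block)
            if rng.random() < accept:
                sample.add(blk)
            # step (4): repeat until the requested number of blocks is kept
        return sample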
We would have to keep two samples, and dynamically decide which to
use. And what if the decision is that both density skew and value
clustering are problems?
I don't see why we would have to keep two samples? The idea is that
sampling the blocks truly randomly solves the clustering issue, and the
correction I mentioned above solves the density skew problem.
I wonder if the n_distinct could be tweaked so that it counted any
given value only once for each block it finds it in? So instead of
asking "how many times was this value sampled", ask "in how many
blocks was this value sampled at least once"?
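In code terms the suggested change is just different bookkeeping; here
is a toy Python sketch with invented names, nothing like the real
analyze code:

    from collections import defaultdict

    def per_block_value_counts(sampled_rows):
        # sampled_rows: iterable of (block_number, value) pairs from the
        # sample; for each value, count the number of distinct blocks it
        # was seen in rather than the number of sampled rows carrying it
        blocks_seen = defaultdict(set)
        for blk, value in sampled_rows:
            blocks_seen[value].add(blk)
        return {value: len(blks) for value, blks in blocks_seen.items()}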
I don't think that's a really good idea. Imagine a column with very
low cardinality, so that every block you sample has just a few
distinct values. That's going to get very nasty very soon, especially
when combined with estimators assuming a truly random sample (which is
exactly the issue with the current ndistinct estimator).
Also, doesn't Vitter make pretty much the same assumption implicitly?
Otherwise it couldn't skip some of the blocks.
Vitter samples an unstructured stream in a single pass, and is
unbiased. The explicit block sampling is not part of Vitter; it is
something we bolted on top of it.
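For reference, the core of the Vitter-style row sampling is just a
reservoir: one pass, no knowledge of the total row count needed,
uniform and unbiased. A simplified Algorithm R sketch (not the
skip-based variant actually used):

    import random

    def reservoir_sample(stream, k, rng=random):
        # keep a uniform sample of k items from a stream of unknown
        # length; each new item replaces a random reservoir slot with
        # probability k / (number of items seen so far)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

The block-level pre-selection discussed in this thread sits in front of
that, and is the part being questioned.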
Yeah, you're right of course. Sorry for not being quite precise. We're
using a variant of the S algorithm, based on the idea that we know the
number of blocks at the beginning - but that's pretty much the place
where we assume uniform density of rows per block, no? Because the S
algorithm skips blocks using that assumption implicitly.
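A minimal sketch of that Algorithm S style block selection (simplified;
the real block sampler differs in detail) shows where the assumption
hides: each block is accepted with a probability that depends only on
how many blocks remain and how many are still needed, never on how many
tuples it actually holds.

    import random

    def select_blocks(nblocks, n, rng=random):
        # Knuth's Algorithm S: choose n block numbers uniformly from the
        # known total of nblocks, in a single sequential pass
        selected = []
        for blk in range(nblocks):
            if len(selected) == n:
                break
            needed = n - len(selected)
            remaining = nblocks - blk
            # take this block with probability needed / remaining,
            # regardless of how densely packed it is
            if rng.random() * remaining < needed:
                selected.append(blk)
        return selected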
My solution was to just unbolt the block sampling from the top, and let
it sample the rows (still 300 * stats_target of them) from the whole
table rather than a random 300 * 100 blocks of the table. On tables of
the size I was worried about, block sampling was not very helpful
anyway. Reading 30,000 blocks out of 250,000 blocks led to no
meaningful IO advantage on my hardware. Any advantage from skipped
blocks was made up for (and sometimes exceeded) by fouling up the
read-ahead mechanisms.
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10,
1/100, 1/1000, 1/10000 of a table of your size, just from an IO
perspective? How much are we gaining by doing the block sample?
Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether ANALYZE
collects a 5GB or 8GB sample; the difference is rather minor compared
to the I/O generated by the other queries etc. The current sample is
already random enough not to work well with read-ahead, and it scans
only slightly fewer blocks.
And if the larger "random" sample results in better plans (especially
plans without OOM) ...
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services