Hi,
On 06/20/2015 03:01 AM, Jeff Janes wrote:
Hmmm, that's probably true. OTOH correlated columns are not all that
uncommon (e.g. a table storing time-series data etc.), and this blowup
is quite bad ...
True, but we don't know how big an issue the density-skew problem
might be (since the current algorithm isn't sensitive to it). It might
be just as big a problem. Tom mentioned some ancient history in the
above-mentioned thread that made me think the density skew was enough
of a problem to motivate the current system.
I don't think we really need to assume the density to be constant;
maybe we can verify that while collecting the sample? I mean, we're
already reading the pages, so we can track the density and either do
the correction or not.
Maybe. I don't know how that would work.
I was thinking about something like
(1) compute the average number of tuples per page
(2) generate a random sample of 'samplesize' blocks
(3) for each block, check the actual number of tuples on the page and
perform a correction (e.g. if there's only 1/2 the average number of
tuples, toss a coin and decide whether to really sample the block)
(4) repeat until you have a sample of the requested size
The first weak spot is that this requires knowledge of reltuples,
which is something we currently estimate based on the sample, so it
would have to use the old values, I guess.
Also, I'm not sure this works that well with blocks that have higher
density - you'd like to choose them with a higher probability, but if
you've already selected them there's no point in repeating that.
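Just to make the idea concrete, here is a rough Python sketch of steps
(1)-(4); the names are invented for illustration and it assumes the old
reltuples/relpages estimate is available up front, so it's not how the
analyze code is actually structured:

    import random

    def density_corrected_block_sample(nblocks, tuples_on_block, samplesize,
                                       avg_tuples_per_block, rng=random):
        # avg_tuples_per_block is step (1), e.g. old reltuples / relpages;
        # tuples_on_block(blk) returns the actual tuple count of a block,
        # which in practice means actually reading the page
        samplesize = min(samplesize, nblocks)
        sample = set()
        while len(sample) < samplesize:
            # step (2): pick a candidate block uniformly at random
            blk = rng.randrange(nblocks)
            if blk in sample:
                continue
            # step (3): toss a coin; a block with half the average density
            # is kept only half the time, while denser-than-average blocks
            # are simply always kept, which is the weak spot mentioned above
            accept = min(1.0, tuples_on_block(blk) / avg_tuples_per_block)
            if rng.random() < accept:
                sample.add(blk)
            # step (4): repeat until the requested number of blocks is kept
        return sample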
We would have to keep two samples, and dynamically decide which to
use. And what if the decision is that both density skew and value
clustering are problems?
I don't see why we would have to keep two samples? The idea is that
sampling the blocks truly randomly solves the clustering issue, and the
correction I mentioned above solves the density skew problem.
I wonder if the n_distinct could be tweaked so that it counted any
given value only once for each block it finds it in? So instead of
asking "how many times was this value sampled", ask "in how many
blocks was this value sampled at least once"?
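In code terms the suggested change is just different bookkeeping; here
is a toy Python sketch with invented names, nothing like the real
analyze code:

    from collections import defaultdict

    def per_block_value_counts(sampled_rows):
        # sampled_rows: iterable of (block_number, value) pairs from the
        # sample; for each value, count the number of distinct blocks it
        # was seen in rather than the number of sampled rows carrying it
        blocks_seen = defaultdict(set)
        for blk, value in sampled_rows:
            blocks_seen[value].add(blk)
        return {value: len(blks) for value, blks in blocks_seen.items()}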
I don't think that's a really good idea. Imagine a column with very
low cardinality, so that every block you sample has just a few
distinct values. That's going to get very nasty very soon, especially
when combined with estimators assuming a truly random sample (which is
exactly the issue with the current ndistinct estimator).
Also, doesn't Vitter make pretty much the same assumption implicitly?
Otherwise it couldn't skip some of the blocks.
Vitter samples an unstructured stream in a single pass, and is
unbiased. The explicit block sampling is not part of Vitter; it is
something we bolted on top of it.
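For reference, the core of the Vitter-style row sampling is just a
reservoir: one pass, no knowledge of the total row count needed,
uniform and unbiased. A simplified Algorithm R sketch (not the
skip-based variant actually used):

    import random

    def reservoir_sample(stream, k, rng=random):
        # keep a uniform sample of k items from a stream of unknown
        # length; each new item replaces a random reservoir slot with
        # probability k / (number of items seen so far)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir

The block-level pre-selection discussed in this thread sits in front of
that, and is the part being questioned.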
Yeah, you're right of course. Sorry for not being quite precise. We're
using a variant of the S algorithm, based on the idea that we know the
number of blocks at the beginning - but that's pretty much the place
where we assume uniform density of rows per block, no? Because the S
algorithm skips blocks using that assumption implicitly.
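A minimal sketch of that Algorithm S style block selection (simplified;
the real block sampler differs in detail) shows where the assumption
hides: each block is accepted with a probability that depends only on
how many blocks remain and how many are still needed, never on how many
tuples it actually holds.

    import random

    def select_blocks(nblocks, n, rng=random):
        # Knuth's Algorithm S: choose n block numbers uniformly from the
        # known total of nblocks, in a single sequential pass
        selected = []
        for blk in range(nblocks):
            if len(selected) == n:
                break
            needed = n - len(selected)
            remaining = nblocks - blk
            # take this block with probability needed / remaining,
            # regardless of how densely packed it is
            if rng.random() * remaining < needed:
                selected.append(blk)
        return selected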
My solution was to just unbolt the block sampling from the top, and let
it sample the rows (still 300 * stats_target of them) from the whole
table rather than a random 300 * 100 blocks of the table. On tables of
the size I was worried about, block sampling was not very helpful
anyway. Reading 30,000 blocks out of 250,000 blocks led to no
meaningful IO advantage on my hardware. Any advantage from skipped
blocks was made up for (and sometimes exceeded) by fouling up the
read-ahead mechanisms.
With 1000 times more blocks, that probably won't work for you.
But I do wonder, how much time does it take to read a random 1/10,
1/100, 1/1000, 1/10000 of a table of your size, just from an IO
perspective? How much are we gaining by doing the block sample?
Well, actually I think it would be even more appropriate for very large
tables. With a 2.5TB table, you don't really care whether ANALYZE
collects a 5GB or 8GB sample; the difference is rather minor compared
to the I/O generated by the other queries etc. The current sample is
already random enough not to work well with read-ahead, and it scans
only slightly fewer blocks.
And if the larger "random" sample results in better plans (especially
plans without OOM) ...
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services