On 08/12/13 09:25, Josh Berkus wrote:
On 12/07/2013 11:46 AM, Robert Haas wrote:
Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not.  So what I
think we should do is auto-tune the statistics target based on the
table size.  If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
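For concreteness, here is a minimal sketch (plain C, not a patch against anything in the tree) of what such a size-based default could look like. The 10k/100M row cut-offs and the log-scale interpolation are assumptions picked purely for illustration:

#include <math.h>
#include <stdio.h>

/*
 * Hypothetical auto-tuned statistics target: interpolate on a log scale
 * between 10 (for tables up to ~10k rows) and 400 (for tables of 100M
 * rows or more).  The bounds and the log scaling are illustrative
 * assumptions, not something that has been benchmarked.
 */
static int
auto_stats_target(double reltuples)
{
    const double min_rows = 1e4, max_rows = 1e8;
    const double min_target = 10.0, max_target = 400.0;
    double  frac;

    if (reltuples <= min_rows)
        return (int) min_target;
    if (reltuples >= max_rows)
        return (int) max_target;

    /* position of this table size on a log scale between the bounds */
    frac = (log(reltuples) - log(min_rows)) / (log(max_rows) - log(min_rows));
    return (int) (min_target + frac * (max_target - min_target));
}

int
main(void)
{
    double  sizes[] = {1e3, 1e5, 1e6, 1e7, 1e9};

    for (int i = 0; i < 5; i++)
        printf("%.0f rows -> target %d\n", sizes[i], auto_stats_target(sizes[i]));
    return 0;
}

With those assumptions a 1M-row table lands at a target of about 205, i.e. roughly half way between the two extremes on a log scale.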

The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables.  We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algorithm seems designed to maximize I/O
for the sample size it gets.
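To put rough numbers on that: if a page holds 50 rows, taking one row per visited page yields only a 2% sample of the rows on the pages it touches, so collecting a 5% row sample that way means reading essentially every page. A block-based sampler instead reads a 5% subset of the pages and keeps every row on them. A toy sketch of that idea (the table geometry is made up, and this is not the ANALYZE code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS      1000    /* total pages in the simulated table */
#define ROWS_PER_BLK 50      /* rows per page, assumed fixed for simplicity */

/*
 * Block-based sampling sketch: choose a 5% subset of pages at random and
 * keep every row on the chosen pages.  Only ~5% of the pages get read,
 * unlike a one-row-per-page scheme, which has to read every page to end
 * up with the same number of sampled rows.
 */
int
main(void)
{
    bool    chosen[NBLOCKS] = {false};
    int     want_blocks = NBLOCKS / 20;     /* 5% of the pages */
    int     picked = 0, sampled_rows = 0;

    srand(42);

    /* pick distinct random block numbers until we have enough */
    while (picked < want_blocks)
    {
        int     blk = rand() % NBLOCKS;

        if (!chosen[blk])
        {
            chosen[blk] = true;
            picked++;
        }
    }

    /* "read" only the chosen blocks and take all of their rows */
    for (int blk = 0; blk < NBLOCKS; blk++)
        if (chosen[blk])
            sampled_rows += ROWS_PER_BLK;

    printf("read %d of %d pages, sampled %d of %d rows (%.1f%%)\n",
           picked, NBLOCKS, sampled_rows, NBLOCKS * ROWS_PER_BLK,
           100.0 * sampled_rows / (NBLOCKS * ROWS_PER_BLK));
    return 0;
}

The obvious trade-off is that rows within a page are correlated, so a block sample needs estimators that correct for that clustering before it can stand in for a true row-level sample.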

There are other qualitative improvements we could make, which Nathan Boley
has spoken on.  For example, our stats code has no way to recognize a
normal or exponential distribution -- it assumes that all columns are
randomly distributed.  If we could recognize common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need to describe a
normal distribution are the mean and the variance.
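As a toy illustration of that last point (this is not something the current stats code does, and the sample data and cutoff are invented): for a column that really is normally distributed, the sample mean and variance alone are enough to estimate a range selectivity through the normal CDF, with no MCV list or histogram at all.

#include <math.h>
#include <stdio.h>

/*
 * If a column is known (or detected) to be normally distributed, two
 * numbers -- mean and variance -- answer "what fraction of rows satisfy
 * col > c?" via the normal CDF.  The sample values below are synthetic.
 */

/* P(X > c) for X ~ Normal(mean, stddev^2) */
static double
normal_gt_selectivity(double mean, double stddev, double c)
{
    return 0.5 * erfc((c - mean) / (stddev * sqrt(2.0)));
}

int
main(void)
{
    double  sample[] = {9.8, 10.1, 10.4, 9.5, 10.0, 10.3, 9.7, 10.2};
    int     n = sizeof(sample) / sizeof(sample[0]);
    double  sum = 0.0, sumsq = 0.0, mean, var;

    for (int i = 0; i < n; i++)
        sum += sample[i];
    mean = sum / n;

    for (int i = 0; i < n; i++)
        sumsq += (sample[i] - mean) * (sample[i] - mean);
    var = sumsq / (n - 1);              /* sample variance */

    printf("mean=%.3f stddev=%.3f\n", mean, sqrt(var));
    printf("estimated selectivity of col > 10.25: %.3f\n",
           normal_gt_selectivity(mean, sqrt(var), 10.25));
    return 0;
}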


From src/backend/commands/analyze.c

 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.

I don't think we always read every block (it's been a while since I looked at this code, so I'll recheck). From what I understand, this stuff is based on reasonable research (the Vitter algorithm). That's not to say it couldn't or shouldn't be looked at again and improved - but it was not just dreamed up with no valid research!
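For anyone who hasn't read that code, here is a much-simplified sketch of the two-stage shape the comment describes: stage one picks random block numbers, stage two keeps a fixed-size reservoir over the rows read from those blocks. It uses plain Algorithm R for the reservoir rather than Vitter's optimized skip-based variant, ignores duplicate block picks, and invents the table geometry, so treat it as an illustration only, not what analyze.c actually does:

#include <stdio.h>
#include <stdlib.h>

#define TARGROWS     100     /* size of the row sample (the reservoir) */
#define NBLOCKS      500     /* pages in the simulated table */
#define ROWS_PER_BLK 40      /* rows per page, fixed here for simplicity */

/*
 * Two-stage sampling sketch: stage one chooses random blocks, stage two
 * maintains a TARGROWS-sized reservoir over the rows of those blocks.
 * The replace step here is plain Algorithm R; Vitter's variant instead
 * computes how many rows to skip, avoiding a random number per row.
 */
int
main(void)
{
    int     reservoir[TARGROWS];
    int     rows_seen = 0;

    srand(42);

    /* stage one: visit TARGROWS randomly chosen blocks (dups possible here) */
    for (int b = 0; b < TARGROWS; b++)
    {
        int     blk = rand() % NBLOCKS;

        /* stage two: feed every row of the block to the reservoir */
        for (int r = 0; r < ROWS_PER_BLK; r++)
        {
            int     rowid = blk * ROWS_PER_BLK + r;

            if (rows_seen < TARGROWS)
                reservoir[rows_seen] = rowid;           /* fill phase */
            else
            {
                int     j = rand() % (rows_seen + 1);   /* replacement phase */

                if (j < TARGROWS)
                    reservoir[j] = rowid;
            }
            rows_seen++;
        }
    }

    printf("scanned %d rows, kept a sample of %d\n", rows_seen, TARGROWS);
    return 0;
}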

Cheers

Mark

