On 09/12/13 08:03, Josh Berkus wrote:
> So there's a set of math designed to correct for the skew introduced by reading *all* of the rows in each block. That's what I meant by "block-based sampling"; you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, and you apply some algorithms to adjust for groupings of rows based on how grouped they are. And you have a pretty good estimate of how grouped they are, because you just looked at complete sets of rows on a bunch of nonadjacent pages. Obviously, you need to look at more rows than you would with a pure-random sample. Like I said, the 80%+ accuracy point in the papers seemed to be at a 5% sample. However, since those rows come from the same pages, the cost of looking at more rows is quite small, compared to the cost of looking at 64 times as many disk pages. My ACM subscription has lapsed, though; someone with a current ACM subscription could search for this. There are several published papers, with math and pseudocode.
This makes more sense! Unfortunately you had confused everyone (well, me at least) by decreeing that we needed block-based sampling, when we started using that back in 2004. There is more than one way to sample blocks, it seems :-)
What you are actually suggesting is a way to do block-based sampling that requires reading fewer blocks, and that is a great idea! Of course, as Greg has just suggested, the details are going to be the clincher. Can you supply authors or titles of the papers?
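Just to check my understanding, here is a rough sketch (Python, purely illustrative; the names and the clustering correction are placeholders I made up, not taken from any of the papers) of the scheme as I read it: pick a small random set of blocks, take every row on each chosen block, then adjust the naive scale-up of the distinct-value count by how clustered the values are within blocks:

    import random
    from collections import Counter

    def block_sample_stats(read_block, n_total_blocks, n_sample_blocks=400, mcv_size=10):
        # read_block(blkno) should return the column values for every row on that page.
        # Everything past the block reads is a placeholder, not the real math from the papers.
        chosen = random.sample(range(n_total_blocks), min(n_sample_blocks, n_total_blocks))

        values = []
        distinct_per_block = []
        for blkno in chosen:
            rows = read_block(blkno)              # all rows on the page, not one row per page
            values.extend(rows)
            distinct_per_block.append(len(set(rows)))

        freq = Counter(values)
        sample_distinct = len(freq)

        # Crude clustering measure: if each block holds few distinct values relative
        # to its row count, the column is physically clustered and a naive scale-up
        # of sample_distinct would overestimate ndistinct.
        rows_per_block = len(values) / len(chosen)
        avg_block_distinct = sum(distinct_per_block) / len(chosen)
        clustering = avg_block_distinct / rows_per_block      # ~1.0 means no clustering

        scale = n_total_blocks / len(chosen)
        ndistinct_est = sample_distinct * (1 + (scale - 1) * clustering)  # placeholder formula

        return ndistinct_est, freq.most_common(mcv_size)

The interesting (and hard) part is obviously getting that correction step right, which is why the references matter.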
Also, with reference to Greenplum: their use of Postgres is a fairly special case, and an approach that works well for them is not necessarily suitable for more general-purpose use.
Regards

Mark