On 09/12/13 08:03, Josh Berkus wrote:
> So there's a set of math designed to correct for the skew introduced by reading *all* of the rows in each block. That's what I meant by "block-based sampling"; you read, say, 400 pages, you compile statistics on *all* of the rows on those pages, and you apply some algorithms to adjust for groupings of rows based on how grouped they are. And you have a pretty good estimate of how grouped they are, because you just looked at complete sets of rows on a bunch of nonadjacent pages. Obviously, you need to look at more rows than you would with a pure-random sample. Like I said, the 80%+ accuracy point in the papers seemed to be at a 5% sample. However, since those rows come from the same pages, the cost of looking at more rows is quite small, compared to the cost of looking at 64 times as many disk pages. My ACM subscription has lapsed, though; someone with a current ACM subscription could search for this. There are several published papers, with math and pseudocode.
This makes more sense! Unfortunately you had confused everyone (well, me at least) by decreeing that we needed block-based sampling, when we started using that back in 2004. There is more than one way to sample blocks, it seems :-)
What you are actually suggesting is a way to do block-based sampling that requires reading fewer blocks, and that is a great idea! Of course, as Greg has just suggested, the details are going to be the clincher. Can you supply authors or titles of the papers?
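Just to check my understanding, here is a rough sketch (Python, purely illustrative; the names and the clustering correction are placeholders I made up, not taken from any of the papers) of the scheme as I read it: pick a small random set of blocks, take every row on each chosen block, then adjust the naive scale-up of the distinct-value count by how clustered the values are within blocks:

    import random
    from collections import Counter

    def block_sample_stats(read_block, n_total_blocks, n_sample_blocks=400, mcv_size=10):
        # read_block(blkno) should return the column values for every row on that page.
        # Everything past the block reads is a placeholder, not the real math from the papers.
        chosen = random.sample(range(n_total_blocks), min(n_sample_blocks, n_total_blocks))

        values = []
        distinct_per_block = []
        for blkno in chosen:
            rows = read_block(blkno)              # all rows on the page, not one row per page
            values.extend(rows)
            distinct_per_block.append(len(set(rows)))

        freq = Counter(values)
        sample_distinct = len(freq)

        # Crude clustering measure: if each block holds few distinct values relative
        # to its row count, the column is physically clustered and a naive scale-up
        # of sample_distinct would overestimate ndistinct.
        rows_per_block = len(values) / len(chosen)
        avg_block_distinct = sum(distinct_per_block) / len(chosen)
        clustering = avg_block_distinct / rows_per_block      # ~1.0 means no clustering

        scale = n_total_blocks / len(chosen)
        ndistinct_est = sample_distinct * (1 + (scale - 1) * clustering)  # placeholder formula

        return ndistinct_est, freq.most_common(mcv_size)

The interesting (and hard) part is obviously getting that correction step right, which is why the references matter.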
Also, with reference to Greenplum: their use of Postgres is a fairly special case, and an approach that works well for them is not necessarily suitable for more general-purpose use.
Regards

Mark