Greg,

> We *currently* use a block based sampling algorithm that addresses this
> issue by taking care to select rows within the selected blocks in an
> unbiased way. You were proposing reading *all* the records from the
> selected blocks, which throws away that feature.

The block-based algorithms have specific math to cope with this, math that
is better grounded in statistical analysis than our current code. Please
read the papers before you judge the solution.
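To make the distinction concrete, here is a rough sketch of the two-stage
idea (Python pseudocode for illustration only; the function name, the table
representation, and the sample sizes are invented for the example, this is
not the actual backend code): pick a fraction of the blocks at random, then
reservoir-sample rows out of the chosen blocks so that every row in a chosen
block has the same chance of landing in the sample, rather than keeping
every row from every chosen block.

    import random

    def block_sample(table, block_fraction=0.05, row_target=1000):
        """Two-stage sample: random blocks, then unbiased rows within them.

        'table' is a list of blocks, each block a list of rows -- purely a
        stand-in for illustration, not how the backend stores heap pages.
        """
        if not table:
            return []

        # Stage 1: choose ~block_fraction of the blocks uniformly at random.
        n_chosen = max(1, int(len(table) * block_fraction))
        chosen_blocks = random.sample(range(len(table)), n_chosen)

        # Stage 2: reservoir-sample row_target rows from the chosen blocks,
        # so each row in a chosen block is equally likely to appear in the
        # result no matter how many rows its block happens to hold.
        reservoir, seen = [], 0
        for b in chosen_blocks:
            for row in table[b]:
                seen += 1
                if len(reservoir) < row_target:
                    reservoir.append(row)
                else:
                    j = random.randrange(seen)
                    if j < row_target:
                        reservoir[j] = row
        return reservoir

The estimator math in the papers sits on top of a sample like this; the
sampling loop itself is the easy part.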
> Worse, my recollection from the paper I mentioned earlier was that
> sampling small percentages like 3-5% didn't get you an acceptable
> accuracy. Before you got anything reliable you found you were sampling
> very large percentages of the table. And note that if you have to sample
> anything over 10-20% you may as well just read the whole table. Random
> access reads are that much slower.

Right, which is why researchers are currently looking for something better.
The Brutlag & Richardson paper claims to be able to produce estimates that
are within +/-3x of the true value 90% of the time using a 5% sample, which
is far better than our current accuracy. Nobody claims to be able to
estimate based on < 0.1% of the table, which is what Postgres tries to do
for large tables.

5% based on block-based sampling is reasonable; that means a straight 5% of
the on-disk size of the table, so 5 GB for a 100 GB table. With random-row
sampling, that would require reading as much as 25% of the table (each
sampled row tends to pull in a whole block), making it easier to just scan
the whole thing.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco