On Fri, Jan 06, 2006 at 06:36:52PM -0500, Greg Stark wrote: > Josh Berkus <josh@agliodbs.com> writes: > > > > These numbers don't make much sense to me. It seems like 5% is about as > > > slow as reading the whole file which is even worse than I expected. I > > > thought I was being a bit pessimistic to think reading 5% would be as > > > slow as reading 20% of the table. > > > > It's about what *I* expected. Disk seeking is the bane of many access > > methods. > > Sure, but that bad? That means realistic random_page_cost values should be > something more like 20 rather than 4. And that's with seeks only going to > subsequent blocks in a single file, which one would expect to average less > than the half rotation that a random seek would average. That seems worse than > anyone expects. > > > Anyway, since the proof is in the pudding, Simon and I will be working on > > some demo code for different sampling methods so that we can debate > > results rather than theory. > > Note that if these numbers are realistic then there's no i/o benefit to any > sampling method that requires anything like 5% of the entire table and is > still unreliable. Instead it makes more sense to implement an algorithm that > requires a full table scan and can produce good results more reliably. > > -- > greg >
I have not looked closely at the sampling process in PostgreSQL, but given that getting an accurate answer is expensive in terms of read/seeks, would it be possible to iteratively refine the estimate overtime? Ideally we could use the same disk blocks that are already being read to satisfy a query. The initial estimate could be done with a minimal impact and then we piggy- back the I/O needed to refine the estimate on the normal query I/O. The obvious limit would be if the entire table were scanned we would get a 100% exact estimate of the distict value at that time, and otherwise a progressively more accurate approximation as queries are run. This would allow us to hide the I/O in other normal block accesses. Ken ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly