Greg,

> Speak to the statisticians.  Our sample size is calculated using the same
> theory behind polls which sample 600 people to learn what 250 million
> people are going to do on election day.  You do NOT need (significantly)
> larger samples for larger populations.
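No argument on the election math itself: for a yes/no proportion the margin
of error is about z * sqrt(p*(1-p)/n), which depends on the sample size and
hardly at all on the population.  A quick back-of-envelope in Python (purely
illustrative, the helper function is just for this mail, nothing from our
code):

import math

def margin_of_error(n, population, p=0.5, z=1.96):
    # Standard error of a proportion, with the finite population correction.
    se = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * se * fpc

for pop in (250_000, 250_000_000):
    print(pop, round(margin_of_error(600, pop), 3))
# Roughly +/- 0.04 (4 points) either way; population size barely matters.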
Your analogy is bad.  For elections, the voters have only a few choices.
In a 300 million row table, there could be 300 million different values,
and the histogram gets less accurate for every order of magnitude by which
the sample falls short of that.

> Also, our estimates for n_distinct are very unreliable.  The math behind
> sampling for statistics just doesn't work the same way for properties
> like n_distinct.  For that Josh is right, we *would* need a sample size
> proportional to the whole data set which would practically require us to
> scan the whole table (and have a technique for summarizing the results
> in a nearly constant sized data structure).

Actually, a number of papers have shown block-based algorithms which can
arrive at reasonably confident estimates (between 50% and 250% of the true
value) by scanning only 5% of *blocks*.  Simon did some work on this a
couple of years ago, but he and I had difficulty convincing -hackers that
a genuine problem existed.  (There's a toy sketch of the idea at the end
of this mail.)

You're correct that we'd need to change pg_statistic, though.  For one
thing, we need to separate the sample size from the histogram size.

Also, we seem to be getting pretty far away from the original GUC
discussion.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
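P.S. The promised toy sketch of block-based n_distinct estimation, in
Python rather than anything resembling Simon's actual patch: grab ~5% of
pages, read every row on each sampled page (those rows are nearly free once
you've paid for the page I/O), then feed the frequency counts to a GEE-style
estimator from the distinct-values literature.  The function name, the 5%
fraction and the estimator choice are mine for illustration, and I'm
ignoring the clustering corrections the real papers apply to block-level
samples.

import math
import random
from collections import Counter

def estimate_ndistinct(pages, total_rows, block_fraction=0.05, seed=0):
    """pages: one list of column values per page; total_rows: table rowcount."""
    rng = random.Random(seed)
    nblocks = max(1, int(len(pages) * block_fraction))
    freq = Counter()                      # value -> occurrences in the sample
    rows_sampled = 0
    for p in rng.sample(range(len(pages)), nblocks):
        for value in pages[p]:            # every row on a sampled page
            freq[value] += 1
            rows_sampled += 1

    f = Counter(freq.values())            # f[j] = values seen exactly j times
    # GEE-style: scale up the values seen exactly once by sqrt(N/n),
    # keep everything seen twice or more as-is.
    return (math.sqrt(total_rows / rows_sampled) * f.get(1, 0)
            + sum(count for j, count in f.items() if j >= 2))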