Greg,

> The analogous case in our situation is not having 300 million distinct
> values, since we're not gathering info on specific values, only the
> buckets. We need, for example, 600 samples *for each bucket*. Each bucket
> is chosen to have the same number of samples in it. So that means that we
> always need the same number of samples for a given number of buckets.
I think that's plausible.  The issue is that, in advance of the sampling, we
don't know how many buckets there *are*.  So we first need a proportional
sample to determine the number of buckets, and then we need to retain a
histogram sample proportional to the number of buckets.  I'd like to see
someone with a PhD in this weigh in, though.

> Really? Could you send references? The paper I read surveyed previous work
> and found that you needed to scan up to 50% of the table to get good
> results. 50-250% is considerably looser than what I recall it considering
> "good" results so these aren't entirely inconsistent but I thought previous
> results were much worse than that.

Actually, based on my several years selling performance tuning, I found that
as long as estimates were correct within a factor of 3 (33% to 300%), the
correct plan was generally chosen.  There are papers on block-based sampling
which were already cited on -hackers; I'll hunt through the archives later.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
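P.S. For concreteness, here is a rough Python sketch of the two-pass idea
above.  This is not ANALYZE's actual code: the 1% pilot fraction, the
100-bucket cap, and the use of the pilot's distinct count to pick the bucket
count are all illustrative assumptions; only the 600-rows-per-bucket figure
comes from the text quoted above.

import random

ROWS_PER_BUCKET = 600   # figure quoted above
PILOT_FRACTION = 0.01   # assumption: small proportional pilot sample
MAX_BUCKETS = 100       # assumption: cap on histogram size

def build_histogram(rows):
    # Pass 1: proportional pilot sample to decide how many buckets we need.
    pilot_size = max(1, int(len(rows) * PILOT_FRACTION))
    pilot = random.sample(rows, pilot_size)
    n_buckets = min(len(set(pilot)), MAX_BUCKETS)  # stand-in heuristic

    # Pass 2: sample size driven by the bucket count, not the table size.
    target = min(n_buckets * ROWS_PER_BUCKET, len(rows))
    sample = sorted(random.sample(rows, target))

    # Equi-depth boundaries: the last value of every len(sample)//n_buckets rows.
    step = max(1, len(sample) // n_buckets)
    return sample[step - 1::step][:n_buckets]

# Example: a skewed integer column with roughly 50 distinct values.
column = [random.randint(1, 50) ** 2 for _ in range(100_000)]
print(build_histogram(column))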