Greg Stark wrote:
Josh Berkus <josh@agliodbs.com> writes:
Using a variety of synthetic and real-world data sets, we show that
distinct sampling gives estimates for distinct values queries that
are within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.
Aha. It's a question of the level of error permissable. For our
estimates, being 100% off is actually OK. That's why I was looking at 5%
block sampling; it stays within the range of +/- 50% n-distinct in 95% of
cases.
Well, let's see. Say for example we're scanning 50,000 records out of 1M
records. Using the Theorem from Charikar et al quoted as Theorem 1 in section
2 we get that there is at least a 5% chance of getting a result with a ratio
error at least sqrt((1M-50k)/100k ln(20)) or 5.33.
So no, even assuming you have a good unbiased sample, a 5% sample is only
going to get you to a result within 19%-533% of the real values 95% of the
time. Not 50-250%. And that's a negative result, so it's *at best* 19-533%. On
some distributions it may be outside that range even more than 95% of the
time.
This is entirely unlike the histograms where we have a solid foundation for a
positive result. We can guarantee that the result will be outside +/- x% *at
most* 5% of the time. For most distributions it'll be outside that range even
less.
And a 5% sample is a pretty big. In fact my tests earlier showed the i/o from
5% block sampling took just as long as reading all the blocks. Even if we
figure out what's causing that (IMHO surprising) result and improve matters I
would only expect it to be 3-4x faster than a full scan.
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00285.php
I'm sorry to interrupt your esoteric (to me) discussion, but I have a
very simple question: would you define a "good unbiased sample"? My
statistics professor Dan Price (rest his soul) would tell me there are
only random samples of some sort, and "other", which are always biased,
and never good. Excuse my absolutes, I didn't mean Good or Evil. And
how do you calculate a level of error without randomization? And are you
talking of type I or type II error?
Michael
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend