David Fetter <[EMAIL PROTECTED]> writes: > > In the prior discussions someone posted the paper with the algorithm > > I mentioned. That paper mentions that previous work showed poor > > results at estimating n_distinct even with sample sizes as large as > > 50% or more. > > Which paper? People have referenced several different ones.
Phillip B Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In Proceedings of the 27th VLDB Conference, Roma, Italy, 2001. Which says (emphasis in original as italics): The most well-studied approach for distinct-values estimation is to collect a uniform random sample S of the data, store S in a database, and then use S at query time to provide fast, approximate answers to distinct values queries [22, 23, 27, 21, 29, 5, 28, 18, 19, 9, 7]. However, previous work [28, 18, 9, 7] has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data), even for the simple case of counting the number of distinct values in a column. The strongest negative result is due to Charikar et al. [7], who proved that estimating the number of distinct values in a column to within a small constant factor (with probability > 1/2) requires that *nearly* *the* *entire* *data* *set* *be* *sampled*. Moreover, all known sampling-based estimators provide unsatisfactory results on data sets of interest [7], even for this simple case. And later: Using a variety of synthetic and real-world data sets, we show that distinct sampling gives estimates for distinct values queries that are within 0%-10%, whereas previous methods were typically 50%-250% off, across the spectrum of data sets and queries studied. Here "distinct sampling" is the algorithm being proposed which requires looking at every record and keeping a sample *of the distinct values*. The "previous methods" are methods based on sampling the records I haven't read the citation [7] that proves the strong negative result for any estimator of distinct values based on sampling. It's: M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proc. 19th ACM Symp. on Principles of Database Systems, pages 268?279, May 2000. > > Hopefully you're right that you don't need complete histograms. > > Perhaps there's some statistics concept they don't teach in stats > > 101 that would cover this need. What we're looking for is a > > function f(a,b,x) that gives the net selectivity given a and b, the > > selectivity of two clauses based on two columns, and x some simple > > value that can be easily calculated by looking at the data in > > advance. > > That would be neat :) Doing a bit of basic searching around I think the tool we're looking for here is called a "chi-squared test for independence". I haven't read up on how it works so I'm unclear if i it be calculated using a simple O(n) scan or if it would require some non-linear post-processing after the analyze pass which would be unfortunate. And I haven't found anything that describes how to make use of the resulting number. Does it actually give a formula f(a,b,x) that gives a nice convenient expected selectivity for the clause? Oh, and incidentally, at first glance it seems like calculating it depends on having good distinct value sampling. -- greg ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend