> -----Original Message----- > From: Gurmeet Manku [mailto:[EMAIL PROTECTED] > Sent: Tuesday, April 26, 2005 5:01 PM > To: Simon Riggs > Cc: Tom Lane; josh@agliodbs.com; Greg Stark; Marko Ristola; > pgsql-perform; pgsql-hackers@postgresql.org; Utkarsh Srivastava; > [EMAIL PROTECTED] > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > [...] > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct value > queries and event reports" by Gibbons, VLDB 2001. > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf > > [...]
This paper looks the most promising, and isn't too different from what I suggested about collecting stats over the whole table continuously. What Gibbons does is give a hard upper bound on the sample size by using a logarithmic technique for storing sample information. His technique appears to offer very good error bounds and confidence intervals as shown by tests on synthetic and real data. I think it deserves a hard look from people hacking the estimator. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129 ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend