> The simple truth is
>
> 1) sampling-based estimators are a dead-end
While I don't want to discourage you from working on stream-based estimators ... I'd love to see you implement a proof-of-concept for PostgreSQL, and test it ... the above is a non-argument.  It requires us to accept that sample-based estimates cannot ever be made to work, simply because you say so.

The Charikar and Chaudhuri paper does not, in fact, say that it is impossible to improve sampling-based estimators, as you claim it does.  On the contrary, the authors offer several ways to improve them.  Further, 2000 was hardly the end of publication on sampling-based estimation; there are later papers with newer ideas.

For example, I still think we could tremendously improve our current sampling-based estimator without increasing I/O by moving to block-based estimation*.  The accuracy statistics for block-based samples of 5% of the table look quite good.  I would agree that it's impossible to get a decent estimate of n-distinct from a 1% sample, but there's a huge difference between 5% or 10% and "a majority of the table".

Again, don't let this discourage you from attempting to write a stream-based estimator.  But do realize that you'll need to *prove* its superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer public-access)]

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
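For readers who haven't seen the Charikar and Chaudhuri paper, here is a minimal sketch of one of the sampling-based n-distinct estimators it analyzes, the Guaranteed-Error Estimator (GEE).  This is my own illustrative Python, not code from the thread or from PostgreSQL; the function and parameter names are made up for the example:

```python
import math
from collections import Counter

def gee_estimate(sample, table_rows):
    """GEE n-distinct estimator (Charikar & Chaudhuri, 2000).

    sample:     values drawn uniformly at random from the column
    table_rows: total number of rows in the table
    """
    r = len(sample)
    value_counts = Counter(sample)          # value -> count within the sample
    f = Counter(value_counts.values())      # j -> number of values seen exactly j times
    # Values seen exactly once are scaled up by sqrt(n/r);
    # values seen two or more times each contribute 1.
    return math.sqrt(table_rows / r) * f[1] + sum(
        count for j, count in f.items() if j >= 2
    )
```

For example, a 4-row sample with all-distinct values from a 100-row table yields sqrt(100/4) * 4 = 20 estimated distinct values, which shows why small samples scale singletons so aggressively.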
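For contrast, a stream-based estimator of the general kind under discussion makes one pass over every row in bounded memory.  The thread doesn't name a specific algorithm, so as an illustration here is a K-Minimum-Values sketch (my choice; names and parameters are invented for the example):

```python
import hashlib
import heapq

def kmv_distinct(stream, k=64):
    """K-Minimum-Values distinct-count sketch: one pass, O(k) memory.

    Keeps the k smallest hash values seen.  If h() is uniform on [0, 1),
    the k-th smallest hash m estimates the distinct count as (k - 1) / m.
    """
    def h(value):
        # Hash to a float uniform on [0, 1).
        digest = hashlib.sha256(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    heap = []      # max-heap of the k smallest hashes (negated for heapq)
    in_heap = set()
    for value in stream:
        x = h(value)
        if x in in_heap:
            continue                      # duplicate of a tracked value
        if len(heap) < k:
            heapq.heappush(heap, -x)
            in_heap.add(x)
        elif x < -heap[0]:                # smaller than current k-th smallest
            evicted = -heapq.heappushpop(heap, -x)
            in_heap.discard(evicted)
            in_heap.add(x)
    if len(heap) < k:
        return len(heap)                  # fewer than k distinct values: exact
    return int((k - 1) / -heap[0])
```

A head-to-head test against the sampling estimators would, as noted above, have to weigh this kind of accuracy against the cost of scanning the whole table.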