Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2013-01-14 Thread Peter Geoghegan
On 14 January 2013 07:35, Stefan Andreatta wrote: > The source of these troubles is the sampling method employed in > src/backend/commands/analyze.c. Judging from Tom Lane's comment for the > original implementation in 2004 this has never been thought to be perfect. > Does anybody see a chance to

Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2013-01-13 Thread Stefan Andreatta
A status update on this problem: 1.) Workarounds (setting n_distinct manually) are tested and - as far as workarounds go - OK. 2.) Source of the problem and possible solution: The source of these troubles is the sampling method employed in src/backend/commands/analyze.c. Judging from Tom Lan

Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2013-01-03 Thread Stefan Andreatta
On 12/29/2012 10:57 PM, Peter Geoghegan wrote: On 29 December 2012 20:57, Stefan Andreatta wrote: ... The general advice here is: 1) Increase default_statistics_target for the column. I tried that, but to get good estimates under these circumstances, I need to set the statistics_target so

Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2012-12-30 Thread Claudio Freire
On Sat, Dec 29, 2012 at 5:57 PM, Stefan Andreatta wrote: > n*d / (n - f1 + f1*n/N) > > where f1 is the number of values that occurred only once in the sample. n is > the number of rows sampled, d the number of distincts found and N the total > number of rows in the table. > ... > > When the numbe

Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2012-12-30 Thread Stefan Andreatta
On 12/29/2012 10:57 PM, Peter Geoghegan wrote: On 29 December 2012 20:57, Stefan Andreatta wrote: Now, the 2005 discussion goes into great detail on the advantages and disadvantages of this algorithm, particularly when using small sample sizes, and several alternatives are discussed. I do not k

Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2012-12-29 Thread Peter Geoghegan
On 29 December 2012 20:57, Stefan Andreatta wrote: > Now, the 2005 discussion goes into great detail on the advantages and > disadvantages of this algorithm, particularly when using small sample sizes, > and several alternatives are discussed. I do not know whether anything has > been changed afte

[PERFORM] serious under-estimation of n_distinct for clustered distributions

2012-12-29 Thread Stefan Andreatta
I have encountered serious under-estimations of distinct values when values are not evenly distributed but clustered within a column. I think this problem might be relevant to many real-world use cases and I wonder if there is a good workaround or possibly a programmatic solution that could be