date:20121229

[PERFORM] serious under-estimation of n_distinct for clustered distributions

2012-12-29 Thread Stefan Andreatta

I have encountered serious under-estimations of the number of distinct values when values are not evenly distributed but clustered within a column. I think this problem might be relevant to many real-world use cases, and I wonder if there is a good workaround or possibly a programmatic solution that could be

Re: [PERFORM] serious under-estimation of n_distinct for clustered distributions

2012-12-29 Thread Peter Geoghegan

On 29 December 2012 20:57, Stefan Andreatta wrote:
> Now, the 2005 discussion goes into great detail on the advantages and
> disadvantages of this algorithm, particularly when using small sample sizes,
> and several alternatives are discussed. I do not know whether anything has
> been changed afte