Re: n_distinct off by a factor of 1000

2020-06-28 Thread Peter J. Holzer
On 2020-06-24 16:27:35 -0600, Michael Lewis wrote: > On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer wrote: > > Yes, estimating the number of distinct values from a relatively small > sample is hard when you don't know the underlying distribution. It might > be possible to analyze the s

Re: n_distinct off by a factor of 1000

2020-06-27 Thread Klaudie Willis
> I am not sure whether I am misinterpreting something, or if it is a bug > (probably former) however, when using partitions, are not > n_distinct_inherited supposed to propagate to the child partitions? It does > not seem to do so. (Yes, I run Analyze after setting the variable) I had to > set

Re: n_distinct off by a factor of 1000

2020-06-27 Thread Klaudie Willis
On the topic of n_distinct. I am not sure whether I am misinterpreting something, or if it is a bug (probably former) however, when using partitions, are not n_distinct_inherited supposed to propagate to the child partitions? It does not seem to do so. (Yes, I run Analyze after setting the vari

Re: n_distinct off by a factor of 1000

2020-06-25 Thread Michael Lewis
On Thu, Jun 25, 2020 at 7:27 AM Pavel Luzanov wrote: > I have tried to increase the statistics target to 5000, and it helps, but > it reduces the error to 100X. Still crazy high. > > > As far as I know, increasing default_statistics_target will not help. [1] > > I have considered these fixes: >

Re: n_distinct off by a factor of 1000

2020-06-25 Thread Pavel Luzanov
Hello, I got my first hint of why this problem occurs when I looked at the statistics.  For the column in question, "instrument_ref" the statistics claimed it to be: The default_statistics_target=500, and analyze has been run. select * from pg_stats where attname like 'instr%_ref'; -- Result

Re: n_distinct off by a factor of 1000

2020-06-25 Thread Klaudie Willis
> If we could increase the sampling ratio beyond the hard coded 300x to get a > more representative sample and use that to estimate ndistinct (and also the > frequency of the most common values) but only actually stored the 100 MCVs > (or whatever the stats target is set to for the system or col

Re: n_distinct off by a factor of 1000

2020-06-24 Thread Michael Lewis
On Wed, Jun 24, 2020, 2:35 PM Peter J. Holzer wrote: > Yes, estimating the number of distinct values from a relatively small > sample is hard when you don't know the underlying distribution. It might > be possible to analyze the sample to find the distribution and get a > better estimate. But I'm

Re: n_distinct off by a factor of 1000

2020-06-24 Thread Peter J. Holzer
[Please keep replies on the list] On 2020-06-24 11:02:22 +, Klaudie Willis wrote: > Holzer, thanks for your feedback. Yes, your guess is very good. The > data consists of millions of instruments that occur a handful of cases > (rare), and instruments that are very common. > > I am still a

Re: n_distinct off by a factor of 1000

2020-06-24 Thread Peter J. Holzer
On 2020-06-24 07:30:05 +, Klaudie Willis wrote: > show default_statistics_target; --> 500 > ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000; > > Here is the output of the "ANALYZE VERBOSE bigtable;" > INFO: analyzing "public.bigtables" inheritance tree [...] > IN

Re: n_distinct off by a factor of 1000

2020-06-24 Thread Klaudie Willis
show default_statistics_target; --> 500 ALTER TABLE public.bigtable ALTER COLUMN instrumentid_ref SET STATISTICS 5000; Here is the output of the "ANALYZE VERBOSE bigtable;" INFO: analyzing "public.bigtables" inheritance tree INFO: "bigtable_y2018": scanned 622250 of 10661013 pages, containing 1199

Re: n_distinct off by a factor of 1000

2020-06-23 Thread Michael Lewis
> > > > On 23/06/2020 14:42, Klaudie Willis wrote: > > > > > > > I got my first hint of why this problem occurs when I looked at the > > > > statistics. For the column in question, "instrument_ref" the > > > > statistics claimed it to be: > > > > The default_statistics_target=500, and analyze has

Re: n_distinct off by a factor of 1000

2020-06-23 Thread Klaudie Willis
Adrian, you are correct. My mistanke. K ‐‐‐ Original Message ‐‐‐ On Tuesday, June 23, 2020 4:14 PM, Adrian Klaver wrote: > On 6/23/20 7:05 AM, Fabio Pardi wrote: > > > On 23/06/2020 14:42, Klaudie Willis wrote: > > > > > I got my first hint of why this problem occurs when I looked at

Re: n_distinct off by a factor of 1000

2020-06-23 Thread Adrian Klaver
On 6/23/20 7:05 AM, Fabio Pardi wrote: On 23/06/2020 14:42, Klaudie Willis wrote: I got my first hint of why this problem occurs when I looked at the statistics.  For the column in question, "instrument_ref" the statistics claimed it to be: The default_statistics_target=500, and analyze has

Re: n_distinct off by a factor of 1000

2020-06-23 Thread Fabio Pardi
On 23/06/2020 14:42, Klaudie Willis wrote: > I got my first hint of why this problem occurs when I looked at the > statistics.  For the column in question, "instrument_ref" the statistics > claimed it to be: > > The default_statistics_target=500, and analyze has been run. > select * from pg_stat

Re: n_distinct off by a factor of 1000

2020-06-23 Thread Klaudie Willis
I didn't run it with "verbose" but otherwise, yes, several times. I can do it again with verbose if you are interested in the output. Just give me some time. 500M rows 50 columns, is no small job :) K ‐‐‐ Original Message ‐‐‐ On Tuesday, June 23, 2020 2:51 PM, Ron wrote: > Maybe I mis

Re: n_distinct off by a factor of 1000

2020-06-23 Thread Ron
Maybe I missed it, but did you run "ANALYZE VERBOSE bigtable;"? On 6/23/20 7:42 AM, Klaudie Willis wrote: Friends, I run Postgresql 12.3, on Windows. I have just discovered a pretty significant problem with Postgresql and my data.  I have a large table, 500M rows, 50 columns. It is split in 3

n_distinct off by a factor of 1000

2020-06-23 Thread Klaudie Willis
Friends, I run Postgresql 12.3, on Windows. I have just discovered a pretty significant problem with Postgresql and my data. I have a large table, 500M rows, 50 columns. It is split in 3 partitions by Year. In addition to the primary key, one of the columns is indexed, and I do lookups on this.