Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Greg Stark
On Wed, Oct 20, 2010 at 6:41 PM, Josh Berkus wrote: > A while back I did a fair bit of reading on ndistinct and large tables > from the academic literature. The consensus of many papers was that it > took a sample of at least 3% (or 5% for block-based) of the table > in order to have 95% confiden
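
For scale, a rough sketch of the arithmetic with assumed table sizes: the 3%/5% fractions are the ones quoted above, while the fixed sample is the 300 rows per statistics-target point that ANALYZE takes today.

    /* Sketch: size of a 3% (row-based) / 5% (block-based) sample versus the
     * fixed 300 * statistics_target sample, for a few assumed table sizes. */
    #include <stdio.h>

    int main(void)
    {
        long table_rows[] = {1000L, 1000000L, 1000000000L};
        int  stats_target = 100;                 /* default_statistics_target */
        long fixed_sample = 300L * stats_target; /* rows ANALYZE samples now */

        for (int i = 0; i < 3; i++)
        {
            long n = table_rows[i];
            printf("rows=%-11ld fixed=%-6ld 3%%=%-10ld 5%%=%ld\n",
                   n, fixed_sample, (long) (n * 0.03), (long) (n * 0.05));
        }
        return 0;
    }

At a billion rows the 3% sample is a thousand times larger than the fixed sample, which is where the cost objection comes from.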

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Tom Lane
Josh Berkus writes: > A while back I did a fair bit of reading on ndistinct and large tables > from the academic literature. The consensus of many papers was that it > took a sample of at least 3% (or 5% for block-based) of the table in > order to have 95% confidence in ndistinct of 3X. I can't

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Joshua D. Drake
On Wed, 2010-10-20 at 15:15 -0700, Josh Berkus wrote: > >> Maybe what should be done about this is to have separate sizes for the > >> MCV list and the histogram, where the MCV list is automatically sized > >> during ANALYZE. > > It's been suggested multiple times that we should base our sample si

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Nathan Boley
> That one's used, too, but the other is used as an upper bound. > n_distinct tends to come out too small on large tables, so that > formula is prone to overestimation. Actually, both formulas are prone > to overestimation. > Right - thanks. > When this happens depends on the values of a whole b
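
To make the overestimation point concrete, a sketch with made-up numbers: if ANALYZE's ndistinct estimate comes out 10x too small, the per-value selectivity assigned to values outside the MCV list comes out roughly 10x too large.

    /* Sketch (illustrative numbers only): sensitivity of the non-MCV
     * per-value estimate to an underestimated ndistinct. */
    #include <stdio.h>

    static double
    nonmcv_selectivity(double sum_mcv_freq, double ndistinct, double n_mcv)
    {
        return (1.0 - sum_mcv_freq) / (ndistinct - n_mcv);
    }

    int main(void)
    {
        double sum_mcv  = 0.20;     /* assumed: MCVs cover 20% of the rows */
        double n_mcv    = 100;      /* assumed MCV list length */
        double true_nd  = 1000000;  /* true number of distinct values */
        double est_nd   = 100000;   /* a too-small ANALYZE estimate */

        printf("with true ndistinct:      %.8f\n",
               nonmcv_selectivity(sum_mcv, true_nd, n_mcv));
        printf("with estimated ndistinct: %.8f  (~10x too high)\n",
               nonmcv_selectivity(sum_mcv, est_nd, n_mcv));
        return 0;
    }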

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Robert Haas
On Wed, Oct 20, 2010 at 9:53 PM, Nathan Boley wrote: >> Robert explained why having more MCVs might be useful because we use >> the frequency of the least common MCV as an upper bound on the >> frequency of any value in the MCV. > > Where is that being used? var_eq_const > The only non-MCV frequ
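
A simplified sketch of the logic being pointed at, as a paraphrase rather than the actual code in selfuncs.c: an exact MCV match returns its stored frequency, anything else gets the leftover frequency mass spread over the remaining distinct values, and that estimate is clamped so no unlisted value is ever treated as more common than the least common MCV.

    /* Paraphrased sketch of an equality-selectivity estimate with an MCV
     * list; not the PostgreSQL source, just the shape of the calculation. */
    #include <stdio.h>

    static double
    eq_selectivity(const int *mcv_vals, const double *mcv_freqs, int n_mcv,
                   double ndistinct, int constval)
    {
        double sum_mcv = 0.0;
        double min_mcv = 1.0;

        for (int i = 0; i < n_mcv; i++)
        {
            if (mcv_vals[i] == constval)
                return mcv_freqs[i];            /* exact MCV hit */
            sum_mcv += mcv_freqs[i];
            if (mcv_freqs[i] < min_mcv)
                min_mcv = mcv_freqs[i];
        }

        /* not an MCV: share the remaining frequency over the rest ... */
        double sel = (1.0 - sum_mcv) / (ndistinct - n_mcv);

        /* ... but never above the least common MCV's frequency */
        if (sel > min_mcv)
            sel = min_mcv;
        return sel;
    }

    int main(void)
    {
        int    vals[]  = {7, 42, 99};
        double freqs[] = {0.30, 0.10, 0.05};

        printf("MCV value 42:    %.4f\n", eq_selectivity(vals, freqs, 3, 50, 42));
        printf("non-MCV value 5: %.4f\n", eq_selectivity(vals, freqs, 3, 50, 5));
        return 0;
    }

A longer MCV list pushes that clamp (the least common listed frequency) down, which is the argument for more MCVs even when the extra entries are individually uninteresting.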

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Nathan Boley
> Robert explained why having more MCVs might be useful because we use > the frequency of the least common MCV as an upper bound on the > frequency of any value in the MCV. Where is that being used? The only non-MCV frequency estimate that I recall seeing is ( nrows - n_ndistinct_rows )/ndistinct

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Robert Haas
On Wed, Oct 20, 2010 at 7:13 PM, Greg Stark wrote: > On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas wrote: >> Yes, I think a percentage of the table is going to break down either >> at the high end or the low end. Hand-waving (but based on >> experience), for a 1000 row table a statistics target o

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Josh Berkus
> I don't see why the MCVs would need a particularly large sample size > to calculate accurately. Have you done any tests on the accuracy of > the MCV list? Yes, although I don't have them at my fingertips. In sum, though, you can't take 10,000 samples from a 1b row table and expect to get a rem

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Greg Stark
On Wed, Oct 20, 2010 at 6:03 PM, Josh Berkus wrote: > I also just realized that I confused myself ... we don't really want > more MCVs. What we want is more *samples* to derive a small number of > MCVs. Right now # of samples and number of MCVs is inexorably bound, > and they shouldn't be. On l
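
The coupling being complained about, sketched with the 300x multiplier ANALYZE uses: the one statistics target simultaneously sets the sample size, the maximum MCV list length, and the histogram size, so a bigger sample can only be bought together with longer lists.

    /* Sketch: everything hangs off the single statistics target. */
    #include <stdio.h>

    int main(void)
    {
        int targets[] = {10, 100, 1000, 10000};

        for (int i = 0; i < 4; i++)
        {
            int t = targets[i];
            printf("target=%-6d sample_rows=%-8d max_mcvs=%-6d hist_buckets=%d\n",
                   t, 300 * t, t, t);
        }
        return 0;
    }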

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Josh Berkus
> Why? Afaict this has been suggested multiple times by people who don't > justify it in any way except with handwavy -- larger samples are > better. The sample size is picked based on what sample statistics > tells us we need to achieve a given 95th percentile confidence > interval for the bucket
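
The sampling-theory claim underneath that, sketched with an assumed true frequency: the confidence interval on an estimated frequency narrows with the number of sampled rows and does not depend on the table size at all.

    /* Sketch: binomial standard error of an estimated frequency for a few
     * sample sizes; the table size never enters the formula. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p = 0.01;                    /* assumed true frequency */
        long   samples[] = {3000, 30000, 300000};

        for (int i = 0; i < 3; i++)
        {
            long   n  = samples[i];
            double se = sqrt(p * (1.0 - p) / n);
            printf("n=%-7ld estimate ~ %.4f, 95%% interval +/- %.4f\n",
                   n, p, 1.96 * se);
        }
        return 0;
    }

This covers bucket boundaries and MCV frequencies; ndistinct is the exception, since its error really does depend on the fraction of the table seen, which is what the papers cited above are about.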

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Greg Stark
On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas wrote: > Yes, I think a percentage of the table is going to break down either > at the high end or the low end. Hand-waving (but based on > experience), for a 1000 row table a statistics target of 10 is > probably approximately right and 100 is too muc
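
Purely as a hypothetical illustration of "grows with the table, but far slower than a fixed percentage" (no one in the thread proposes this exact formula): a cube-root rule happens to hit the hand-waved data point of a target near 10 for a 1000-row table.

    /* Hypothetical scaling rule, for illustration only. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        long rows[] = {1000L, 1000000L, 1000000000L};

        for (int i = 0; i < 3; i++)
        {
            long   n = rows[i];
            double target = cbrt((double) n);        /* 10, 100, 1000 */
            printf("rows=%-11ld target~%-5.0f sample~%-7.0f vs 3%% of table = %.0f\n",
                   n, target, 300.0 * target, 0.03 * n);
        }
        return 0;
    }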

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Robert Haas
On Wed, Oct 20, 2010 at 6:38 PM, Greg Stark wrote: > On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus wrote: >>>> Maybe what should be done about this is to have separate sizes for the >>>> MCV list and the histogram, where the MCV list is automatically sized >>>> during ANALYZE. >> >> It's been sugg

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Greg Stark
On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus wrote: > >>> Maybe what should be done about this is to have separate sizes for the >>> MCV list and the histogram, where the MCV list is automatically sized >>> during ANALYZE. > > It's been suggested multiple times that we should base our sample size

Re: [HACKERS] default_statistics_target WAS: max_wal_senders must die

2010-10-20 Thread Josh Berkus
>> Maybe what should be done about this is to have separate sizes for the >> MCV list and the histogram, where the MCV list is automatically sized >> during ANALYZE. It's been suggested multiple times that we should base our sample size on a % of the table, or at least offer that as an option. I
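
A hypothetical sketch of what "automatically sized during ANALYZE" could mean (this is not existing PostgreSQL behavior): keep a candidate MCV only while its sampled count stands clearly above what an average value would show, instead of always keeping up to statistics_target entries.

    /* Hypothetical auto-sizing rule for the MCV list, illustration only. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* assumed: counts of the top candidate values in a 30000-row sample,
         * sorted descending, with an assumed ndistinct estimate of 5000 */
        int    counts[] = {900, 410, 120, 15, 11, 10, 10, 9};
        int    n_candidates = 8;
        double avg_count = 30000.0 / 5000.0;
        double threshold = avg_count + 2.0 * sqrt(avg_count);  /* ~2 sd above avg */

        int kept = 0;
        for (int i = 0; i < n_candidates; i++)
        {
            if (counts[i] <= threshold)
                break;              /* counts are sorted, so stop here */
            kept++;
        }
        printf("auto-sized MCV list keeps %d of %d candidates (threshold %.1f)\n",
               kept, n_candidates, threshold);
        return 0;
    }

Under a rule like this the MCV list length would fall out of the data, and the histogram (and the sample behind it) could be sized independently, which is the separation being asked for above.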