On Wed, Oct 20, 2010 at 6:41 PM, Josh Berkus wrote:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature. The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confiden
Josh Berkus writes:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature. The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confidence in ndistinct of 3X. I can't
On Wed, 2010-10-20 at 15:15 -0700, Josh Berkus wrote:
> >> Maybe what should be done about this is to have separate sizes for the
> >> MCV list and the histogram, where the MCV list is automatically sized
> >> during ANALYZE.
>
> It's been suggested multiple times that we should base our sample si
> That one's used, too, but the other is used as an upper bound.
> n_distinct tends to come out too small on large tables, so that
> formula is prone to overestimation. Actually, both formulas are prone
> to overestimation.
>
Right - thanks.
> When this happens depends on the values of a whole b
On Wed, Oct 20, 2010 at 9:53 PM, Nathan Boley wrote:
>> Robert explained why having more MCVs might be useful because we use
>> the frequency of the least common MCV as an upper bound on the
>> frequency of any value in the MCV.
>
> Where is that being used?
var_eq_const
> The only non-MCV frequ
> Robert explained why having more MCVs might be useful because we use
> the frequency of the least common MCV as an upper bound on the
> frequency of any value in the MCV.
Where is that being used? The only non-MCV frequency estimate that I
recall seeing is ( nrows - n_ndistinct_rows )/ndistinct
On Wed, Oct 20, 2010 at 7:13 PM, Greg Stark wrote:
> On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas wrote:
>> Yes, I think a percentage of the table is going to break down either
>> at the high end or the low end. Hand-waving (but based on
>> experience), for a 1000 row table a statistics target o
> I don't see why the MCVs would need a particularly large sample size
> to calculate accurately. Have you done any tests on the accuracy of
> the MCV list?
Yes, although I don't have them at my fingertips. In sum, though, you
can't take 10,000 samples from a 1b row table and expect to get a
rem
On Wed, Oct 20, 2010 at 6:03 PM, Josh Berkus wrote:
> I also just realized that I confused myself ... we don't really want
> more MCVs. What we want it more *samples* to derive a small number of
> MCVs. Right now # of samples and number of MCVs is inexorably bound,
> and they shouldn't be. On l
> Why? Afaict this has been suggested multiple times by people who don't
> justify it in any way except with handwavy -- larger samples are
> better. The sample size is picked based on what sample statistics
> tells us we need to achieve a given 95th percentile confidence
> interval for the bucket
On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas wrote:
> Yes, I think a percentage of the table is going to break down either
> at the high end or the low end. Hand-waving (but based on
> experience), for a 1000 row table a statistics target of 10 is
> probably approximately right and 100 is too muc
On Wed, Oct 20, 2010 at 6:38 PM, Greg Stark wrote:
> On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus wrote:
Maybe what should be done about this is to have separate sizes for the
MCV list and the histogram, where the MCV list is automatically sized
during ANALYZE.
>>
>> It's been sugg
On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus wrote:
>
>>> Maybe what should be done about this is to have separate sizes for the
>>> MCV list and the histogram, where the MCV list is automatically sized
>>> during ANALYZE.
>
> It's been suggested multiple times that we should base our sample size
>> Maybe what should be done about this is to have separate sizes for the
>> MCV list and the histogram, where the MCV list is automatically sized
>> during ANALYZE.
It's been suggested multiple times that we should base our sample size
on a % of the table, or at least offer that as an option. I
14 matches
Mail list logo