Tom Lane wrote:
One objection to this is that after moving "off the gold standard" of
1.0 = one page fetch, there is no longer any clear meaning to the
cost estimate units; you're faced with the fact that they're just an
arbitrary scale. I'm not sure that's such a bad thing, though.
It seems
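Once 1.0 no longer means "one sequential page fetch", only the ratios between the cost parameters matter: multiply every parameter by the same factor and every estimate scales with it, but the planner still chooses the same plans. A minimal sketch, assuming a settable sequential-fetch cost (the sequential page cost was hardwired to 1.0 at the time; the other defaults shown are the contemporary ones):

    -- Scaling every cost parameter by the same factor (10x here, purely
    -- illustrative) rescales all EXPLAIN cost numbers but cannot change
    -- which plan wins, since costs are only compared against each other.
    SET seq_page_cost        = 10.0;   -- hypothetical knob; hardwired to 1.0 then
    SET random_page_cost     = 40.0;   -- default 4.0
    SET cpu_tuple_cost       = 0.1;    -- default 0.01
    SET cpu_index_tuple_cost = 0.01;   -- default 0.001
    SET cpu_operator_cost    = 0.025;  -- default 0.0025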
On Jun 4, 2006, at 5:09 PM, Tom Lane wrote:
Greg Stark <[EMAIL PROTECTED]> writes:
Hannu Krosing <[EMAIL PROTECTED]> writes:
On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
Might also be worth adding analyze delay settings, ala
vacuum_cost_delay.
ANALYZE already respects the vacuum delay settings.
On Sun, 2006-06-04 at 18:09, Tom Lane wrote:
> Greg Stark <[EMAIL PROTECTED]> writes:
> > Hannu Krosing <[EMAIL PROTECTED]> writes:
> >> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
> >>> Might also be worth adding analyze delay settings, ala
> >>> vacuum_cost_delay.
Greg Stark <[EMAIL PROTECTED]> writes:
> Hannu Krosing <[EMAIL PROTECTED]> writes:
>> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
>>> Might also be worth adding analyze delay settings, ala
>>> vacuum_cost_delay.
ANALYZE already respects the vacuum delay settings.
>> Actually we should have delay settings for all potential
Hannu Krosing <[EMAIL PROTECTED]> writes:
> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
>
> > Might also be worth adding analyze delay settings, ala
> > vacuum_cost_delay.
>
> Actually we should have delay settings for all potential
> (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD CONSTRAINT, maybe more
Greg Stark <[EMAIL PROTECTED]> writes:
> I see improvements up to around 256M blocks or so, but my data is pretty
> questionable since I'm busy watching TV in MythTV in another window. It's on
> another drive but it still seems to be making the numbers jump around a bit.
Oops, I meant blocks of 2
Greg Stark <[EMAIL PROTECTED]> writes:
> Perhaps what this indicates is that the real meat is in track sampling, not
> block sampling.
Fwiw, I've done a little benchmarking and I'm starting to think this isn't a
bad idea. I see a dramatic speed improvement for samples of 1-10% as the block
size
Hannu Krosing <[EMAIL PROTECTED]> writes:
> Disks can read at full rotation speed, so skipping (not reading) some
> blocks will not make reading the remaining blocks from the same track
> faster. And if there are more than 20 8k pages per track, you still have
> a very high probability that you need at least one page from almost every track.
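A quick back-of-the-envelope check, assuming a 5% page sample and 20 pages per track: the chance that a given track contributes no sampled page at all is (1 - 0.05)^20 ≈ 0.36, so roughly two thirds of all tracks still have to be touched for at least one page. Since the drive delivers the whole track at rotation speed anyway, the I/O cost of a 5% block sample ends up close to that of a full scan, which matches the benchmark result reported elsewhere in this thread.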
pgbench appears to already support arbitrary SQL queries with the -f
switch, so why couldn't we just make it a little smarter and have people
enable SQL query logging for a day or two, then pass the log off to
pgbench:
pgbench -f
Seems to me like that wouldn't be too difficult to do, and would g
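A minimal sketch of how that could look, assuming the logged statements are simply collected into a plain SQL script; the file name, flags, and queries below are illustrative, not from the original mail:

    -- replay.sql: statements distilled from a day of query logs
    -- run with:  pgbench -n -c 10 -t 1000 -f replay.sql mydb
    SELECT * FROM orders WHERE customer_id = 4242;
    UPDATE accounts SET balance = balance - 10 WHERE aid = 17;
    SELECT count(*) FROM orders WHERE placed_at > now() - interval '1 day';

pgbench then runs the script's statements as its transaction instead of the built-in TPC-B-like test, so the benchmark is shaped by the site's real queries.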
On Fri, 2006-06-02 at 16:23, Greg Stark wrote:
> And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> 5% block sampling took just as long as reading all the blocks. Even if we
> figure out what's causing that (IMHO surprising) result and improve matters
On Sat, 2006-06-03 at 10:43, Jim Nasby wrote:
> Might also be worth adding analyze delay settings, ala
> vacuum_cost_delay.
Actually we should have delay settings for all potential
(almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD
CONSTRAINT, maybe more - s
On Jun 2, 2006, at 5:24 PM, Todd A. Cook wrote:
Josh Berkus wrote:
Greg, Tom,
But for most users analyze doesn't really have to run as often as
vacuum. One sequential scan per night doesn't seem like that big a deal
to me.
Clearly you don't have any 0.5 TB databases.
Perhaps something like "ANALYZE FULL"?
Rod Taylor <[EMAIL PROTECTED]> writes:
>> One objection to this is that after moving "off the gold standard" of
>> 1.0 = one page fetch, there is no longer any clear meaning to the
>> cost estimate units; you're faced with the fact that they're just an
>> arbitrary scale. I'm not sure that's such a bad thing, though.
> One objection to this is that after moving "off the gold standard" of
> 1.0 = one page fetch, there is no longer any clear meaning to the
> cost estimate units; you're faced with the fact that they're just an
> arbitrary scale. I'm not sure that's such a bad thing, though. For
> instance, some
Josh Berkus writes:
> Greg, Tom,
>
> > But for most users analyze doesn't really have to run as often as
> > vacuum. One sequential scan per night doesn't seem like that big a deal
> > to me.
>
> Clearly you don't have any 0.5 TB databases.
Actually I did, not so long ago.
Sequential scans
I wrote:
> In general it seems to me that for CPU-bound databases, the default values
> of the cpu_xxx_cost variables are too low. ... rather than telling people
> to manipulate all three of these variables individually, I think it might
> also be a good idea to provide a new GUC variable named so
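For concreteness, these are the three parameters in question; a single scaling GUC would in effect apply one multiplier to all of them rather than asking people to set each one by hand. The 5x factor below is an arbitrary illustration, not a recommended value:

    -- Raising the CPU cost parameters together, in proportion, which is
    -- what a combined scaling knob would do behind the scenes.
    SET cpu_tuple_cost       = 0.05;    -- default 0.01
    SET cpu_index_tuple_cost = 0.005;   -- default 0.001 at the time
    SET cpu_operator_cost    = 0.0125;  -- default 0.0025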
Josh Berkus wrote:
Greg, Tom,
But for most users analyze doesn't really have to run as often as
vacuum. One sequential scan per night doesn't seem like that big a deal
to me.
Clearly you don't have any 0.5 TB databases.
Perhaps something like "ANALYZE FULL"? Then only those who need the
Greg, Tom,
> But for most users analyze doesn't really have to run as often as
> vacuum. One sequential scan per night doesn't seem like that big a deal
> to me.
Clearly you don't have any 0.5 TB databases.
> > I'd still be worried about the CPU pain though. ANALYZE can afford to
> > expend a
Tom Lane <[EMAIL PROTECTED]> writes:
> Greg Stark <[EMAIL PROTECTED]> writes:
> > And a 5% sample is pretty big. In fact my tests earlier showed the i/o
> > from
> > 5% block sampling took just as long as reading all the blocks. Even if we
> > figure out what's causing that (IMHO surprising) result and improve matters
On Fri, Jun 02, 2006 at 01:39:32PM -0700, Michael Dean wrote:
> I'm sorry to interrupt your esoteric (to me) discussion, but I have
> a very simple question: would you define a "good unbiased sample"?
> My statistics professor Dan Price (rest his soul) would tell me
> there are only random sample
Greg Stark <[EMAIL PROTECTED]> writes:
> And a 5% sample is pretty big. In fact my tests earlier showed the i/o from
> 5% block sampling took just as long as reading all the blocks. Even if we
> figure out what's causing that (IMHO surprising) result and improve matters I
> would only expect it t
Greg Stark wrote:
Josh Berkus writes:
Using a variety of synthetic and real-world data sets, we show that
distinct sampling gives estimates for distinct values queries that
are within 0%-10%, whereas previous methods were typically 50%-250% off,
across the spectrum of data sets and queries studied.
Josh Berkus writes:
> > Using a variety of synthetic and real-world data sets, we show that
> > distinct sampling gives estimates for distinct values queries that
> > are within 0%-10%, whereas previous methods were typically 50%-250% off,
> > across the spectrum of data sets and queries studied.
Greg,
> Using a variety of synthetic and real-world data sets, we show that
> distinct sampling gives estimates for distinct values queries that
> are within 0%-10%, whereas previous methods were typically 50%-250% off,
> across the spectrum of data sets and queries studied.
Aha. It's a
David Fetter <[EMAIL PROTECTED]> writes:
> > In the prior discussions someone posted the paper with the algorithm
> > I mentioned. That paper mentions that previous work showed poor
> > results at estimating n_distinct even with sample sizes as large as
> > 50% or more.
>
> Which paper? People
Josh, Greg, and Tom,
I do not know how sensitive the plans will be to the correlation,
but one thought might be to map the histogram X histogram correlation
to a square grid of values. Then you can map them to an integer which
would give you 8 x 8 with binary values, a 5 x 5 with 4 values per
point.
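The storage arithmetic is presumably the point of those particular grid sizes: an 8 x 8 grid of binary cells takes 8 * 8 * 1 = 64 bits, and a 5 x 5 grid with 4 levels per cell takes 5 * 5 * 2 = 50 bits, so either encoding packs into a single 64-bit integer per column pair.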
Hi All,
Just a small comment from a mortal user.
On Thursday 01 June 2006 19:28, Josh Berkus wrote:
> 5. random_page_cost (as previously discussed) is actually a function of
> relatively immutable hardware statistics, and as such should not need to
> exist as a GUC once the cost model is fixed.
I
Mark Kirkwood <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> With this model, the disk cost to fetch a single
>> index entry will be estimated as random_page_cost (default 4.0) rather
>> than the current fixed 2.0. This shouldn't hurt things too much for
>> simple indexscans --- especially since
Tom Lane wrote:
Another thing that's bothering me is that the index access cost computation
(in genericcostestimate) is looking sillier and sillier:
/*
* Estimate the number of index pages that will be retrieved.
*
* For all currently-supported index types, the first page of
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote:
>
> Josh Berkus writes:
>
> > Greg, Tom,
> >
> > > a) We already use block based sampling to reduce overhead. If
> > > you're talking about using the entire block and not just
> > > randomly sampled tuples from within those blocks then your sample will be biased.
Josh Berkus writes:
> Greg, Tom,
>
> > a) We already use block based sampling to reduce overhead. If you're
> > talking about using the entire block and not just randomly sampled
> > tuples from within those blocks then your sample will be biased.
>
> There are actually some really good equations to work with this
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> Speaking of plan instability, something that's badly needed is the
> ability to steer away from query plans that *might* be the most optimal,
> but also will fail horribly should the cost estimates be wrong.
You sure that doesn't leave us with the empty set?
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote:
> These would all be nice things to know, but I'm afraid it's pie in the
> sky. We have no reasonable way to get those numbers. (And if we could
> get them, there would be another set of problems, namely plan instability:
> the planner's c
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote:
>
> Josh Berkus writes:
>
> > 1. n-distinct estimation is bad, as previously discussed;
> >
> > 2. our current heuristic sampling methods prevent us from sampling more than
> > 0.5% of any reasonably large table, causing all statistics on those tables to
> > be off
Greg,
> > 1) You have n^2 possible two-column combinations. That's a lot of
> > processing and storage.
>
> Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2.
Ooops, bad math. Andrew pointed out it's actually n*(n-1)/2, not n!.
Also, we could omit columns unlikely to correlate.
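For scale, n*(n-1)/2 pairs means a 100-column table has 100 * 99 / 2 = 4950 candidate column pairs to analyze and store, versus 10,000 under the n^2 estimate; n! grows far faster than either.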
Greg, Tom,
> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.
There are actually some really good equations to work with this, estimating
bo
Josh Berkus writes:
> > However it will only make sense if people are willing to accept that
> > analyze will need a full table scan -- at least for tables where the DBA
> > knows that good n_distinct estimates are necessary.
>
> What about block-based sampling? Sampling 1 in 20 disk pages, r
Josh Berkus writes:
> Yeah. I've refrained from proposing changes because it's a
> pick-up-sticks. If we start modifying the model, we need to fix
> *everything*, not just one item. And then educate our users that they
> need to use the GUC variables in a different way. Here are the issues I
Greg,
> I'm convinced these two are more connected than you believe.
Actually, I think they are inseparable.
> I might be interested in implementing that algorithm that was posted a
> while back that involved generating good unbiased samples of discrete
> values. The algorithm was quite clever a
Josh Berkus writes:
> 1. n-distinct estimation is bad, as previously discussed;
>
> 2. our current heuristic sampling methods prevent us from sampling more than
> 0.5% of any reasonably large table, causing all statistics on those tables to
> be off for any table with irregular distribution of
Tom,
As you know, this is something I think about a bit too, though not
nearly as deeply as you.
In general it seems to me that for CPU-bound databases, the default values
of the cpu_xxx_cost variables are too low. I am tempted to raise the
default value of cpu_index_tuple_cost to 0.005, whi
I've been thinking about the planner's costing problems again,
particularly in connection with Philippe Lang's complaint here:
http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php
Investigation showed that the planner is much too optimistic about the
usefulness of a multi-index BitmapAnd plan.