Re: [PERFORM] [GENERAL] how to get accurate values in pg_statistic (continued)

2003-09-11 Thread Christopher Browne
[EMAIL PROTECTED] (Bruce Momjian) writes:
> Tom Lane wrote:
>> Mary Edie Meredith <[EMAIL PROTECTED]> writes:
>> > Stephan Szabo kindly responded to our earlier queries suggesting
>> > we look at default_statistics_target and ALTER TABLE ALTER COLUMN
>> > SET STATISTICS.
>> 
>> > These determine the number of bins in the histogram for a given
>> > column.  But for a large number of rows (for example 6 million)
>> > the maximum value (1000) does not guarantee that ANALYZE will do
>> > a full scan of the table.  We do not see a way to guarantee the
>> > same statistics run to run without forcing ANALYZE to examine
>> > every row of every table.
>> 
>> Do you actually still have a problem with the plans changing when
>> the stats target is above 100 or so?  I think the notion of "force
>> ANALYZE to do a full scan" is inherently wrongheaded ... it
>> certainly would not produce numbers that have anything to do with
>> ordinary practice.
>> 
>> If you have data statistics that are so bizarre that the planner
>> still gets things wrong with a target of 1000, then I'd like to
>> know more about why.
>
> Has there been any progress in determining whether the default number
> of buckets (10) is the best value?

I would think this is much more the key to the issue for their
benchmark than issues of correctly replicating the random number
generator.

I'm not clear on how data is collected into the histogram bins;
obviously it's not selecting all 6 million rows, but how many rows is
it?
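
For scale, the comments in backend/commands/analyze.c (which Tom points to in his
reply below) tie the sample size to the statistics target rather than to the table
size.  A rough sketch of that relationship, assuming the roughly 300-rows-per-bin
rule described in those comments:

    # Rough sketch: approximate ANALYZE's sample size as a function of the
    # per-column statistics target, assuming the ~300-rows-per-bin rule
    # described in the analyze.c comments (an assumption, not a spec).
    ROWS_PER_BIN = 300

    def approx_sample_rows(stats_target):
        """Approximate rows sampled for one column, independent of table size."""
        return ROWS_PER_BIN * stats_target

    for target in (10, 100, 1000):
        print(target, approx_sample_rows(target))
    # 10 -> 3,000   100 -> 30,000   1000 -> 300,000 sampled rows

Under that assumption, even the maximum target of 1000 examines only a small
fraction of a 6-million-row table.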

The "right answer" for most use seems likely to involve:

 a) Getting an appropriate number of bins (I suspect 10 is a bit
small, but I can't justify that mathematically), and
 b) Attaching an appropriate sample size to those bins.

What is apparently going wrong with the benchmark (and this can
doubtless arise in "real life," too) is that the random selection is
pulling too few records with the result that some of the bins are
being filled in a "skewed" manner that causes the optimizer to draw
the wrong conclusions.  (I may merely be restating the obvious here,
but if I say it a little differently than it has been said before,
someone may notice the vital "wrong assumption.")

If the samples are crummy, then perhaps:
 - There need to be more bins
 - There need to be more samples

Does the sample size change if you increase the number of bins?  If
not, then having more, smaller bins will lead to them getting
increasingly skewed if there is any accidental skew in the selection.

Do we also need a parameter to control sample size?
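
For reference, the per-column knob mentioned earlier already raises the bin count,
and (per Tom's reply below) the sample size scales with it.  A minimal sketch using
psycopg2, with a hypothetical table, column, and connection string:

    # Minimal sketch (hypothetical table/column/DSN): raise one column's
    # statistics target, then re-ANALYZE so the larger histogram is built.
    import psycopg2

    conn = psycopg2.connect("dbname=benchdb")   # hypothetical database
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100")
    cur.execute("ANALYZE orders")
    cur.close()
    conn.close()
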
-- 
output = reverse("ofni.smrytrebil" "@" "enworbbc")

Christopher Browne
(416) 646 3304 x124 (land)

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] [GENERAL] how to get accurate values in pg_statistic (continued)

2003-09-11 Thread Tom Lane
Christopher Browne <[EMAIL PROTECTED]> writes:
> The "right answer" for most use seems likely to involve:
>  a) Getting an appropriate number of bins (I suspect 10 is a bit
> small, but I can't justify that mathematically), and

I suspect that also, but I don't have real evidence for it either.
We've heard complaints from a number of people for whom it was indeed
too small ... but that doesn't prove it's not appropriate in the
majority of cases ...

> Does the sample size change if you increase the number of bins?

Yes, read the comments in backend/commands/analyze.c.

> Do we also need a parameter to control sample size?

Not if the paper I read before writing that code is correct.

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Reading data in bulk - help?

2003-09-11 Thread Christopher Kings-Lynne
> You want values *much* higher than that.   How much RAM do you have?  See:
> http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html
> http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html

Now THAT is a remarkable document!  I vote for putting that information into
the PostgreSQL documentation tree.

Chris


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Upgrade Woes

2003-09-11 Thread aturner
Thanks for the URL; I went through postgresql.conf and made some modifications to the
config based on the information therein.  I will have to wait and see how it affects
things, as I won't know for a week or so.

Select time has never been a problem; the DB has always been very fast.  It's the
insert time that has been a problem.  I'm not sure how much this is a function of the
drive array sucking, the OS not doing a good job, or the DB getting caught up in
transaction logs.

What does seem odd is that the performance degrades as time goes on, and the space
that the DB files take up increases as well.

The VACUUM FULL is performed once at the end of the whole job.  We could probably get
away with doing this once per week, but in the past I have noticed that if I don't run
it regularly, it seems to take much longer when I do run it.  This has led me to
run it more regularly than not.
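
As Richard suggests in the reply quoted below, a plain VACUUM can often keep up with
this kind of churn if the free space map is sized to cover it, which might let the
nightly VACUUM FULL be dropped.  An illustrative postgresql.conf fragment for the
7.3-era settings he refers to; the numbers are placeholders to be sized against what
VACUUM VERBOSE reports, not recommendations:

    # Illustrative free space map settings (7.3-era parameter names).
    # Size these against the page and relation counts seen in VACUUM VERBOSE
    # output for this database; the values below are placeholders.
    max_fsm_relations = 1000
    max_fsm_pages = 100000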

As for 7.3.3, the project in question suffered a 10x performance degradation on 7.3.3
which went away when we rolled back to 7.3.2.  Almost all the inserts had triggers
which updated stats tables; the database in question was very write-heavy, pretty
much a data warehouse for X10 sensor information which was then mined for analysis.

I had certainly considered building the script to do binary separation style inserts:
split the job in half, insert, and if it fails, split in half again until you get
everything in.  This would probably work okay considering only about two dozen out of
30,000 rows fail.  The only reason not to do that is the time and effort required,
particularly as we are looking at a substantial overhaul of the whole system in the
next 6 months.
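
For what it's worth, the halving approach (spelled out step by step in Richard's
reply quoted below) is only a few lines of script.  A rough sketch, assuming a
hypothetical insert_row() helper that issues a single INSERT and an ordinary DB-API
connection:

    # Rough sketch of binary-separation inserts: try a whole batch in one
    # transaction; on failure, roll back, split the batch in half and retry,
    # so one bad row costs only about log2(batch size) extra attempts.
    # insert_row() and conn are hypothetical stand-ins.
    def insert_batch(conn, rows, errors):
        if not rows:
            return
        try:
            cur = conn.cursor()
            for row in rows:
                insert_row(cur, row)        # hypothetical single-row INSERT
            conn.commit()
        except Exception:
            conn.rollback()
            if len(rows) == 1:
                errors.append(rows[0])      # record the bad row and move on
            else:
                mid = len(rows) // 2
                insert_batch(conn, rows[:mid], errors)
                insert_batch(conn, rows[mid:], errors)

    # Usage: walk the parsed dump in batches of 128 rows.
    # errors = []
    # for i in range(0, len(all_rows), 128):
    #     insert_batch(conn, all_rows[i:i+128], errors)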

Alex Turner


On Wed, Sep 10, 2003 at 07:31:53PM +0100, Richard Huxton wrote:
> On Wednesday 10 September 2003 18:53, [EMAIL PROTECTED] wrote:
> > Hi,
> >
> > My name is Alex Turner and I work for a small Tech company in Pottstown PA.
> >  We run Postgresql on a number of systems for a variety of different
> > applications, and it has been a joy to deal with all around, working fast
> > and reliably for over 2 years.
> >
> > We recently upgraded from RedHat 7.2 to RedHat 9.0, and we are running
> > Postgres 7.3.2 on our Proliant ML370 (Raid 1 2x18 10k, and Raid 5 3x36 10k,
> > 2x866 PIII, 2GB RAM).
> [snip]
> > I have noticed that whilst inserts seem to be slower than before, the
> > vacuum full doesn't seem to take as long overall.
> >
> > postgresql.conf is pretty virgin, and we run postmaster with -B512 -N256
> > -i.  /var/lib/pgsql/data is a symlink to /eda/data, /eda being the mount
> > point for the Raid 5 array.
> 
> First things first then, go to:
> http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
> and read the item on Performance Tuning and the commented postgresql.conf
> 
> > the database isn't huge, storing about 3 properties, and the largest
> > table is 2.1 Million rows for property features.  The dump file is only
> > 221MB. Alas, I did not design the schema, but I have made several 'tweaks'
> > to it to greatly improve read performance allowing us to be the fastest
> provider in the Tristate area.  Unfortunately the job starts at 01:05
> (that's the earliest the dump is available) and runs until completion
> > finishing with a vacuum full.  The vacuum full locks areas of the database
> > long enough that our service is temporarily down.  At the worst point, the
> > vacuum full was starting after 09:00, which our customers didn't
> > appreciate.
> 
> You might be able to avoid a vacuum full by tweaking the *fsm* settings to be 
> able to cope with activity.
> 
> > I'm wondering if there is anything I can do with postgres to allieviate
> > this problem.  Either upgrading to 7.3.4 (although I tried 7.3.3 for
> > another app, and we had to roll back to 7.3.2 because of performance
> > problems), 
> 
> Hmm - can't think what would have changed radically between 7.3.2 and 7.3.3;
> upgrading to .4 is probably sensible.
> 
> [snip]
> > Any help/suggestions would be greatly appreciated,
> 
> You say that each insert/update is a separate transaction. I don't know how 
> much "bad" data you get in the dump, but you might be able to do something 
> like:
> 
> 1. Set batch size to 128 items
> 2. Read batch-size rows from the dump
> 3. Try to insert/update the batch. If it works, move along by the size of the 
> batch and back to #1
> 4. If batch-size=1, record error, move along one row and back to #1
> 5. If batch-size>1, halve batch-size and go back to #3
> 
> Your initial batch-size will depend on how many errors there are (but 
> obviously use a power of 2).
> 
> You could also run an ordinary vacuum every 1000 rows or so (number depends on 
> your *fsm* settings as mentioned above).
> 
> You might also want to try a REINDEX once a night/week too.
> -- 
>   Richard Huxton
>   Archonet Ltd
> 

Re: [PERFORM] Upgrade Woes

2003-09-11 Thread Tom Lane
[EMAIL PROTECTED] writes:
> As for 7.3.3, the project in question suffered a 10x performance
> degredation on 7.3.3 which went away when we rolled back to 7.3.2.

I would like to pursue that report and find out why.  I've just gone
through the CVS logs between 7.3.2 and 7.3.3, and I don't see any change
that would explain a 10x slowdown.  Can you provide more details about
exactly what slowed down?

Also, what PG version were you using on the old RedHat 7.2 installation?

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Upgrade Woes

2003-09-11 Thread Jeff
On Thu, 11 Sep 2003, [EMAIL PROTECTED] wrote:

>
> The Vacuum full is performed once at the end of the whole job.
>
have you also tried running VACUUM ANALYZE periodically?  It does not lock the
table and can help quite a bit.
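
A plain (non-FULL) VACUUM ANALYZE is also easy to script between load steps.  A
minimal psycopg2 sketch with a hypothetical database and table name; the one thing
to remember is that VACUUM has to run outside a transaction block:

    # Minimal sketch: run a plain VACUUM ANALYZE from a load script.
    # VACUUM cannot run inside a transaction block, so enable autocommit.
    import psycopg2

    conn = psycopg2.connect("dbname=listings")       # hypothetical database
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("VACUUM ANALYZE property_features")  # hypothetical table
    cur.close()
    conn.close()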

It's still odd that it would be that much slower between those versions.

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] [GENERAL] how to get accurate values in pg_statistic

2003-09-11 Thread Christopher Browne
[EMAIL PROTECTED] ("scott.marlowe") writes:
> On Thu, 11 Sep 2003, Tom Lane wrote:
>
>> Christopher Browne <[EMAIL PROTECTED]> writes:
>> > The "right answer" for most use seems likely to involve:
>> >  a) Getting an appropriate number of bins (I suspect 10 is a bit
>> > small, but I can't justify that mathematically), and
>> 
>> I suspect that also, but I don't have real evidence for it either.
>> We've heard complaints from a number of people for whom it was indeed
>> too small ... but that doesn't prove it's not appropriate in the
>> majority of cases ...
>> 
>> > Does the sample size change if you increase the number of bins?
>> 
>> Yes, read the comments in backend/commands/analyze.c.
>> 
>> > Do we also need a parameter to control sample size?
>> 
>> Not if the paper I read before writing that code is correct.
>
> I was just talking to a friend of mine who does statistical analysis, and 
> he suggested a different way of looking at this.  I know little of
> analyze.c, but I'll be reading it some today.
>
> His theory was that we can figure out the number of target bins by 
> basically running analyze twice with two different random seeds, and 
> initially setting the bins to 10.
>
> Then, compare the variance of the two runs.  If the variance is great,
> increase the target by X, and run the two again.  Repeat, wash, rinse, until
> the variance drops below some threshold.
>
> I like the idea; I'm not at all sure it's practical for PostgreSQL to
> implement it.

It may suffice to do some analytic runs on some "reasonable datasets"
in order to come up with a better default than 10.

If you run this process a few times on some different databases and
find that the variance keeps dropping pretty quickly, then that would
be good material for arguing that 10 should change to 17 or 23 or 31
or some such value.  (The only interesting pattern there is that
those are all primes :-).)
-- 
output = ("cbbrowne" "@" "libertyrms.info")

Christopher Browne
(416) 646 3304 x124 (land)

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Upgrade Woes

2003-09-11 Thread aturner
In the case with the performance problem, the machine was running RedHat AS 2.1.  I
have posted the database schema (obtained from pg_dump -s) at:

http://serverbeach.plexq.com/~aturner/schema.sql

The performance of all the stats procedures dropped through the floor:
refresh_hourly_iud, adl_hourly_iud, rebuild_daily_total, etc.  There is a Python script
that calls the proc once for each hour or day.  When running the historical calc job
for a 7-day period back, it would crawl on 7.3.3.  We started benching the drive array
and found other issues with the system in the meantime (like the drive array was
giving us 10MB/sec write speed; the guy who set it up did not enable write caching).
Once it was reconfigured, the DB performance did not improve much (bonnie++ was used
to verify the RAID array speed).

Alex Turner

On Thu, Sep 11, 2003 at 11:47:32AM -0400, Tom Lane wrote:
> [EMAIL PROTECTED] writes:
> > As for 7.3.3, the project in question suffered a 10x performance
> > degredation on 7.3.3 which went away when we rolled back to 7.3.2.
> 
> I would like to pursue that report and find out why.  I've just gone
> through the CVS logs between 7.3.2 and 7.3.3, and I don't see any change
> that would explain a 10x slowdown.  Can you provide more details about
> exactly what slowed down?
> 
> Also, what PG version were you using on the old RedHat 7.2 installation?
> 
>   regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] [GENERAL] how to get accurate values in pg_statistic

2003-09-11 Thread scott.marlowe
On Thu, 11 Sep 2003, Tom Lane wrote:

> Christopher Browne <[EMAIL PROTECTED]> writes:
> > The "right answer" for most use seems likely to involve:
> >  a) Getting an appropriate number of bins (I suspect 10 is a bit
> > small, but I can't justify that mathematically), and
> 
> I suspect that also, but I don't have real evidence for it either.
> We've heard complaints from a number of people for whom it was indeed
> too small ... but that doesn't prove it's not appropriate in the
> majority of cases ...
> 
> > Does the sample size change if you increase the number of bins?
> 
> Yes, read the comments in backend/commands/analyze.c.
> 
> > Do we also need a parameter to control sample size?
> 
> Not if the paper I read before writing that code is correct.

I was just talking to a friend of mine who does statistical analysis, and 
he suggested a different way of looking at this.  I know little of
analyze.c, but I'll be reading it some today.

His theory was that we can figure out the number of target bins by 
basically running analyze twice with two different random seeds, and 
initially setting the bins to 10.

Then, compare the variance of the two runs.  If the variance is great,
increase the target by X, and run the two again.  Repeat, wash, rinse, until
the variance drops below some threshold.

I like the idea; I'm not at all sure it's practical for PostgreSQL to
implement it.
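
For concreteness, a rough sketch of that two-seed procedure.  run_analyze() is a
hypothetical stand-in for re-running ANALYZE's sampling with a fixed seed and
returning the column's histogram bounds:

    # Rough sketch of the two-seed idea: analyze twice at the same target with
    # different random seeds, measure how much the two histograms disagree,
    # and raise the target until the disagreement falls under a threshold.
    # run_analyze(target, seed) is hypothetical: it should return the column's
    # histogram bounds as a list of floats for that target and seed.
    def pick_stats_target(run_analyze, threshold=0.01, step=10, max_target=1000):
        target = 10
        while target <= max_target:
            a = run_analyze(target, seed=1)
            b = run_analyze(target, seed=2)
            # mean squared disagreement between corresponding bin boundaries,
            # normalized by the overall value range so it is scale-free
            spread = (max(a + b) - min(a + b)) or 1.0
            disagreement = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
            if disagreement / (spread ** 2) < threshold:
                return target
            target += step
        return max_target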


---(end of broadcast)---
TIP 8: explain analyze is your friend