Re: [BUGS] BUG #1552: massive performance hit between 7.4 and 8.0.1

2005-03-25 Thread Simon Riggs
On Wed, 2005-03-23 at 14:22 -0500, Keith Browne wrote:
> Simon Riggs wrote:
> 
> > The EXPLAINs you've enclosed are for SELECTs, yet your bug report
> > describes INSERTs as being the things that are slow.
> > [You may find better performance from using COPY]

> We're starting with an empty database, creating four tables, and 
> populating those tables with a total of 180,000-200,000 rows.  Each 
> table has a primary key, and several of the tables reference foreign 
> keys in other tables.  We've written a Python script, using psycopg, 
> which executes all the queries to create the tables and insert the rows. 
> The database is running on the same machine where the script runs.
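> 
> (For reference, a minimal sketch of the kind of script described;
> psycopg2 is shown here for concreteness, and the table and column
> names are simplified stand-ins, not the real schema:)
> 
>     import psycopg2
> 
>     conn = psycopg2.connect("dbname=test")
>     cur = conn.cursor()
>     cur.execute("""CREATE TABLE parent (
>                        id   integer PRIMARY KEY,
>                        name text)""")
>     cur.execute("""CREATE TABLE child (
>                        id        integer PRIMARY KEY,
>                        parent_id integer REFERENCES parent(id))""")
>     # Insert row by row, as the loader script does.
>     for i in range(50000):
>         cur.execute("INSERT INTO parent VALUES (%s, %s)", (i, "row %d" % i))
>         cur.execute("INSERT INTO child VALUES (%s, %s)", (i, i))
>     conn.commit()
>     conn.close()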
> 
> I've seen similar performance when issuing a COMMIT after each 
> insertion, and also after batching insertions in blocks of 250 per 
> COMMIT, so batching the commits is not helping much.  I've looked at the 
> possibility of using COPY, but in our production environment it will be 
> prohibitive to build a flat file with all this data.  I'd rather 
> generate it on the fly, as we've been able to do with PostgreSQL 7.4.
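> 
> (For what it's worth, psycopg2's copy_from() accepts any file-like
> object, so the COPY data could in principle be generated on the fly
> into an in-memory buffer rather than a flat file on disk. A sketch,
> using the same hypothetical table as above:)
> 
>     import psycopg2
>     from io import StringIO
> 
>     conn = psycopg2.connect("dbname=test")
>     cur = conn.cursor()
>     # Build tab-separated COPY data in memory, not in a file on disk.
>     buf = StringIO()
>     for i in range(50000):
>         buf.write("%d\trow %d\n" % (i, i))
>     buf.seek(0)
>     cur.copy_from(buf, "parent", columns=("id", "name"))
>     conn.commit()
>     conn.close()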
> 
> > Also, your tests have compared two systems, so it might be that the
> > hardware or configuration of one system is different from the other. 
> 
> When running with PostgreSQL 7.4 on a dual-CPU Athlon MP2400+ machine 
> with a gigabyte of RAM, running Debian Linux (kernel 2.6.8.1), we were 
> able to insert all this data in 5-7 minutes.  It's taken a while to 
> install Postgres 8.0.1 on the same machine, but now I have, and it's 
> taking 40-45 minutes to run the same insert script.  This is similar to 
> the performance we saw on another machine, a fast single-CPU AMD64 box 
> running Gentoo.
> 
> I don't think it's a hardware issue.  I dug around a bit, and found 
> suggestions that this sort of problem could be worked around by breaking 
> the database connection and restarting it after the tables had been 
> partially filled.  I modified our script to break and re-establish the 
> database connection when each table first has 4,000 records inserted, 
> and the performance is greatly improved; it now takes only about 3.5 
> minutes to insert 180,000+ rows.
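> 
> (Roughly, the workaround looks like this; the 4,000-row threshold is
> just the value that worked for us, and the names are the same
> hypothetical ones as in the sketch above:)
> 
>     import psycopg2
> 
>     BATCH = 4000
>     rows = [(i, "row %d" % i) for i in range(180000)]
> 
>     conn = psycopg2.connect("dbname=test")
>     cur = conn.cursor()
>     for i, row in enumerate(rows):
>         cur.execute("INSERT INTO parent VALUES (%s, %s)", row)
>         if i + 1 == BATCH:
>             # Break and re-establish the connection once the table
>             # first has some data in it; the fresh session is
>             # dramatically faster for the remaining inserts.
>             conn.commit()
>             conn.close()
>             conn = psycopg2.connect("dbname=test")
>             cur = conn.cursor()
>     conn.commit()
>     conn.close()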
> 
> I've since modified this script to build and populate a fifth table with 
> over 1.3 million rows.  The fifth table has no primary key, but lists a 
> foreign key into one of the first four tables.  With the above 
> modification (break and re-build the DB connection after 4,000 rows have 
> been inserted), the whole database can be populated in about 15 minutes. 
> I wouldn't have dared try to build a one-million-plus-row table until 
> I found this speed-up.
> 
> > If you could repeat the test on one single system, then this would
> > assist in the diagnosis of this bug report. Also, if you could describe
> > the workload that is giving you a problem more exactly, that would help.
> > Specifically, can you confirm that you have run ANALYZE on the tables,
> and also give us some idea of the number of rows in each table at the time
> > you first run your programs.
> 
> Just to see if it would help, I tried modifying the script to run an 
> ANALYZE against each table after 4,000 insertions, instead of breaking 
> and re-establishing the DB connection.  I still saw ~45-minute times to 
> insert 180,000 rows.  I then tried running ANALYZE against each table 
> after *each* 4,000 rows inserted, and again, it took about 45 minutes to 
> run the insert.
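> 
> (The ANALYZE variant simply replaced the reconnect step in the sketch
> above with something like:)
> 
>     if (i + 1) % 4000 == 0:
>         cur.execute("ANALYZE parent")
>         cur.execute("ANALYZE child")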
> 
> Each table is empty when I first run the program.  I am dropping and 
> re-creating the database for each test run.
> 
> > There is clearly a problem, but it is not yet clearly a bug. If it is a
> > bug, we're interested in solving it as much as you.
> 
> I'd be happy to run further tests or provide more details, if they'll 
> help.  We now have a workaround which is allowing us to proceed with our 
> project, but I'd like to know if there's another way to do this.  While 
> I understand that large or complex databases require careful tuning, I 
> was surprised to see a six- or seven-fold increase in run times between 
> PostgreSQL 7.4 and 8.0.1 on the same hardware, on an operation which 
> seems fairly straightforward: populating an empty table.
> 
> One other thing which puzzled me: as a test, I tried modifying our 
> script to spit out raw SQL statements instead of connecting to the 
> database and performing the inserts itself.  Normally, our script 
> populates two tables in one pass, and then populates the third and 
> fourth tables in a second pass.  I massaged the SQL by hand to group the 
> inserts together by table, so that the first table would be entirely 
> populated, then the second, etc.  When I ran this SQL script by piping 
> it straight into psql, it finished in about four minutes.  This is 
> comparable to the time it takes to run my modified script which breaks 
> and re-establishes the connection to the database.
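> 
> (The SQL-emitting variant is essentially the same loop printing
> statements grouped by table, piped straight into psql, e.g.
> "python gensql.py | psql mydb", where gensql.py and mydb are
> placeholder names:)
> 
>     rows = [(i, "row %d" % i) for i in range(180000)]
>     print("BEGIN;")
>     # First table in full, then the second, rather than interleaved.
>     for i, name in rows:
>         print("INSERT INTO parent VALUES (%d, '%s');" % (i, name))
>     for i, _ in rows:
>         print("INSERT INTO child VALUES (%d, %d);" % (i, i))
>     print("COMMIT;")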

OK. Not-a-bug.

Your situation is covered in the manual.

Re: [BUGS] BUG #1552: massive performance hit between 7.4 and 8.0.1

2005-03-25 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> I think we should spawn a TODO item from this:
> * Coerce FK lookups to always use an available index

No, we aren't doing that.

The correct TODO item is "Replan cached plans when table size has
changed a lot" which of course depends on having a framework to do
replanning at all.  I intend to take a look at that once Neil has
created such a framework ...

regards, tom lane
