On Wed, 2005-03-23 at 14:22 -0500, Keith Browne wrote:
> Simon Riggs wrote:
>
> > The EXPLAINs you've enclosed are for SELECTs, yet your bug report
> > describes INSERTs as being the things that are slow.
> > [You may find better performance from using COPY]
> We're starting with an empty database, creating four tables, and
> populating those tables with a total of 180,000-200,000 rows. Each
> table has a primary key, and several of the tables reference foreign
> keys in other tables. We've written a Python script, using psycopg,
> which executes all the queries to create the tables and insert the rows.
> The database is running on the same machine where the script runs.
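
For concreteness, here is a minimal sketch of the kind of loop being described; the table, columns, and connection string are hypothetical, and it assumes psycopg2-style DB-API calls rather than the actual script:

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # hypothetical parent table and data, for illustration only
    rows = [(i, "name-%d" % i) for i in range(200000)]

    for n, row in enumerate(rows, 1):
        cur.execute("INSERT INTO parent (id, name) VALUES (%s, %s)", row)
        if n % 250 == 0:          # commit in batches of 250, as described
            conn.commit()
    conn.commit()
    conn.close()
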
>
> I've seen similar performance when issuing a COMMIT after each
> insertion and when batching insertions in blocks of 250 per
> COMMIT, so batching the commits is not helping much. I've looked at the
> possibility of using COPY, but in our production environment it will be
> prohibitive to build a flat file with all this data. I'd rather
> generate it on the fly, as we've been able to do with PostgreSQL 7.4.
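
If the only objection to COPY is building a flat file, note that the COPY stream can be generated in memory and fed straight to the backend. A sketch, assuming psycopg2's copy_from and the same hypothetical table:

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # build the COPY stream on the fly instead of writing a file to disk
    buf = io.StringIO()
    for i in range(200000):
        buf.write("%d\tname-%d\n" % (i, i))
    buf.seek(0)

    cur.copy_from(buf, "parent", sep="\t")   # field order must match the table
    conn.commit()
    conn.close()
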
>
> > Also, your tests have compared two systems, so it might be that the
> > hardware or configuration of one system is different from the other.
>
> When running with PostgreSQL 7.4 on a dual-CPU Athlon MP2400+ machine
> with a gigabyte of RAM, running Debian Linux with kernel 2.6.8.1, we were
> able to insert all this data in 5-7 minutes. It's taken a while to
> install Postgres 8.0.1 on the same machine, but now I have, and it's
> taking 40-45 minutes to run the same insert script. This is similar to
> the performance we saw on another machine, a fast single-CPU AMD64 box
> running Gentoo.
>
> I don't think it's a hardware issue. I dug around a bit, and found
> suggestions that this sort of problem could be worked around by breaking
> the database connection and restarting it after the tables had been
> partially filled. I modified our script to break and re-establish the
> database connection after the first 4,000 records have been inserted
> into each table,
> and the performance is greatly improved; it now takes only about 3.5
> minutes to insert 180,000+ rows.
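
A sketch of that workaround, again with hypothetical names; the connection is dropped and re-opened once, after the first 4,000 rows:

    import psycopg2

    def populate(rows, dsn="dbname=test"):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        for n, row in enumerate(rows, 1):
            cur.execute("INSERT INTO parent (id, name) VALUES (%s, %s)", row)
            if n == 4000:
                # the workaround: commit, close, and reconnect so the
                # remaining inserts run on a fresh connection
                conn.commit()
                conn.close()
                conn = psycopg2.connect(dsn)
                cur = conn.cursor()
        conn.commit()
        conn.close()
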
>
> I've since modified this script to build and populate a fifth table with
> over 1.3 million rows. The fifth table has no primary key, but has a
> foreign key referencing one of the first four tables. With the above
> modification (break and re-establish the DB connection after 4,000 rows have
> been inserted), the whole database can be populated in about 15 minutes.
> I wouldn't have dared try to build a one-million-plus-row table until
> I found this speed-up.
>
> > If you could repeat the test on one single system, then this would
> > assist in the diagnosis of this bug report. Also, if you could describe
> > the workload that is giving you a problem more exactly, that would help.
> > Specifically, can you confirm that you have run ANALYZE on the tables,
> > and also give us some idea of numbers of rows in each table at the time
> > you first run your programs.
>
> Just to see if it would help, I tried modifying the script to run an
> ANALYZE against each table after 4,000 insertions, instead of breaking
> and re-establishing the DB connection. I still saw ~45-minute times to
> insert 180,000 rows. I then tried running ANALYZE against each table
> after *each* 4,000 rows inserted, and again, it took about 45 minutes to
> run the insert.
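
For comparison, the ANALYZE variant described above, run every 4,000 rows on the same hypothetical table instead of reconnecting:

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()
    rows = [(i, "name-%d" % i) for i in range(200000)]   # hypothetical data
    for n, row in enumerate(rows, 1):
        cur.execute("INSERT INTO parent (id, name) VALUES (%s, %s)", row)
        if n % 4000 == 0:
            conn.commit()
            cur.execute("ANALYZE parent")   # refresh stats on the partly filled table
    conn.commit()
    conn.close()
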
>
> Each table is empty when I first run the program. I am dropping and
> re-creating the database for each test run.
>
> > There is clearly a problem, but it is not yet clearly a bug. If it is a
> > bug, we're interested in solving it as much as you.
>
> I'd be happy to run further tests or provide more details, if they'll
> help. We now have a workaround that allows us to proceed with our
> project, but I'd like to know if there's another way to do this. While
> I understand that large or complex databases require careful tuning, I
> was surprised to see a six- or seven-fold increase in run times between
> PostgreSQL 7.4 and 8.0.1 on the same hardware, on an operation which
> seems fairly straightforward: populating an empty table.
>
> One other thing which puzzled me: as a test, I tried modifying our
> script to spit out raw SQL statements instead of connecting to the
> database and performing the inserts itself. Normally, our script
> populates two tables in one pass, and then populates the third and
> fourth tables in a second pass. I massaged the SQL by hand to group the
> inserts together by table, so that the first table would be entirely
> populated, then the second, etc. When I ran this SQL script by piping
> it straight into psql, it finished in about four minutes. This is
> comparable to the time it takes to run my modified script which breaks
> and re-establishes the connection to the database.
OK. Not-a-bug.
Your situation is covered in the manual