Re: [PERFORM] COPY vs INSERT

2005-05-05 Thread Kris Jurka


On Wed, 4 May 2005, Mischa Sandberg wrote:

> Quoting Kris Jurka <[EMAIL PROTECTED]>: 
>  
> > Not true.  A client may send any number of Bind/Execute messages on 
> > a prepared statement before a Sync message.

> Hunh. Interesting optimization in the JDBC driver. I gather it is 
> sending a string of (;)-separated inserts.

No, it uses the V3 protocol with a prepared statement and Bind/Execute 
messages, as I mentioned.

> Sounds like efficient-but-risky stuff we did for ODBC drivers at Simba
> ... gets interesting when one of the insert statements in the middle
> fails.

When running inside a transaction (as you really want to do anyway when
bulk loading) the behavior is well defined; it is a little odd for autocommit
mode, though.  In autocommit mode the transaction boundary is at the Sync
message, not at the individual Execute messages, so an error will also roll
back the other inserts batched before the same Sync.  The JDBC spec is poorly
defined in this area, so we can get away with this.
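
For illustration, a rough SQL-level analogue of that batching, using a
hypothetical table; the explicit transaction is what makes the failure
behavior well defined:

PREPARE bulk_ins (int, text) AS
  INSERT INTO demo_tbl (id, val) VALUES ($1, $2);
BEGIN;
EXECUTE bulk_ins (1, 'a');
EXECUTE bulk_ins (2, 'b');
EXECUTE bulk_ins (3, 'c');  -- if any EXECUTE fails, the whole batch rolls back
COMMIT;
DEALLOCATE bulk_ins;

The driver does the equivalent with protocol-level Bind/Execute messages
rather than SQL-level EXECUTE, so the statement text is not re-parsed for
every row.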

> Good to know. Hope that the batch size is parametric, given that
> you can have inserts with rather large strings bound to 'text' columns
> in PG --- harder to identify BLOBs when talking to PG, than when talking
> to MSSQL/Oracle/Sybase.

The batch size is not a parameter and I don't think it needs to be.  The 
issue of filling both sides of network buffers and deadlocking only needs 
to be avoided on one side.  The response to an insert request is small and 
not dependent on the size of the data sent, so we can send as much as we 
want as long as the server doesn't send much back to us.

Kris Jurka



[PERFORM] Bad choice of query plan from PG 7.3.6 to PG 7.3.9 part 1

2005-05-05 Thread Jona
Hi
I'm currently experiencing problems with long query execution times.
What I believe makes these problems particularly interesting is the 
difference in execution plans between our test server running PostgreSQL 
7.3.6 and our production server running PostgreSQL 7.3.9.
The test server is an upgraded "home machine", a Pentium 4 with 1GB of 
memory and IDE disk.
The production server is a dual CPU XEON Pentium 4 with 2GB memory and 
SCSI disks.
One should expect the production server to be faster, but apparently it is 
not, as the query plans outlined below show.

My questions can be summed up as:
1) How come the query plans between the 2 servers are different?
2) How come the production server in general estimates the cost of the 
query plans so horribly wrong? (i.e. it chooses a bad query plan whereas 
the test server chooses a good plan)
3) In Query 2, how come the production server refuses to use its indexes 
(subcat_uq and aff_price_uq, both unique indexes) whereas the test 
server determines that the indexes are the way to go?
4) In Query 3, how come the test server refuses to use its index 
(sct2lang_uq) and the production server uses it? And why is the test 
server still faster even though it makes a sequential scan of a table 
with 8.5k records?

Please note that a VACUUM ANALYZE is run on the production server once a day 
(it used to be once an hour but it seemed to make no difference); however, 
there are generally no writes to the tables used in the queries.

If anyone could shed some light on these issues I would truly appreciate 
it.

Cheers
Jona
PS. Please refer to part 2 for the other queries and query plans
 

Query 1:
EXPLAIN ANALYZE
SELECT DISTINCT StatConTrans_Tbl.id, Code_Tbl.sysnm AS code, 
PriceCat_Tbl.amount AS price, Country_Tbl.currency,
 CreditsCat_Tbl.amount AS credits, Info_Tbl.title, Info_Tbl.description
FROM (SCT2SubCatType_Tbl
INNER JOIN SCT2Lang_Tbl ON SCT2SubCatType_Tbl.sctid = SCT2Lang_Tbl.sctid
INNER JOIN Language_Tbl ON SCT2Lang_Tbl.langid = Language_Tbl.id AND 
Language_Tbl.sysnm = UPPER('us') AND Language_Tbl.enabled = true
INNER JOIN Info_Tbl ON SCT2SubCatType_Tbl.sctid = Info_Tbl.sctid AND 
Language_Tbl.id = Info_Tbl.langid
INNER JOIN SubCatType_Tbl ON SCT2SubCatType_Tbl.subcattpid = 
SubCatType_Tbl.id AND SubCatType_Tbl.enabled = true
INNER JOIN CatType_Tbl ON SubCatType_Tbl.cattpid = CatType_Tbl.id AND 
CatType_Tbl.enabled = true
INNER JOIN SuperCatType_Tbl ON CatType_Tbl.spcattpid = 
SuperCatType_Tbl.id AND SuperCatType_Tbl.enabled = true
INNER JOIN StatConTrans_Tbl ON SCT2SubCatType_Tbl.sctid = 
StatConTrans_Tbl.id AND StatConTrans_Tbl.enabled = true
INNER JOIN Price_Tbl ON StatConTrans_Tbl.id = Price_Tbl.sctid AND 
Price_Tbl.affid = 8
INNER JOIN PriceCat_Tbl ON Price_Tbl.prccatid = PriceCat_Tbl.id AND 
PriceCat_Tbl.enabled = true
INNER JOIN Country_Tbl ON PriceCat_Tbl.cntid = Country_Tbl.id AND 
Country_Tbl.enabled = true
INNER JOIN CreditsCat_Tbl ON Price_Tbl.crdcatid = CreditsCat_Tbl.id AND 
CreditsCat_Tbl.enabled = true
INNER JOIN StatCon_Tbl ON StatConTrans_Tbl.id = StatCon_Tbl.sctid AND 
StatCon_Tbl.ctpid = 1
INNER JOIN Code_Tbl ON SuperCatType_Tbl.id = Code_Tbl.spcattpid AND 
Code_Tbl.affid = 8 AND Code_Tbl.cdtpid = 1)
WHERE SCT2SubCatType_Tbl.subcattpid = 79
ORDER BY StatConTrans_Tbl.id DESC
LIMIT 8 OFFSET 0

Plan on PostgreSQL 7.3.6 on Red Hat Linux 3.2.3-39
"Limit  (cost=178.59..178.61 rows=1 width=330) (actual time=22.77..28.51 
rows=4 loops=1)"
"  ->  Unique  (cost=178.59..178.61 rows=1 width=330) (actual 
time=22.77..28.50 rows=4 loops=1)"
"->  Sort  (cost=178.59..178.60 rows=1 width=330) (actual 
time=22.76..22.85 rows=156 loops=1)"
"  Sort Key: statcontrans_tbl.id, code_tbl.sysnm, 
pricecat_tbl.amount, country_tbl.currency, creditscat_tbl.amount, 
info_tbl.title, info_tbl.description"
"  ->  Hash Join  (cost=171.19..178.58 rows=1 width=330) 
(actual time=3.39..6.55 rows=156 loops=1)"
"Hash Cond: ("outer".cntid = "inner".id)"
"->  Nested Loop  (cost=170.13..177.51 rows=1 
width=312) (actual time=3.27..5.75 rows=156 loops=1)"
"  Join Filter: ("inner".sctid = "outer".sctid)"
"  ->  Hash Join  (cost=170.13..171.48 rows=1 
width=308) (actual time=3.12..3.26 rows=4 loops=1)"
"Hash Cond: ("outer".crdcatid = 
"inner".id)"
"->  Hash Join  (cost=169.03..170.38 
rows=1 width=300) (actual time=3.00..3.11 rows=4 loops=1)"
"  Hash Cond: ("outer".spcattpid = 
"inner".spcattpid)"
"  ->  Hash Join  
(cost=167.22..168.56 rows=1 width=253) (actual time=2.88..2.97 rows=4 
loops=1)"
"Hash Cond: ("outer".id = 
"inner".prccatid)"
"->  Seq Scan 

Re: [PERFORM] Bad choice of query plan from PG 7.3.6 to PG 7.3.9 part 1

2005-05-05 Thread Tom Lane
Jona <[EMAIL PROTECTED]> writes:
> I'm currently experiencing problems with long query execution times.
> What I believe makes these problems particularly interesting is the 
> difference in execution plans between our test server running PostgreSQL 
> 7.3.6 and our production server running PostgreSQL 7.3.9.
> The test server is an upgraded "home machine", a Pentium 4 with 1GB of 
> memory and IDE disk.
> The production server is a dual CPU XEON Pentium 4 with 2GB memory and 
> SCSI disks.
> One should expect the production server to be faster, but apparently it is 
> not, as the query plans outlined below show.

I think the plans are fine; it looks to me like the production server
has serious table-bloat or index-bloat problems, probably because of
inadequate vacuuming.  For instance compare these entries:

->  Index Scan using ctp_statcon on statcon_tbl  (cost=0.00..6.01 rows=1 
width=4) (actual time=0.05..0.31 rows=39 loops=4)
  Index Cond: ((statcon_tbl.sctid = "outer".sctid) AND (statcon_tbl.ctpid = 
1))

->  Index Scan using ctp_statcon on statcon_tbl  (cost=0.00..20.40 rows=5 
width=4) (actual time=27.97..171.84 rows=39 loops=4)
  Index Cond: ((statcon_tbl.sctid = "outer".sctid) AND (statcon_tbl.ctpid = 
1))

Appears to be exactly the same task ... but the test server spent
1.24 msec total while the production server spent 687.36 msec total.
That's more than half of your problem right there.  Some of the other
scans seem a lot slower on the production machine too.

> 1) How come the query plans between the 2 servers are different?

The production server's rowcount estimates are pretty good, the test
server's are not.  How long since you vacuumed/analyzed the test server?

It'd be interesting to see the output of "vacuum verbose statcon_tbl"
on both servers ...
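
A minimal sketch of that check, plus the usual remedies to try if serious
bloat shows up (the two extra commands are a general suggestion only, and
both take exclusive locks on the table, so schedule them accordingly):

VACUUM VERBOSE statcon_tbl;
-- if the reported page counts are far larger than the live rows justify:
VACUUM FULL ANALYZE statcon_tbl;
REINDEX TABLE statcon_tbl;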

regards, tom lane

PS: if you post any more query plans, please try to use software that
doesn't mangle the formatting so horribly ...



Re: [PERFORM] Table stats

2005-05-05 Thread David Roussel
> Should there not be at least one Index Scan showing in the stats?
Not if there was a table scan.


Re: [PERFORM] COPY vs INSERT

2005-05-05 Thread Keith Worthington
Christopher Petrilli wrote:
On 5/4/05, Mischa Sandberg <[EMAIL PROTECTED]> wrote:
Quoting David Roussel <[EMAIL PROTECTED]>:

COPY invokes all the same logic as INSERT on the server side
(rowexclusive locking, transaction log, updating indexes, rules).
The difference is that all the rows are inserted as a single
transaction. This reduces the number of fsync's on the xlog,
which may be a limiting factor for you. You'll want to crank
WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though.
One of my streams has 6K records; I run with WB=1000, CS=128.
So what's the difference between a COPY and a batch of INSERT
statements?  Also, surely, fsyncs only occur at the end of a
transaction, no need to fsync before a commit has been issued,
right?
Sorry, I was comparing granularities the other way around. As far as
xlog is concerned, a COPY is ALWAYS one big txn, no matter how many
putline commands you use to feed the copy. With inserts, you can choose
whether to commit every row, every nth row, etc.
Copy makes better use of the TCP connection for transmission. COPY uses
the TCP connection like a one-way pipe. INSERT is like an RPC: the
sender has to wait until the insert's return status roundtrips.
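
For reference, the WB/CS shorthand in the quoted message above corresponds
to these postgresql.conf entries; the values are simply the ones quoted,
not general recommendations:

wal_buffers = 1000          # "WB"
checkpoint_segments = 128   # "CS"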

I have found even greater performance increases by using COPY FROM
'filename', not COPY FROM STDIN.  This allows the backend process to
directly read the file, rather than shoving it over a pipe (thereby
potentially hitting the CPU multiple times).  My experience is that
this is anywhere from 5-10x faster than INSERT statements on the
whole, and sometimes 200x.
Chris
Unfortunately, COPY FROM 'filename' can only be done by a superuser.  If 
you have that option then that is great.  If not...
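
One non-superuser alternative (a sketch, with hypothetical table and file
names) is to keep the file on the client and let psql stream it over the
connection as COPY FROM STDIN:

\copy sometable FROM 'data.tsv'

psql reads the file itself and feeds the rows down the wire, so it needs no
superuser rights, at the cost of the pipe overhead mentioned above.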

--
Kind Regards,
Keith


Re: [PERFORM] COPY vs INSERT

2005-05-05 Thread Jim C. Nasby
On Wed, May 04, 2005 at 10:22:56PM -0400, Tom Lane wrote:
> Also, there is a whole lot of one-time-per-statement overhead that can
> be amortized across many rows instead of only one.  Stuff like opening
> the target table, looking up the per-column I/O conversion functions,
> identifying trigger functions if any, yadda yadda.  It's not *that*
> expensive, but compared to an operation as small as inserting a single
> row, it's significant.

Has thought been given to supporting inserting multiple rows in a single
insert? DB2 supported:

INSERT INTO table VALUES
(1,2,3),
(4,5,6),
(7,8,9);

I'm not sure how standard that is or if other databases support it.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
