[PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Craig Ringer
Hi all

I'm encountering an odd issue with a bulk import query using PostgreSQL
8.3. After a 400,000 row import into a just-truncated table `booking', a
sequential scan run on the table in the same transaction is incredibly
slow, taking ~ 166738.047 ms. After a:
`COMMIT; BEGIN;'
the same query runs in 712.615 ms, with almost all the time difference
being in the sequential scan of the `booking' table [schema at end of post].

The table is populated by a complex pl/pgsql function that draws from
several other tables to convert data from another app's format. After
that function runs, here's what happens if I do a simple sequential
scan, then what happens after I commit and run it again:

craig=# explain analyze select * from booking;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Seq Scan on booking  (cost=0.00..9871.60 rows=320160 width=139) (actual time=0.017..166644.697 rows=341481 loops=1)
 Total runtime: 166738.047 ms
(2 rows)

craig=# commit; begin;
COMMIT
BEGIN
craig=# explain analyze select * from booking;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Seq Scan on booking  (cost=0.00..9871.60 rows=320160 width=139) (actual time=0.022..624.492 rows=341481 loops=1)
 Total runtime: 712.615 ms
(2 rows)


SET client_min_messages = 'debug5';

does not shed any more light; the only extra output is (eg):

craig=# select count(distinct id) from booking;
DEBUG:  StartTransactionCommand
DEBUG:  CommitTransactionCommand
 count
--------
 341481
(1 row)

... which took 164558.961 ms to run, or about 2 tuples per second.
[Table schema at end of post]. By comparison, after commit the same
query read about 500 tuples/second.

This issue appears to affect any query that results in a sequential scan
on the newly populated table, and also affects maintenance operations
like ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY that perform
sequential scans. ANALYZE is also really slow. I'm not yet sure if index
scans are affected.

I'm not using WAL logging.

It doesn't matter whether I truncate the table before the import using a
separate transaction or the same one that does the import.

I see essentially identical results, and runtimes, with other more
complex queries, but it seems to boil down to extremely slow sequential
scans.

The Linux 2.6.22 host these queries are running on runs PostgreSQL 8.3.
It has 4GB of RAM and shmmax set to 512MB. Tweaking the postgresql
memory parameters seems to make little difference, but the ones that I
adjusted away from defaults to see if this was a resource issue are:

shared_buffers = 32MB
temp_buffers = 128MB
work_mem = 128MB
maintenance_work_mem = 1024MB   # min 1MB

(There are relatively few clients to this database, but they work it hard).

Is this huge speed difference in sequential scans expected behavior? Any
idea what might be causing it?

I'm presently working around it by just committing the transaction after
the bulk import - but there's lots more work to do after that and it
leaves the database in a rather messy interim state.
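
Schematically, the work-around looks like this (the import function name here
is illustrative, not the real one):

BEGIN;
SELECT convert_bookings();  -- the pl/pgsql bulk import (illustrative name)
COMMIT;                     -- work-around: commit the import ...
BEGIN;                      -- ... and continue in a fresh transaction
-- remaining post-import work (constraints, fix-ups, reporting) now runs
-- at normal speed
COMMIT;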




Here's the table's schema, pasted as a quote to stop Thunderbird
mangling it. There are no rules on this table except those that might be
created internally by postgresql.

> craig=# \d booking
>                                      Table "public.booking"
>      Column      |           Type           |                      Modifiers
> -----------------+--------------------------+------------------------------------------------------
>  id              | integer                  | not null default nextval('booking_id_seq'::regclass)
>  customer_id     | integer                  | not null
>  edition_id      | integer                  | not null
>  description     | character varying(255)   | not null
>  position        | integer                  | not null
>  loading_applied | boolean                  | not null default false
>  loading_ratio   | numeric(16,4)            | not null
>  size_type       | integer                  | not null
>  size_length     | numeric(16,4)            |
>  base_price      | numeric(16,4)            | not null
>  gst_factor      | numeric(16,8)            | not null default gst_factor()
>  page_number     | integer                  |
>  invoiced        | timestamp with time zone |
>  contract_id     | integer                  |
>  old_customer_id | integer                  | not null
>  booked_time     | timestamp with time zone | not null
>  booked_by       | character varying(80)    | not null
>  cancelled       | boolean                  | not null default false
>  art_supplie

[PERFORM] Utility functions for enabling/disabling fkey triggers

2008-03-10 Thread Craig Ringer
Hi all

I've just spent some time working with PostgreSQL 8.3 trying to get a 90
minute job to run in a reasonable amount of time, and in the process
I've come up with something that I thought others might find useful.

Attached is a pair of PL/PgSQL functions that enable/disable the
triggers associated with a given foreign key constraint. They use the
system catalogs to obtain all the required information about the
involved tables. A fairly fast consistency check is performed before
re-enabling the triggers.

As it turns out I don't need it after all, but I thought that others
doing really large data imports might, given messages like:

http://archives.postgresql.org/pgsql-performance/2003-03/msg00157.php



I wrote it because I was frustrated with the slow execution of the ALTER
TABLE ... ADD CONSTRAINT ... FOREIGN KEY statements I was running to
rebuild the foreign key constraints on some of my tables after some bulk
imports. Leaving the constraints enabled was resulting in execution time
that increased for every record inserted, and rebuilding them after the
insert wasn't much faster.

Unfortunately it turns out that the issue wasn't with the way ALTER
TABLE ... ADD CONSTRAINT ... FOREIGN KEY was doing the check, as the
integrity check run by those functions is almost as slow as the ALTER
TABLE in the context of the transaction they're run in - and both run in
< 1 second outside of a transaction context or in a separate transaction.

Oh well, maybe the code will be useful to somebody anyway.

--
Craig Ringer
--
-- This file defines functions to (hopefully) reasonably safely enable and
-- disable enforcement of a foreign key constraint, written by Craig Ringer.
-- They're free for any use you care to make of them.
--
-- These functions work on the system that they're used on, but you
-- should still evaluate them for correctness and sanity before
-- adopting them yourself.
--
-- I make no guarantees that they won't destroy your data or steal your
-- lunch.
--


CREATE OR REPLACE FUNCTION disable_triggers_for_fkey_constraint(constraint_name VARCHAR) RETURNS void AS $$
DECLARE
tgrec RECORD;
relname VARCHAR;
constraint_type CHAR;
BEGIN
SELECT contype
INTO constraint_type
FROM pg_catalog.pg_constraint
WHERE pg_catalog.pg_constraint.conname = constraint_name;

IF constraint_type <> 'f' THEN
RAISE EXCEPTION 'Can only disable triggers for foreign key constraints';
END IF;

FOR tgrec IN SELECT tgname FROM pg_catalog.pg_trigger WHERE tgconstrname = constraint_name
LOOP
-- Obtain the name of the table this trigger affects. Foreign key
-- constraint triggers may affect the fkey or pkey tables and we have
-- to find out which in order to disable the constraint.
SELECT pg_catalog.pg_class.relname
INTO STRICT relname
FROM pg_catalog.pg_class INNER JOIN pg_catalog.pg_trigger
  ON pg_catalog.pg_trigger.tgrelid = pg_catalog.pg_class.oid
WHERE pg_catalog.pg_trigger.tgname=tgrec.tgname;

EXECUTE 'ALTER TABLE "'||relname||'" DISABLE TRIGGER "'||tgrec.tgname||'";';
END LOOP;
END;
$$ LANGUAGE 'plpgsql' VOLATILE;

COMMENT ON FUNCTION disable_triggers_for_fkey_constraint(VARCHAR) IS 'Disable enforcement of foreign key constraint $1';
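
--
-- Example usage (a sketch only; the constraint name below is hypothetical -
-- substitute one of your own foreign key constraints):
--
--   SELECT disable_triggers_for_fkey_constraint('booking_customer_id_fkey');
--   -- ... bulk load into the referencing table ...
--   SELECT enable_triggers_for_fkey_constraint('booking_customer_id_fkey');
--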


--
-- This stored procedure does a rapid check of the referential integrity protected by `constraint_name'
-- (MUCH faster than the incredibly slow one postgresql does during ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY ...)
-- then re-enables the triggers that enforce the constraint.
--
-- It only works on foreign keys with only one column involved.
--
CREATE OR REPLACE FUNCTION enable_triggers_for_fkey_constraint(constraint_name VARCHAR) RETURNS void AS $$
DECLARE
tgrec RECORD;
relname VARCHAR;
foreign_key_misses RECORD;
constraint_info RECORD;
fkey_table_name VARCHAR;
pkey_table_name VARCHAR;
fkey_col_list VARCHAR;
fkey_col_not_null_clause VARCHAR;
pkey_col_list VARCHAR;
colname VARCHAR; -- temporary variable
-- Used to control comma insertion in loops
first BOOLEAN;
-- Loop variables
i INTEGER;
-- Query text
q VARCHAR;
BEGIN

-- Look up the tables and columns that the foreign key involves
SELECT
contype,
conrelid, -- oid of referencing relation
confrelid, -- oid of referenced relation
(SELECT pg_catalog.pg_type.typname FROM pg_catalog.pg_type WHERE pg_catalog.pg_type.typrelid = conrelid) AS conrelid_name, -- name of referencing relation
(SELECT pg_catalog.pg_type.typname FROM pg_catalog.pg_type WHERE pg_catalog.pg_type.typrelid = confrelid) AS confrelid_name, -- name of referenced relation
pg_catalog.pg_constraint.conkey, -- Position of referencing column, eg {14}
pg_catalog.pg_constraint.confkey -- Position of referenced column, eg {1}
INTO STRICT constraint_info
FROM pg_catalog.pg_constraint
WHERE pg_catalog.pg_constraint.con

Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Heikki Linnakangas

Craig Ringer wrote:

I'm encountering an odd issue with a bulk import query using PostgreSQL
8.3. After a 400,000 row import into a just-truncated table `booking', a
sequential scan run on the table in the same transaction is incredibly
slow, taking ~ 166738.047 ms. After a:
`COMMIT; BEGIN;'
the same query runs in 712.615 ms, with almost all the time difference
being in the sequential scan of the `booking' table [schema at end of post].

The table is populated by a complex pl/pgsql function that draws from
several other tables to convert data from another app's format. 


You must be having an exception handler block in that pl/pgsql function, 
which implicitly creates a new subtransaction on each invocation of the 
exception handler block, so you end up with hundreds of thousands of 
committed subtransactions. For each row in the seq scan, the list of 
subtransactions is scanned, to see if the transaction that inserted the 
row is part of the current top-level transaction. That's fine for a 
handful of subtransactions, but it gets really slow with large numbers 
of them, as you've seen. It's an O(n^2) operation, where n is the number 
of rows inserted, so you'll be in even more trouble if the number of 
rows increases.


As a work-around, avoid using exception handlers, or process more than 1 
row per function invocation. Or COMMIT the transaction, as you did.
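

To make that concrete, here is a minimal sketch (table, constraint and
function names are made up): the first form enters a subtransaction for
every row, the second avoids the exception handler by checking for the
conflict explicitly. (The explicit check is not safe against concurrent
writers, but that is usually fine for a single bulk load.)

-- one subtransaction per call: slow in bulk
CREATE OR REPLACE FUNCTION insert_row_with_handler(p_id int, p_t text)
RETURNS void LANGUAGE plpgsql AS $$
begin
  INSERT INTO foo VALUES (p_id, p_t);
exception when unique_violation then
  null;  -- ignore duplicates; each call has already burned a subtransaction
end;
$$;

-- work-around: no exception block, so no subtransaction
CREATE OR REPLACE FUNCTION insert_row_checked(p_id int, p_t text)
RETURNS void LANGUAGE plpgsql AS $$
begin
  IF NOT EXISTS (SELECT 1 FROM foo WHERE id = p_id) THEN
    INSERT INTO foo VALUES (p_id, p_t);
  END IF;
end;
$$;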


For 8.4, it would be nice to improve that. I tested that on my laptop 
with a similarly-sized table, inserting each row in a pl/pgsql function 
with an exception handler, and I got very similar run times. According 
to oprofile, all the time is spent in TransactionIdIsInProgress. I think 
it would be pretty straightforward to store the committed subtransaction 
ids in a sorted array, instead of a linked list, and binary search. Or 
to use a hash table. That should eliminate this problem, though there are 
still other places as well where a large number of subtransactions will 
hurt performance.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Craig Ringer

Thanks for the extremely helpful response. I don't think I would've
spotted that one in a hurry.

You must be having an exception handler block in that pl/pgsql 
function, which implicitly creates a new subtransaction on each 
invocation of the exception handler block, so you end up with hundreds 
of thousands of committed subtransactions.


Aah - yes, there is. I didn't realize it'd have such an impact. I can
work around the need for it by explicitly checking the table constraints
in the function - in which case an uncaught exception will terminate the
transaction, but should only arise when I've missed a constraint check.

For 8.4, it would be nice to improve that. I tested that on my laptop 
with a similarly-sized table, inserting each row in a pl/pgsql 
function with an exception handler, and I got very similar run times. 
According to oprofile, all the time is spent in 
TransactionIdIsInProgress. I think it would be pretty straightforward 
to store the committed subtransaction ids in a sorted array, instead 
of a linked list, and binary search. Or to use a hash table. That 
should eliminate this problem, though there are still other places as 
well where a large number of subtransactions will hurt performance.


That does sound interesting - and it would be nice to be able to use
exception handlers this way without too huge a performance hit. In the
end though it's something that can be designed around once you're aware
of it - and I'm sure that other ways of storing that data have their own
different costs and downsides.

What might also be nice, and simpler, would be a `notice', `log', or 
even `debug1' level warning telling the user they've reached an absurd 
number of subtransactions that'll cripple PostgreSQL's performance - say
100,000. There's precedent for this in the checkpoint frequency warning
8.3 produces if checkpoints are becoming too frequent - and like that
warning it could be configurable for big sites. If you think that's sane
I might have a go at it - though I mostly work in C++ so the result
probably won't be too pretty initially.

--
Craig Ringer



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Craig Ringer

Heikki Linnakangas wrote:
You must be having an exception handler block in that pl/pgsql 
function, which implicitly creates a new subtransaction on each 
invocation of the exception handler block, so you end up with hundreds 
of thousands of committed subtransactions.
I've just confirmed that that was indeed the issue, and coding around 
the begin block dramatically cuts the runtimes of commands executed 
after the big import function.


Thanks again!

--
Craig Ringer



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Tom Lane
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
> For 8.4, it would be nice to improve that. I tested that on my laptop 
> with a similarly-sized table, inserting each row in a pl/pgsql function 
> with an exception handler, and I got very similar run times. According 
> to oprofile, all the time is spent in TransactionIdIsInProgress. I think 
> it would be pretty straightforward to store the committed subtransaction 
> ids in a sorted array, instead of a linked list, and binary search.

I think the OP is not complaining about the time to run the transaction
that has all the subtransactions; he's complaining about the time to
scan the table that it emitted.  Presumably, each row in the table has a
different (sub)transaction ID and so we are thrashing the clog lookup
mechanism.  It only happens once because after that the XMIN_COMMITTED
hint bits are set.

This probably ties into the recent discussions about eliminating the
fixed-size allocations for SLRU buffers --- I suspect it would've run
better if it could have scaled up the number of pg_clog pages held in
memory.

regards, tom lane



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Craig Ringer

Tom Lane wrote:

"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
For 8.4, it would be nice to improve that. I tested that on my laptop 
with a similarly-sized table, inserting each row in a pl/pgsql function 
with an exception handler, and I got very similar run times. According 
to oprofile, all the time is spent in TransactionIdIsInProgress. I think 
it would be pretty straightforward to store the committed subtransaction 
ids in a sorted array, instead of a linked list, and binary search.


I think the OP is not complaining about the time to run the transaction
that has all the subtransactions; he's complaining about the time to
scan the table that it emitted.


Yes, but only in succeeding statements in the same transaction as the 
procedure that creates all the subtransactions. Table scan times return 
to normal after that transaction commits.



Presumably, each row in the table has a
different (sub)transaction ID and so we are thrashing the clog lookup
mechanism.  It only happens once because after that the XMIN_COMMITTED
hint bits are set.


It seems to happen with every statement run in the same transaction as, 
and after, the procedure with all the subtransactions. As soon as a 
COMMIT is executed, operations return to normal speed. There's no 
significant delay on the first statement after COMMIT as compared to 
subsequent statements, nor do successive statements before the COMMIT 
get faster.


In other words, if I repeatedly run one of the statements I used in 
testing for my initial post, like:


EXPLAIN ANALYZE SELECT * FROM booking;

... after running the problem stored procedure, it takes just as long 
on the second, third, and subsequent runs as it does on the first.


As soon as I commit the transaction, the exact same statement returns to 
running in less than a second, and doesn't significantly change in 
runtime for subsequent executions.


I'll bang out a couple of examples at work tomorrow to see what I land 
up with, since this is clearly something that can benefit from a neat 
test case.


In any case, avoiding the use of an exception block per record generated 
worked around the performance issues, so it's clearly something to do 
with the vast numbers of subtransactions - as Heikki Linnakangas 
suggested and tested.


--
Craig Ringer



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Heikki Linnakangas

Tom Lane wrote:

"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
For 8.4, it would be nice to improve that. I tested that on my laptop 
with a similarly-sized table, inserting each row in a pl/pgsql function 
with an exception handler, and I got very similar run times. According 
to oprofile, all the time is spent in TransactionIdIsInProgress. I think 
it would be pretty straightforward to store the committed subtransaction 
ids in a sorted array, instead of a linked list, and binary search.


I think the OP is not complaining about the time to run the transaction
that has all the subtransactions; he's complaining about the time to
scan the table that it emitted.


If you read the original post carefully, he complained that the seq scan 
was slow when executed within the same transaction as populating the 
table, and fast if he committed in between.



 Presumably, each row in the table has a
different (sub)transaction ID and so we are thrashing the clog lookup
mechanism.  It only happens once because after that the XMIN_COMMITTED
hint bits are set.

This probably ties into the recent discussions about eliminating the
fixed-size allocations for SLRU buffers --- I suspect it would've run
better if it could have scaled up the number of pg_clog pages held in
memory.


I doubt that makes any noticeable difference in this case. 300,000 
transaction ids fit on < ~100 clog pages (each 8 kB clog page holds the 
2-bit commit status of 32,768 transactions), and the xmins on heap pages 
are nicely in order.


Getting rid of the fixed-size allocations would be nice for other 
reasons, of course.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Tom Lane
Craig Ringer <[EMAIL PROTECTED]> writes:
> It seems to happen with every statement run in the same transaction as, 
> and after, the procedure with all the subtransactions. As soon as a 
> COMMIT is executed, operations return to normal speed.

Ah.  I misread your post as saying that it happened only once.

regards, tom lane



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Heikki Linnakangas

Craig Ringer wrote:
I'll bang out a couple of examples at work tomorrow to see what I land 
up with, since this is clearly something that can benefit from a neat 
test case.


Here's what I used to reproduce this:

postgres=# BEGIN;
BEGIN
postgres=# CREATE TABLE foo (id int4,t text);
CREATE TABLE
postgres=# CREATE OR REPLACE FUNCTION insertfunc() RETURNS void LANGUAGE plpgsql AS $$
  begin
    INSERT INTO foo VALUES ( 1, repeat('a',110));
  exception when unique_violation THEN end;
$$;
CREATE FUNCTION
postgres=# SELECT COUNT(insertfunc()) FROM generate_series(1,300000);
 count
--------
 300000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT COUNT(*) FROM foo;
                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13595.93..13595.94 rows=1 width=0) (actual time=239535.904..239535.906 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..11948.34 rows=659034 width=0) (actual time=0.022..239133.898 rows=300000 loops=1)
 Total runtime: 239535.974 ms
(3 rows)


The oprofile output is pretty damning:

samples  %        symbol name
42148    99.7468  TransactionIdIsCurrentTransactionId

If you put a COMMIT right before "EXPLAIN ANALYZE..." it runs in < 1s.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Craig Ringer

Tom Lane wrote:

Craig Ringer <[EMAIL PROTECTED]> writes:
  
It seems to happen with every statement run in the same transaction as, 
and after, the procedure with all the subtransactions. As soon as a 
COMMIT is executed, operations return to normal speed.



Ah.  I misread your post as saying that it happened only once.

No worries - it's best to be sure.

Thanks for looking into it.

--
Craig Ringer




Re: [PERFORM] count * performance issue

2008-03-10 Thread Gregory Stark
"Tom Lane" <[EMAIL PROTECTED]> writes:

> Well, scanning an index to get a count might be significantly faster
> than scanning the main table, but it's hardly "instantaneous".  It's
> still going to take time proportional to the table size.

Hm, Mark's comment about bitmap indexes makes that not entirely true. A bitmap
index can do RLE compression which makes the relationship between the size of
the table and the time taken to scan the index more complex. In the degenerate
case where there are no concurrent updates (assuming you can determine that
quickly) it might actually be constant time.

> Unless they keep a central counter of the number of index entries;
> which would have all the same serialization penalties we've talked
> about before...

Bitmap indexes do in fact have concurrency issues -- arguably they're just a
baroque version of this central counter in this case.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Tom Lane
"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:
> The oprofile output is pretty damning:

> samples  %symbol name
> 4214899.7468  TransactionIdIsCurrentTransactionId

Oh, I have no doubt that that could eat a lot of cycles inside the
originating transaction ;-).  I just misread Craig's complaint as
being about the cost of the first table scan *after* that transaction.

Getting rid of the linked-list representation would be a win in a couple
of ways --- we'd not need the bogus "list of XIDs" support in pg_list.h,
and xactGetCommittedChildren would go away.  OTOH AtSubCommit_childXids
would get more expensive.

regards, tom lane



Re: [PERFORM] multi-threaded pgloader needs your tests

2008-03-10 Thread Dimitri Fontaine
Hi,

On Tuesday, 26 February 2008, Dimitri Fontaine wrote:
> You may remember some thread about data loading performances and
> multi-threading support in pgloader:
>   http://archives.postgresql.org/pgsql-performance/2008-02/msg00081.php

As people here have asked for the new features implemented in pgloader 
2.3.0, I'm happy to announce the availability of the new version!
  http://pgfoundry.org/projects/pgloader
  http://pgfoundry.org/forum/forum.php?forum_id=1283

Please consider this as a testbed related to the parallel COPY and pg_restore 
improvements which have been discussed here and on -hackers, as that's how 
those new features came to life.

Regards,
-- 
dim




Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Rainer Pruy
We experienced a similar degradation
when heavily using savepoints within a single transaction.
However, we have not yet had enough time to really investigate the issue.
It also was not directly reproducible using a (small) set of statements from a 
script.
As the overall scenario "bulk loads with sub-transactions" is close to the 
scenario we run, it might come down to the same cause, though.

Thus take my vote for a solution that does not end up with "don't use (sub-) 
transactions".

Regards,
Rainer

Craig Ringer schrieb:
> Thanks for the extremely helpful response. I don't think I would've
> spotted that one in a hurry.
> 
>> You must be having an exception handler block in that pl/pgsql
>> function, which implicitly creates a new subtransaction on each
>> invocation of the exception handler block, so you end up with hundreds
>> of thousands of committed subtransactions.
> 
> Aah - yes, there is. I didn't realize it'd have such an impact. I can
> work around the need for it by explicitly checking the table constraints
> in the function - in which case an uncaught exception will terminate the
> transaction, but should only arise when I've missed a constraint check.
> 
>> For 8.4, it would be nice to improve that. I tested that on my laptop
>> with a similarly-sized table, inserting each row in a pl/pgsql
>> function with an exception handler, and I got very similar run times.
>> According to oprofile, all the time is spent in
>> TransactionIdIsInProgress. I think it would be pretty straightforward
>> to store the committed subtransaction ids in a sorted array, instead
>> of a linked list, and binary search. Or to use a hash table. That
> should eliminate this problem, though there are still other places as
>> well where a large number of subtransactions will hurt performance.
> 
> That does sound interesting - and it would be nice to be able to use
> exception handlers this way without too huge a performance hit. In the
> end though it's something that can be designed around once you're aware
> of it - and I'm sure that other ways of storing that data have their own
> different costs and downsides.
> 
> What might also be nice, and simpler, would be a `notice', `log', or
> even `debug1' level warning telling the user they've reached an absurd
> number of subtransactions that'll cripple PostgreSQL's performance - say
> 100,000. There's precedent for this in the checkpoint frequency warning
> 8.3 produces if checkpoints are becoming too frequent - and like that
> warning it could be configurable for big sites. If you think that's sane
> I might have a go at it - though I mostly work in C++ so the result
> probably won't be too pretty initially.
> 
> -- 
> Craig Ringer
> 

-- 
Rainer Pruy
Managing Director

Acrys Consult GmbH & Co. KG
Untermainkai 29-30, D-60329 Frankfurt
Tel: +49-69-244506-0 - Fax: +49-69-244506-50
Web: http://www.acrys.com -  Email: [EMAIL PROTECTED]
Commercial register: Frankfurt am Main, HRA 31151

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] count * performance issue

2008-03-10 Thread Joe Mirabal
Gregory,

I just joined this listserv and was happy to see this posting.  I have a
400GB table that I have indexed (building the index took 27 hours), and loading
the table with 10 threads took 9 hours.  I run queries on the data and get
immediate max and min as well as other aggregate functions very quickly;
however, a select count(*) of the table takes forever, usually nearly an hour
or more.

Do you have any tuning recommendations?  We in our warehouse use
count(*) as our verification of counts by day/month etc., and in Netezza it's
immediate.  I tried adding oids, BUT what I learned was that
adding the oids to the table adds a significant amount of space to the data
AND the index.

As you may gather from this, we are relatively new to Postgres.

Any suggestions you can give me would be most helpful.

Cheers,
Joe

On Mon, Mar 10, 2008 at 11:16 AM, Gregory Stark <[EMAIL PROTECTED]>
wrote:

> "Tom Lane" <[EMAIL PROTECTED]> writes:
>
> > Well, scanning an index to get a count might be significantly faster
> > than scanning the main table, but it's hardly "instantaneous".  It's
> > still going to take time proportional to the table size.
>
> Hm, Mark's comment about bitmap indexes makes that not entirely true. A
> bitmap
> index can do RLE compression which makes the relationship between the size
> of
> the table and the time taken to scan the index more complex. In the
> degenerate
> case where there are no concurrent updates (assuming you can determine
> that
> quickly) it might actually be constant time.
>
> > Unless they keep a central counter of the number of index entries;
> > which would have all the same serialization penalties we've talked
> > about before...
>
> Bitmap indexes do in fact have concurrency issues -- arguably they're just
> a
> baroque version of this central counter in this case.
>
> --
>  Gregory Stark
>  EnterpriseDB  http://www.enterprisedb.com
>  Ask me about EnterpriseDB's Slony Replication support!
>



-- 
Mirabili et Veritas
Joe Mirabal


Re: [PERFORM] count * performance issue

2008-03-10 Thread Bill Moran
In response to "Joe Mirabal" <[EMAIL PROTECTED]>:

> Gregory,
> 
> I just joined this listserv and was happy to see this posting.  I have a
> 400GB table that I have indexed (building the index took 27 hours), and loading
> the table with 10 threads took 9 hours.  I run queries on the data and get
> immediate max and min as well as other aggregate functions very quickly;
> however, a select count(*) of the table takes forever, usually nearly an hour
> or more.
> 
> Do you have any tuning recommendations?  We in our warehouse use
> count(*) as our verification of counts by day/month etc., and in Netezza it's
> immediate.  I tried adding oids, BUT what I learned was that
> adding the oids to the table adds a significant amount of space to the data
> AND the index.
> 
> As you may gather from this, we are relatively new to Postgres.
> 
> Any suggestions you can give me would be most helpful.

One approach to this problem is to create triggers that keep track of
the total count whenever rows are added or deleted.  This adds some
overhead to the update process, but the correct row count is always
quickly available.
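
A minimal sketch of the trigger approach (table and trigger names are
hypothetical, and note that all writers will serialize on the counter row):

CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);

CREATE OR REPLACE FUNCTION maintain_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = TG_TABLE_NAME;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE row_counts SET n = n - 1 WHERE table_name = TG_TABLE_NAME;
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

-- seed the counter once, then attach the trigger
INSERT INTO row_counts SELECT 'mytable', count(*) FROM mytable;

CREATE TRIGGER mytable_count AFTER INSERT OR DELETE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE maintain_count();

-- reading the count is now a single-row lookup instead of a seq scan
SELECT n FROM row_counts WHERE table_name = 'mytable';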

Another is to use EXPLAIN to get an estimate of the # of rows from
the planner.  This works well if an estimate is acceptable, but can't
be trusted for precise counts.

Some searches through the archives should turn up details on these
methods.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023



Re: [PERFORM] count * performance issue

2008-03-10 Thread Greg Smith

On Mon, 10 Mar 2008, Bill Moran wrote:


Some searches through the archives should turn up details on these
methods.


I've collected up what looked like the best resources on this topic into 
the FAQ entry at http://www.postgresqldocs.org/index.php/Slow_Count


General Bits has already done two good summary articles here and I'd think 
wading through the archives directly shouldn't be necessary.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Simon Riggs
On Mon, 2008-03-10 at 11:01 +, Heikki Linnakangas wrote:
> According 
> to oprofile, all the time is spent in TransactionIdIsInProgress. 

I recently submitted a patch to optimise this. Your comments would be
welcome on the patch.

-- 
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com 

  PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk




Re: [PERFORM] count * performance issue

2008-03-10 Thread Scott Marlowe
On Mon, Mar 10, 2008 at 1:54 PM, Joe Mirabal <[EMAIL PROTECTED]> wrote:
> Gregory,
>
> I just joined this listserv and was happy to see this posting.  I have a
> 400GB table that I have indexed (building the index took 27 hours), and loading
> the table with 10 threads took 9 hours.  I run queries on the data and get
> immediate max and min as well as other aggregate functions very quickly;
> however, a select count(*) of the table takes forever, usually nearly an hour
> or more.
>
> Do you have any tuning recommendations?  We in our warehouse use
> count(*) as our verification of counts by day/month etc., and in Netezza it's
> immediate.  I tried adding oids, BUT what I learned was that
> adding the oids to the table adds a significant amount of space to the data
> AND the index.

Yeah, this is a typical problem people run into with MVCC databases to
one extent or another.  PostgreSQL has no native way to just make it
faster.  However, if it's a table with wide rows, you can use a lookup
table to help a bit.  Have a FK with cascading deletes from the master
table to a table that just holds the PK for it, and do count(*) on
that table.
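
Roughly like this (names are hypothetical; inserts into the narrow table
have to come from a trigger or from the application itself):

CREATE TABLE master_ids (
    id integer PRIMARY KEY REFERENCES master (id) ON DELETE CASCADE
);

-- deletes propagate through the cascade; count(*) now scans a much
-- narrower table than the wide master table
SELECT count(*) FROM master_ids;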

Otherwise, you have the trigger solution mentioned previously.

Also, if you only need an approximate count, then you can use the
system tables to get that with something like

select reltuples from pg_class where relname='tablename';

after an analyze.  It won't be 100% accurate, but it will be pretty
close most of the time.



Re: [PERFORM] count * performance issue

2008-03-10 Thread Greg Smith

On Mon, 10 Mar 2008, Joe Mirabal wrote:

I run queries on the data and get immediate max and min as well as other 
aggregate functions very quickly; however, a select count(*) of the table 
takes forever, usually nearly an hour or more.


Are you sure the form of "select count(*)" you're using is actually 
utilizing the index to find a useful subset?  What do you get out of 
EXPLAIN ANALYZE on the query?


In order for indexes to be helpful a couple of things need to happen:
1) They have to be structured correctly to be useful
2) There need to be large enough settings for shared_buffers and 
effective_cache_size that the database thinks it can use them efficiently
3) The tables involved need to be ANALYZEd to keep their statistics up to 
date.


The parameters to run a 400GB *table* are very different from the 
defaults; if you want tuning suggestions you should post the non-default 
entries in your postgresql.conf file from what you've already adjusted 
along with basic information about your server (PostgreSQL version, OS, 
memory, disk setup).


We in our warehouse use the count(*) as our verification of counts by 
day/month's etc


If you've got a database that size and you're doing that sort of thing on 
it, you really should be considering partitioning as well.


 --
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] UPDATE 66k rows too slow

2008-03-10 Thread Miguel Arroz

Hi!

  I read and tried many of the things you pointed me to. Raised shared buffers
to 180 MB, and tried again. Same results.


  I deleted the DB, created a new one and generated new test data. I
now have 72k rows, and the same query finishes in... 9 seconds.


  I'm totally clueless. Anyway, two questions:

  1) My working_mem is 2 MB. Does an UPDATE query like mine depend on
working_mem?


  2) I still feel this is all very trial-and-error. Change value, run
query, hope it solves the problem. Well, the DB itself knows what it is
doing. Isn't there any way to make it tell us that? Like "the working
mem is too low" or anything else. I know the problem is not the
checkpoints, at least nothing appears in the log related to that. But
it irritates me to be in front of such a complex system and not being
able to know what's going on.


  Yours

Miguel Arroz

On 2008/03/10, at 05:10, Greg Smith wrote:


On Mon, 10 Mar 2008, Miguel Arroz wrote:

My question is, how can I "ask" PgSQL what's happening? How can I  
avoid guessing, and be sure of what is causing this slowdown?


There are many pieces involved here, and any one or multiple of them  
could be to blame.  Someone may make a guess and get lucky about the  
cause, but the only generic way to solve this sort of thing is to  
have a systematic approach that goes through the likely possible  
causes one by one until you've discovered the source of the  
problem.  Since as you say you're new to this, you've got the double  
task of learning that outline and then finding out how to run each  
of the tests.


For your particular case, slow updates, I usually follow the  
following series of tests.  I happen to have articles on most of  
these sitting around because they're common issues:


-Confirm disks are working as expected: 
http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm

-Look at differences between fsync commit behavior between the two  
systems.  It's often the case that when servers appear slower than  
development systems it's because the server is doing fsync properly,  
while the development one is caching fsync in a way that is unsafe  
for database use but much faster. http://www.postgresql.org/docs/8.3/static/wal-reliability.html 
 is a brief intro to this while http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm 
 goes into extreme detail.  The test_fsync section there is probably  
the most useful one for your comparison.


-Setup basic buffer memory parameters: 
http://www.westnet.com/~gsmith/content/postgresql/pg-5minute.htm

-VACUUM VERBOSE ANALYZE and make sure that's working properly.  This  
requires actually understanding the output from that command which  
is "fun" to figure out.  A related topic is looking for index bloat  
which I haven't found a good tutorial on yet.


-Investigate whether checkpoints are to blame.  Since you're running  
8.3 you can just turn on log_checkpoints and see how often they're  
showing up and get an idea how big the performance impact is.   
Increasing checkpoint_segments is the usual first thing to do if  
this is the case.


-Collect data with vmstat, iostat, and top to figure out what's  
happening during the problem query


-Look for application problems (not your issue here)

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com  
Baltimore, MD




Miguel Arroz
http://www.terminalapp.net
http://www.ipragma.com







Re: [PERFORM] UPDATE 66k rows too slow

2008-03-10 Thread Joshua D. Drake

On Mon, 10 Mar 2008 23:17:54 +
Miguel Arroz <[EMAIL PROTECTED]> wrote:

> Hi!
> 
> I read and tried many of the things you pointed me to. Raised shared
> buffers to 180 MB, and tried again. Same results.
> 
> I deleted the DB, created a new one and generated new test data.
> I now have 72k rows, and the same query finishes in... 9 seconds.
> 
> I'm totally clueless. Anyway, two questions:
> 
> 1) My working_mem is 2 MB. Does an UPDATE query like mine depend
> on working_mem?
> 
> 2) I still feel this is all very trial-and-error. Change value,
> run query, hope it solves the problem. Well, the DB itself knows what
> it is doing. Isn't there any way to make it tell us that? Like "the
> working mem is too low" or anything else. I know the problem is not
> the checkpoints, at least nothing appears in the log related to that.
> But it irritates me to be in front of such a complex system and not
> being able to know what's going on.

What does iostat -k 1 tell you during the 9 seconds the query is
running?

Joshua D. Drake



-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
  PostgreSQL political pundit | Mocker of Dolphins




Re: [PERFORM] UPDATE 66k rows too slow

2008-03-10 Thread Miguel Arroz

Hi!

  It has now gone up to 40 seconds... here is the result of iostat:

iostat -K -c 40
  tty ad4  ad6 cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   1   78 32.86  34  1.08   0.70   0  0.00  13  0  1  0 86
   0  180  6.00   4  0.02   0.00   0  0.00   0  0  0  0 100
   1   63 39.74  62  2.40   0.00   0  0.00  17  0  1  0 82
   0   60 18.69 815 14.87   0.00   0  0.00  20  0  2  0 79
   0   60 56.17 293 16.06   0.00   0  0.00  41  0  5  0 53
   0   60 55.74 396 21.53   0.00   0  0.00  39  0 10  0 51
   0   60 42.24 357 14.71   0.00   0  0.00  10  0  2  0 88
   0   60 42.92 354 14.82   0.00   0  0.00  12  0  7  1 80
   0   60 38.51 368 13.82   0.00   0  0.00  14  0  6  0 80
   0   60 43.83 326 13.94   0.00   0  0.00   4  0  1  0 95
   0   60 33.30 395 12.83   0.00   0  0.00  11  0  3  0 86
   0   60 41.36 395 15.94   0.00   0  0.00   4  0  3  0 93
   0   60 21.97 684 14.68   0.00   0  0.00  10  0  2  0 88
   0   60 72.44 297 20.99   0.00   0  0.00  42  0  9  0 48
   0   60 38.18 453 16.87   0.00   0  0.00  23  0  8  1 68
   0   60 35.15 365 12.52   0.00   0  0.00   1  0  1  0 97
   0   60 44.40 396 17.15   0.00   0  0.00  17  0  6  0 77
   0   60 43.99 341 14.64   0.00   0  0.00   4  0  2  0 93
   0   60 33.53 440 14.39   0.00   0  0.00  10  0  5  0 85
   0   60 31.22 345 10.51   0.00   0  0.00   0  0  2  0 97
  tty ad4  ad6 cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   60 33.48 449 14.66   0.00   0  0.00  11  0  3  0 86
   0  180 16.85 599  9.87   0.00   0  0.00   1  0  1  0 98
   0   60 55.37 455 24.58   0.00   0  0.00  25  0  4  1 69
   0   60 49.83 376 18.28   0.00   0  0.00  18  0  5  1 76
   0   60 29.86 363 10.58   0.00   0  0.00   3  0  0  1 96
   0   60 36.21 365 12.90   0.00   0  0.00  12  0  3  1 84
   0   60 33.13 353 11.41   0.00   0  0.00   2  0  2  0 96
   0   60 39.47 345 13.28   0.00   0  0.00  16  0  3  0 80
   0   60 40.48 363 14.34   0.00   0  0.00   8  0  2  0 89
   0   60 30.91 397 11.97   0.00   0  0.00   5  0  2  0 93
   0   60 18.21 604 10.75   0.00   0  0.00   5  0  2  0 93
   0   60 48.65 359 17.04   0.00   0  0.00  20  0  6  0 74
   0   60 32.91 375 12.04   0.00   0  0.00  10  0  4  0 86
   0   60 35.81 339 11.84   0.00   0  0.00   3  0  2  0 96
   0   60 33.38 394 12.83   0.00   0  0.00  11  0  4  0 85
   0   60 34.40 313 10.51   0.00   0  0.00   4  0  2  0 93
   0   60 45.65 358 15.94   0.00   0  0.00  19  0  7  0 74
   0   60 37.41 309 11.28   0.00   0  0.00   3  0  2  0 95
   0   60 32.61 447 14.22   0.00   0  0.00  10  0  3  1 86
   0   60 17.11 516  8.63   0.00   0  0.00   1  0  1  0 98

  There's surely a lot of disk activity going on. With these figures,  
I could have written some hundred gigabytes during the query  
execution! Something is definitely not right here.


  Yours

Miguel Arroz

On 2008/03/10, at 23:22, Joshua D. Drake wrote:



On Mon, 10 Mar 2008 23:17:54 +
Miguel Arroz <[EMAIL PROTECTED]> wrote:


Hi!

  I read and tried many of the things you pointed me to. Raised shared
buffers to 180 MB, and tried again. Same results.

  I deleted the DB, created a new one and generated new test data.
I now have 72k rows, and the same query finishes in... 9 seconds.

  I'm totally clueless. Anyway, two questions:

  1) My working_mem is 2 MB. Does an UPDATE query like mine depend
on working_mem?

  2) I still feel this is all very trial-and-error. Change value,
run query, hope it solves the problem. Well, the DB itself knows what
it is doing. Isn't there any way to make it tell us that? Like "the
working mem is too low" or anything else. I know the problem is not
the checkpoints, at least nothing appears in the log related to that.
But it irritates me to be in front of such a complex system and not
being able to know what's going on.


What does iostat -k 1 tell you during the 9 seconds the query is
running?

Joshua D. Drake



--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
 PostgreSQL political pundit | Mocker of Dolphins




Miguel Arroz
http://www.terminalapp.net
http://www.ipragma.com







Re: [PERFORM] UPDATE 66k rows too slow

2008-03-10 Thread Joshua D. Drake

On Mon, 10 Mar 2008 23:46:10 +
Miguel Arroz <[EMAIL PROTECTED]> wrote:
tty ad4  ad6 cpu
>   tin tout  KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
> 0   60 33.48 449 14.66   0.00   0  0.00  11  0  3  0 86
> 0  180 16.85 599  9.87   0.00   0  0.00   1  0  1  0 98
> 0   60 55.37 455 24.58   0.00   0  0.00  25  0  4  1 69
> 0   60 49.83 376 18.28   0.00   0  0.00  18  0  5  1 76
> 0   60 29.86 363 10.58   0.00   0  0.00   3  0  0  1 96
> 0   60 36.21 365 12.90   0.00   0  0.00  12  0  3  1 84
> 0   60 33.13 353 11.41   0.00   0  0.00   2  0  2  0 96
> 0   60 39.47 345 13.28   0.00   0  0.00  16  0  3  0 80
> 0   60 40.48 363 14.34   0.00   0  0.00   8  0  2  0 89
> 0   60 30.91 397 11.97   0.00   0  0.00   5  0  2  0 93
> 0   60 18.21 604 10.75   0.00   0  0.00   5  0  2  0 93
> 0   60 48.65 359 17.04   0.00   0  0.00  20  0  6  0 74
> 0   60 32.91 375 12.04   0.00   0  0.00  10  0  4  0 86
> 0   60 35.81 339 11.84   0.00   0  0.00   3  0  2  0 96
> 0   60 33.38 394 12.83   0.00   0  0.00  11  0  4  0 85
> 0   60 34.40 313 10.51   0.00   0  0.00   4  0  2  0 93
> 0   60 45.65 358 15.94   0.00   0  0.00  19  0  7  0 74
> 0   60 37.41 309 11.28   0.00   0  0.00   3  0  2  0 95
> 0   60 32.61 447 14.22   0.00   0  0.00  10  0  3  1 86
> 0   60 17.11 516  8.63   0.00   0  0.00   1  0  1  0 98
> 
>There's surely a lot of disk activity going on. With these
> figures, I could have written some hundred gigabytes during the
> query execution! Something is definitely not right here.


Well, the above says you are getting roughly 10-15 MB/s of write throughput.
What is the disk subsystem you have? Also note that the duration
probably went up because you didn't vacuum between tests.

What version of PostgreSQL (I missed it).

Joshua D. Drake 



-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
  PostgreSQL political pundit | Mocker of Dolphins




Re: [PERFORM] UPDATE 66k rows too slow

2008-03-10 Thread Miguel Arroz

Hi!

  The disk subsystem will be a RAID 1, but for now it's just a single  
7200 rpm 160 GB SATA hard drive. The PgSQL version is 8.3, the latest  
one.


  I have done some performance tests on the drive, and it handles  
about 40 MB/s on sequential writes, so I'm assuming it's OK.


  Yours

Miguel Arroz

On 2008/03/10, at 23:56, Joshua D. Drake wrote:



On Mon, 10 Mar 2008 23:46:10 +
Miguel Arroz <[EMAIL PROTECTED]> wrote:
   tty ad4  ad6 cpu

 tin tout  KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   60 33.48 449 14.66   0.00   0  0.00  11  0  3  0 86
   0  180 16.85 599  9.87   0.00   0  0.00   1  0  1  0 98
   0   60 55.37 455 24.58   0.00   0  0.00  25  0  4  1 69
   0   60 49.83 376 18.28   0.00   0  0.00  18  0  5  1 76
   0   60 29.86 363 10.58   0.00   0  0.00   3  0  0  1 96
   0   60 36.21 365 12.90   0.00   0  0.00  12  0  3  1 84
   0   60 33.13 353 11.41   0.00   0  0.00   2  0  2  0 96
   0   60 39.47 345 13.28   0.00   0  0.00  16  0  3  0 80
   0   60 40.48 363 14.34   0.00   0  0.00   8  0  2  0 89
   0   60 30.91 397 11.97   0.00   0  0.00   5  0  2  0 93
   0   60 18.21 604 10.75   0.00   0  0.00   5  0  2  0 93
   0   60 48.65 359 17.04   0.00   0  0.00  20  0  6  0 74
   0   60 32.91 375 12.04   0.00   0  0.00  10  0  4  0 86
   0   60 35.81 339 11.84   0.00   0  0.00   3  0  2  0 96
   0   60 33.38 394 12.83   0.00   0  0.00  11  0  4  0 85
   0   60 34.40 313 10.51   0.00   0  0.00   4  0  2  0 93
   0   60 45.65 358 15.94   0.00   0  0.00  19  0  7  0 74
   0   60 37.41 309 11.28   0.00   0  0.00   3  0  2  0 95
   0   60 32.61 447 14.22   0.00   0  0.00  10  0  3  1 86
   0   60 17.11 516  8.63   0.00   0  0.00   1  0  1  0 98

  There's surely a lot of disk activity going on. With these
figures, I could have written some hundred gigabytes during the
query execution! Something is definitely not right here.



Well, the above says you are getting roughly 10-15 MB/s of write throughput.
What is the disk subsystem you have? Also note that the duration
probably went up because you didn't vacuum between tests.

What version of PostgreSQL (I missed it).

Joshua D. Drake



--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
 PostgreSQL political pundit | Mocker of Dolphins




Miguel Arroz
http://www.terminalapp.net
http://www.ipragma.com







Re: [PERFORM] Nested loop vs merge join: inconsistencies between estimated and actual time

2008-03-10 Thread Vlad Arkhipov

Tom Lane writes:

Vlad Arkhipov <[EMAIL PROTECTED]> writes:
  

I came across this issue while writing a report-like query for 2 not
very large tables. I've tried several methods to resolve this one (see
below). But now I'm really stuck...



It looks like you are wishing to optimize for all-in-memory situations,
in which case the traditional advice is to reduce random_page_cost to
something close to 1.  AFAICS all the rowcount estimates you're seeing
are spot on, or as close to spot on as you could realistically hope for,
and so the problem lies with the cost parameters.  Fooling with the
statistics is not going to help if the rowcount estimates are already
good.
  


I tried to change random_page_cost to 1.1 or something close to it and 
increase/decrease effective_cache_size. But Postgres always prefers a plan 
with a merge join.


Re: [PERFORM] count * performance issue

2008-03-10 Thread Robins Tharakan
Hi,

I have been reading this conversation for a few days now and I just wanted
to ask this. From the release notes, one of the new additions in 8.3 is
(Allow col IS NULL to use an index (Teodor)).

Sorry, if I am missing something here, but shouldn't something like this
allow us to get a (fast) accurate count ?

SELECT COUNT(*) from table WHERE indexed_field IS NULL
+
SELECT COUNT(*) from table WHERE indexed_field IS NOT NULL

*Robins Tharakan*

-- Forwarded message --
From: Greg Smith <[EMAIL PROTECTED]>
Date: Tue, Mar 11, 2008 at 4:31 AM
Subject: Re: [PERFORM] count * performance issue
To: Joe Mirabal <[EMAIL PROTECTED]>
Cc: pgsql-performance@postgresql.org


On Mon, 10 Mar 2008, Joe Mirabal wrote:

> I run queries on the data and get immediate max and min as well as other
> aggregate functions very quickly; however, a select count(*) of the table
> takes forever, usually nearly an hour or more.

Are you sure the form of "select count(*)" you're using is actually
utilizing the index to find a useful subset?  What do you get out of
EXPLAIN ANALYZE on the query?

In order for indexes to be helpful a couple of things need to happen:
1) They have to be structured correctly to be useful
2) There need to be large enough settings for shared_buffers and
effective_cache_size that the database thinks it can use them efficiently
3) The tables involved need to be ANALYZEd to keep their statistics up to
date.

The parameters needed to handle a 400GB *table* are very different from the
defaults; if you want tuning suggestions you should post the non-default
entries you've already adjusted in your postgresql.conf file, along with
basic information about your server (PostgreSQL version, OS, memory, disk
setup).
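
One easy way to list just the non-default entries is to ask the server
itself (a small sketch; pg_settings is a standard system view):

SELECT name, setting, source
FROM pg_settings
WHERE source NOT IN ('default', 'override');
-- 'source' records where each value came from (configuration file, session,
-- etc.), so this shows everything changed from the shipped defaults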

> We in our warehouse use the count(*) as our verification of counts by
> day/month's etc

If you've got a database that size and you're doing that sort of thing on
it, you really should be considering partitioning as well.

 --
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] count * performance issue

2008-03-10 Thread Mark Mielke

Robins Tharakan wrote:

Hi,

I have been reading this conversation for a few days now and I just 
wanted to ask this. From the release notes, one of the new additions 
in 8.3 is (Allow col IS NULL to use an index (Teodor)).


Sorry if I am missing something here, but shouldn't something like
this allow us to get a (fast) accurate count?


SELECT COUNT(*) from table WHERE indexed_field IS NULL
+
SELECT COUNT(*) from table WHERE indexed_field IS NOT NULL


For PostgreSQL: you still don't know whether the row is visible until
you check the row itself. Whether the field is NULL or NOT NULL does not
change that.


Cheers,
mark

--
Mark Mielke <[EMAIL PROTECTED]>



Re: [PERFORM] count * performance issue

2008-03-10 Thread Joshua D. Drake
On Tue, 11 Mar 2008 08:27:05 +0530
"Robins Tharakan" <[EMAIL PROTECTED]> wrote:

> SELECT COUNT(*) from table WHERE indexed_field IS NULL
> +
> SELECT COUNT(*) from table WHERE indexed_field IS NOT NULL

If the selectivity is appropriate, yes. However, if you have 1 million
rows and 200k of those rows are null (or not null), it is still going
to seqscan.
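
A quick way to check the selectivity the planner sees (a sketch; the table
and column names are placeholders):

SELECT null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'table' AND attname = 'indexed_field';
-- a null_frac close to 0 (or close to 1) means one of the two counts is
-- selective enough for the index to beat a seqscan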

joshua d. drake

-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL SPI Liaison | SPI Director |  PostgreSQL political pundit





Re: [PERFORM] count * performance issue

2008-03-10 Thread Scott Marlowe
On Mon, Mar 10, 2008 at 7:57 PM, Robins Tharakan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have been reading this conversation for a few days now and I just wanted
> to ask this. From the release notes, one of the new additions in 8.3 is
> (Allow col IS NULL to use an index (Teodor)).
>
> Sorry, if I am missing something here, but shouldn't something like this
> allow us to get a (fast) accurate count ?
>
> SELECT COUNT(*) from table WHERE indexed_field IS NULL
>  +
> SELECT COUNT(*) from table WHERE indexed_field IS NOT NULL

It really depends on the distribution of the null / not nulls in the
table.  If it's 50/50 there's no advantage to using the index, as you
still have to check visibility info in the table itself.

OTOH, if NULLs (or conversely not-nulls) are rare, then yes, the index
can help.  I.e. if 1% of the tuples are null, the select count(*) from
table where field is null can use the index efficiently.
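
For that rare-NULL case, a partial index is one more way to keep the lookup
cheap (a sketch with placeholder names, not something discussed above):

CREATE INDEX some_table_field_null_idx ON some_table (field)
WHERE field IS NULL;
-- only the rare NULL rows get indexed, so the index stays tiny and the
-- count(*) ... WHERE field IS NULL query touches very few pages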



Re: [PERFORM] UPDATE 66k rows too slow

2008-03-10 Thread Greg Smith

On Mon, 10 Mar 2008, Miguel Arroz wrote:

I deleted the DB, created a new one and generated new test data. I now have 
72k rows, and the same query finishes in... 9 seconds.


This seems like more evidence that your problem here is related to dead 
rows (this is what Andrew suggested).  If a fresh copy of the database 
runs fast but it quickly degrades as you run additional tests that do many 
updates on it, that's a popular suspect.


Are you familiar with dead rows?  When you update something, the original 
copy doesn't go away; it stays behind until VACUUM gets to cleaning it up. 
If you update the same rows, say, 10 times you'll have 9 dead copies of 
every row in the way of doing reports on the ones still alive.
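
If you're on 8.3, the statistics collector can show that balance directly
(a sketch; the n_live_tup/n_dead_tup columns are new in 8.3):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'text_answer_mapping_ebt';
-- a large n_dead_tup relative to n_live_tup means your scans are wading
-- through dead row versions that VACUUM has not reclaimed yet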


Let's go back to your original post a second:

Seq Scan on text_answer_mapping_ebt  (cost=0.00..13945.72 rows=265072 
width=92) (actual time=21.123..1049.054 rows=66268 loops=1)


That shows the database estimating there are exactly 4 times your 66268 
rows there (4 x 66268 = 265072).  That sounds like one active copy of your 
data and 3 dead ones left behind from earlier tests.  In that case, it 
would take much longer to do that full scan than when the database was 
fresh.


1) My work_mem is 2 MB. Does an UPDATE query like mine depend on 
work_mem?


Nope.  That's used for sorting and that sort of thing.

Well, the DB itself knows what it is doing. Isn't there any way to make it 
tell us that?


Well, the database server itself has a lot of operating system and 
hardware components it relies on, and it has no idea how any of those are 
working.  So it's unreasonable to expect in every case the database has a 
clue what's going on.


In your case, I suspect the report most likely to say something 
interesting here is the 4th item on the list I sent before: looking at 
the VACUUM VERBOSE ANALYZE output for a problem.


Here's the educational exercise I'd suggest that might help you track down 
what's going on here:


1) Recreate a fresh copy of the database.  Run VACUUM VERBOSE ANALYZE and 
save a copy of the output so you know what that looks like with no dead 
rows.

2) Run your query with EXPLAIN ANALYZE and save that too.  Should be fast.
3) Do whatever testing it is you do that seems to result in the system 
running much slower

4) Save the EXPLAIN ANALYZE output when you've reached slowness
5) Run a VACUUM VERBOSE ANALYZE, save that for comparison to the earlier one
6) Run the EXPLAIN ANALYZE again to see if (5) did anything useful.
7) Run VACUUM FULL VERBOSE and save that output
8) Run the EXPLAIN ANALYZE again to see if (7) did anything useful.

Comparing the VACUUM reports and the EXPLAIN plans to see what changes 
along the way should give you some good insight into what's happening 
here.  That is what you're asking for--asking the database to tell you 
what it's doing--but actually understanding that report takes a certain 
amount of study.
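
A compact sketch of that before/after comparison, using the table from your
plan and a count(*) as a stand-in for the real query:

VACUUM VERBOSE ANALYZE text_answer_mapping_ebt;
-- the VERBOSE output reports the dead row versions and pages found on each pass
EXPLAIN ANALYZE SELECT count(*) FROM text_answer_mapping_ebt;
-- compare the total runtime and the estimated vs. actual row counts between
-- the fresh run and the run taken after things have slowed down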


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Very slow (2 tuples/second) sequential scan after bulk insert; speed returns to ~500 tuples/second after commit

2008-03-10 Thread Pavan Deolasee
On Mon, Mar 10, 2008 at 4:31 PM, Heikki Linnakangas
<[EMAIL PROTECTED]> wrote:
> According
>  to oprofile, all the time is spent in TransactionIdIsInProgress. I think
>  it would be pretty straightforward to store the committed subtransaction
>  ids in a sorted array, instead of a linked list, and binary search.

Assuming that in most cases there will be many committed and few aborted
subtransactions, how about storing the list of *aborted* subtransactions?


Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
