Re: [PERFORM] update 600000 rows

2007-12-15 Thread Loïc Marteau

Steve Crawford wrote:

[EMAIL PROTECTED] wrote:

Hello

I have a Python script that updates 600,000 rows in one table of my
PostgreSQL database from a CSV file, and the transaction takes 5 hours
to complete...

Let's see if I guessed correctly.

Your Python script is stepping through a 600,000 row file and updating 
information in a table (of unknown rows/columns) by making 600,000 
individual updates all wrapped in a big transaction. If correct, that 
means you are doing 600,000/(3,600 * 5) = 33 queries/second. If this 
is correct, I'd first investigate simply loading the csv data into a 
temporary table, creating appropriate indexes, and running a single 
query to update your other table.
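
(A minimal sketch of that approach; the table name, column names, and file
path below are made up for illustration:)

  -- Load the CSV into a staging table, then do one set-based UPDATE.
  CREATE TEMP TABLE staging (id integer, value text);
  COPY staging FROM '/path/to/data.csv' WITH CSV;  -- server-side path
  CREATE INDEX staging_id_idx ON staging (id);
  ANALYZE staging;

  UPDATE mytable AS t
  SET    value = s.value
  FROM   staging AS s
  WHERE  t.id = s.id;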


I can try this. The problem is that I have to do an INSERT when the
UPDATE doesn't affect any row (because the row doesn't exist yet). The
number of rows affected by inserts is small compared to the number of
updated rows (it was 0 when I tested my script). I could do it with a
temporary table: update all the rows that can be updated, then insert
the rows that are in the temporary table but not in the production
table, using a 'NOT IN' statement. Is that a correct way to do it?
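
(A sketch of that second step, reusing the hypothetical staging table from
the earlier example; NOT EXISTS is shown instead of NOT IN only because it
usually plans better on large tables, but either form expresses the same
idea:)

  -- After the UPDATE, add the rows that exist only in the staging table.
  INSERT INTO mytable (id, value)
  SELECT s.id, s.value
  FROM   staging AS s
  WHERE  NOT EXISTS (SELECT 1 FROM mytable AS t WHERE t.id = s.id);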


First, when I run htop I see that the memory used is never more than
150 MB.

I don't understand, in that case, why setting the shmall and shmmax kernel
parameters to 16 GB of memory (the server has 32 GB) speeds up the
transaction a lot compared to setting shmall and shmmax to (only) 2 GB?!
  
Are you saying that you did this and the performance improved, or are 
you wondering whether it would?


Yes, I did this and the performance improved. I don't understand why. 
Sorry for my poor English...



The script runs in a single transaction and pauses from time to time to
give PostgreSQL time to write the data to disk.
  
This doesn't make sense. If the transaction completes successfully 
then PostgreSQL has committed the data to disk (unless you have done 
something non-standard and not recommended like turning off fsync). If 
you are adding pauses between updates, don't do that - it will only 
slow you down. If the full transaction doesn't complete, all updates 
will be thrown away anyway and if it does complete then they were 
committed.


Sorry, the pause is not caused by the Python script but by PostgreSQL
itself. It does an average of roughly 3,000 updates, then pauses for about
2 minutes (htop shows postgres in a writing state; I don't really know
whether it is actually doing I/O writes). What I mean is: if it writes
things to disk in the middle of the transaction, I don't understand why.
Would performance be better if the data were only written at the end of
the transaction? In production I want data written to disk regularly to
prevent data loss, but is there any benefit to writing temporary data to
disk in the middle of a transaction?

  
See above. Actual disk IO is handled by the server. PostgreSQL is good 
at the "D" in ACID. If your transaction completes, the data has been 
written to disk. Guaranteed.


Cheers,
Steve


What I am trying to say is that in "normal" use (not when I run this 
maintenance script) I want to be sure that insert and update requests are 
written to disk. They are small (1, 2 or 3 rows affected) but there are a 
lot of them, issued by many users. For this maintenance script alone, 
though, is there perhaps some other tweak I can apply to reduce the I/O 
stress during the transaction?


Cheers,

Loic





Re: [PERFORM] update 600000 rows

2007-12-15 Thread Greg Smith

On Sat, 15 Dec 2007, [EMAIL PROTECTED] wrote:


First, when I run htop I see that the memory used is never more than 150 MB.
I don't understand, in that case, why setting the shmall and shmmax kernel
parameters to 16 GB of memory (the server has 32 GB) speeds up the
transaction a lot compared to setting shmall and shmmax to (only) 2 GB?!


The kernel parameters provide an upper limit for how much memory 
PostgreSQL can allocate, but by themselves they don't actually request 
more memory.  There is a configuration parameter called shared_buffers 
that is the main thing to adjust.  Since you say you're new to this, see 
http://www.westnet.com/~gsmith/content/postgresql/pg-5minute.htm for the 
first set of things you should be adjusting.


If you're doing lots of updates, you'll need to increase 
checkpoint_segments as well.  Once you get the memory allocated and 
checkpoint parameters in the right range, at that point you'll be prepared 
to look into transaction grouping and application issues in that area.
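
(For reference, those two settings live in postgresql.conf; the values
below are only illustrative for a large-memory server, not
recommendations:)

  shared_buffers = 2GB        # memory PostgreSQL itself allocates for its
                              # cache; the kernel's shmmax/shmall must allow
                              # at least this much shared memory
  checkpoint_segments = 32    # more WAL between checkpoints helps
                              # update-heavy batch jobs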


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] update 600000 rows

2007-12-15 Thread andrew
Loïc Marteau <[EMAIL PROTECTED]> wrote ..
> Steve Crawford wrote:
> > If this
> > is correct, I'd first investigate simply loading the csv data into a
> > temporary table, creating appropriate indexes, and running a single 
> > query to update your other table.

My experience is that this is MUCH faster. My predecessor in my current 
position was doing an update from a csv file line by line with perl. That is 
one reason he is my predecessor. Performance did not justify continuing his 
contract.
 
> I can try this. The problem is that I have to do an INSERT when the
> UPDATE doesn't affect any row (because the row doesn't exist yet). The
> number of rows affected by inserts is small compared to the number of
> updated rows (it was 0 when I tested my script). I could do it with a
> temporary table: update all the rows that can be updated, then insert
> the rows that are in the temporary table but not in the production
> table, using a 'NOT IN' statement. Is that a correct way to do it?

That's what I did at first, but later I found better performance with a TRIGGER 
on the permanent table that deletes the target of an UPDATE, if any, before the 
UPDATE. That's what PG does anyway, and now I can do the entire UPDATE in one 
command.
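
(One way to read that suggestion is a BEFORE INSERT trigger that removes
any existing row with the same key, so that a single INSERT from the
staging table covers both new and changed rows. A sketch with hypothetical
names, not necessarily exactly what was done here:)

  CREATE OR REPLACE FUNCTION delete_existing() RETURNS trigger AS $$
  BEGIN
      -- Remove any row that would conflict with the incoming one.
      DELETE FROM mytable WHERE id = NEW.id;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER replace_row
      BEFORE INSERT ON mytable
      FOR EACH ROW EXECUTE PROCEDURE delete_existing();

  -- One command now handles both new and changed rows:
  INSERT INTO mytable (id, value)
  SELECT id, value FROM staging;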



Re: [PERFORM] RAID arrays and performance

2007-12-15 Thread Bruce Momjian
Mark Mielke wrote:
> Asynchronous I/O is no more a magic bullet than threading. It requires a 
> lot of work to get it right, and if one gets it wrong, it can be slower 
> than the regular I/O or single threaded scenarios. Both look sexy on 
> paper, neither may be the solution to your problem. Or they may be. We 
> wouldn't know without numbers.

Agreed.  We currently don't use multiple CPUs or multiple disks
efficiently for single-query loads.  There is certainly more we could do
in these areas, and it is on the TODO list.

The good news is that most work loads are multi-user and we use
resources more evenly in those cases.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
