[GENERAL] Need to update all my 60 million rows at once without transactional integrity

christian_behrens Wed, 23 Apr 2008 01:01:30 -0700

Hi!

How can I make a Update of a column in a very large table for all rows without 
using the double amount of disc space and without any need for atomic operation?


I have a very large table with about 60 million rows. I sometimes need to do a 
simple update to ALL rows that resets a status-flag to zero.

I don't need to have transactional integrity (but of course if the system 
crashes, there should be no data corruption. A separate flag in the file system 
can well save the fact that that bulk update was in progress) for this, I don't 
care or want a abort or "all or nothing" like SQL mandates. The idea is 
basically that either this update succeeds or it succeeds or - there is no 
"not". It must succeed. It must be tried until it works. If the update is 
halfway finished, that's okay.

If I just do an 
UPDATE table SET flag=0;
then Pg will make a copy of every row which must be cleaned up by vaccuum. I 
understand - and don't like during this specific problem - that PG is a MVCC 
database and will keep my old versions of the rows before the update alive. 
This is very bad.

If I do a batched loop like this:
UPDATE table SET flag=0 where id>=0 and id <200;
UPDATE table SET flag=0 where id>=200 and id <400;
UPDATE table SET flag=0 where id>=400 and id <600;
...

then PG will seek all over my harddrive I think.

It would be much better if it could just start in the beginning of the table 
and work it's way towards the end. But which sort-criteria would satisfy this? 
I don't think that there is any SQL that does something like that.


Another ideas (and I think it's a quite good idea) would be to
drop the column and recreate it with a new default value.

But the problem is that this is not actually MY database, but an appliance 
(which has a harddrive that does not have the double amount of space available 
btw) and it has to work reliably whenever something special happens.

And I don't think I should create new columns (the old one would be hidden and 
their internal column ids lost I think) all the time, that might have a limit.

Can I do it maybe every day??


Is there any other way to go?

I would really like to kind of "skip transactions". Of course basic data 
integrity in on disc structures, but not atomicy for this bulk update!

I read that PG has an internal command language that is used to build up a 
database when all the tables for e.g. table-names are not present yet.

Could I use that to hack my way around transactions?

Basically I can do everything to this PG installation, this is an extremly 
controlled, locked down environment. I don't need to care for anyone else, it's 
always the same schema, the same version, same OS, etc. and I could happily 
apply any hack that solves the problem.

Many thanks for any pointers or ideas how to solve this problem!
 Christian
-- 
Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten 
Browser-Versionen downloaden: http://www.gmx.net/de/go/browser

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[GENERAL] Need to update all my 60 million rows at once without transactional integrity

Reply via email to