On 30.03.2011 21:06, Jon Nelson wrote:
The short version is that if a PostgreSQL backend is killed (by the Linux
OOM killer, kill -9, etc.) while operations are taking place in a
*different* backend, corruption is introduced in the database the other
backend is using.  I don't want to say it happens 100% of the time, but it
happens every time I test.
...

Here is how I am reproducing the problem:

1. Open a psql connection to database A. It may remain idle.
2. Wait for an automated process to connect to database B and start
operations.
3. kill -9 the backend for the psql connection to database A (a rough
shell sketch of these steps follows).
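
Something like the following, where databaseA, databaseB, tableB, and the
insert workload are only stand-ins for the real setup (pg_stat_activity's
procpid column is per the 9.0-era catalogs; later releases call it pid):

#!/bin/sh
# Sketch only: databaseA/databaseB/tableB and the workload are placeholders.

# 1. Open a psql session against database A and leave it idle.
#    Feeding it from a long-lived silent pipe keeps the backend connected but idle.
sleep 600 | psql databaseA &

# 2. Run a trivial automated workload against database B.
( while true; do
    psql databaseB -c "INSERT INTO tableB (val) VALUES (1);"
  done ) &

# 3. kill -9 the *server* backend serving the database A session.
BACKEND_A=$(psql postgres -At -c \
  "SELECT procpid FROM pg_stat_activity WHERE datname = 'databaseA';")
kill -9 "$BACKEND_A"

# Then watch the server log for crash recovery and exercise database B again.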

Then I observe the backends all shutting down and PostgreSQL entering
recovery mode, which succeeds. Subsequent operations on other databases
appear fine, but not on database B: an index on one of its tables is
corrupted, and it is always the same index.

2011-03-30 14:51:32 UTC   LOG:  server process (PID 3871) was terminated by
signal 9: Killed
2011-03-30 14:51:32 UTC   LOG:  terminating any other active server
processes
2011-03-30 14:51:32 UTC   WARNING:  terminating connection because of crash
of another server process
2011-03-30 14:51:32 UTC   DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2011-03-30 14:51:32 UTC   HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2011-03-30 14:51:32 UTC databaseB databaseB WARNING:  terminating connection
because of crash of another server process
2011-03-30 14:51:32 UTC databaseB databaseB DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
2011-03-30 14:51:32 UTC databaseB databaseB HINT:  In a moment you should be
able to reconnect to the database and repeat your command.
2011-03-30 14:51:32 UTC   LOG:  all server processes terminated;
reinitializing
2011-03-30 14:51:32 UTC   LOG:  database system was interrupted; last known
up at 2011-03-30 14:46:50 UTC
2011-03-30 14:51:32 UTC databaseB databaseB FATAL:  the database system is
in recovery mode
2011-03-30 14:51:32 UTC   LOG:  database system was not properly shut down;
automatic recovery in progress
2011-03-30 14:51:32 UTC   LOG:  redo starts at 301/1D328E40
2011-03-30 14:51:33 UTC databaseB databaseB FATAL:  the database system is
in recovery mode
2011-03-30 14:51:34 UTC   LOG:  record with zero length at 301/1EA08608
2011-03-30 14:51:34 UTC   LOG:  redo done at 301/1EA08558
2011-03-30 14:51:34 UTC   LOG:  last completed transaction was at log time
2011-03-30 14:51:31.257997+00
2011-03-30 14:51:37 UTC   LOG:  autovacuum launcher started
2011-03-30 14:51:37 UTC   LOG:  database system is ready to accept
connections
2011-03-30 14:52:05 UTC databaseB databaseB ERROR:  index "<elided>"
contains unexpected zero page at block 0
2011-03-30 14:52:05 UTC databaseB databaseB HINT:  Please REINDEX it.
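
Following that HINT, the immediate workaround is to rebuild the index; with
a placeholder standing in for the elided index name, that amounts to roughly:

# The index name below is only a placeholder for the one elided in the log.
psql databaseB -c 'REINDEX INDEX the_troublesome_index;'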

What's more, I can execute a 'DELETE FROM tableB' (where tableB is the
table with the troublesome index) without error, but it is when I try to
*insert* that I get the problem. The index is a standard btree index, and
the DELETE statement has no WHERE clause.
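
That asymmetry would be consistent with the DELETE (having no WHERE clause)
being a plain sequential heap scan that never reads the index, while the
INSERT has to descend the btree and so immediately hits the zeroed block 0.
With the same placeholder names as above:

# Placeholder names again; only meant to show which statement touches the index.
psql databaseB -c 'DELETE FROM tableB;'                   # heap only: succeeds
psql databaseB -c "INSERT INTO tableB (val) VALUES (1);"  # updates the btree: fails with
                                                          # "unexpected zero page at block 0"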

Can you provide a self-contained test script to reproduce this?

Is the corruption always the same, ie. "unexpected zero page at block 0" ?

My interpretation of these values is that the drives themselves have
their write caches disabled.

Ok. It doesn't look like a hardware issue, as there's no OS crash involved.
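
For completeness, the drive-level write cache being discussed can usually be
queried on Linux with hdparm; the device path below is just an example:

# Query (not change) the drive's write-cache setting; /dev/sda is an example device.
hdparm -W /dev/sda
# "write-caching = 0 (off)" would match the interpretation above.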

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
