Re: [HACKERS] production server down

Joe Conway Mon, 27 Dec 2004 10:09:27 -0800

Tom Lane wrote:

Are you using one of the scripts that
does an auto initdb if it doesn't see a valid PGDATA?  11 seconds might
be about right for that.

One problem with this theory is how come you didn't get screwed during
*that* boot cycle.  It seems to require assuming that the NFS mount came
online just after the initdb finished (else initdb would have
overwritten the on-NFS pg_control) but before the regular postmaster
started (else this same scenario would have played out then).  That's
not a very wide window.

[followup] We've now had a chance to bring Postgres down and check under the mount point. There *is* indeed a newly initdb'd cluster under there. FWIW the control file is corrupt:

# pg_controldata /home/jconway/pgsql/fds/replica/pgdata
WARNING: Calculated CRC checksum does not match value stored in file.
Either the file is corrupt, or it has a different layout than this program
is expecting.  The results below are untrustworthy.

pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               in production
pg_control last modified:             Sat Feb  6 22:28:16 2106
Current log file ID:                  0
Next log file segment:                10161036
Latest checkpoint location:           0/9AA1B4
Prior checkpoint location:            0/9B0B8C
Latest checkpoint's REDO location:    0/0
Latest checkpoint's UNDO location:    C/218
Latest checkpoint's StartUpID:        17142
Latest checkpoint's NextXID:          1099443932
Latest checkpoint's NextOID:          8192
Time of latest checkpoint:            Wed Apr  8 07:05:36 6325
Database block size:                  1
Blocks per segment of large relation: 128
Maximum length of identifiers:        67
Maximum number of function arguments: 0
Date/time type storage:               floating-point numbers
Maximum length of locale name:        0
LC_COLLATE:
LC_CTYPE:

I have a tarred copy of the under-the-mount PGDATA if anyone is interested in examining it.

BTW, there was another Postgres cluster on this same server which we had not used since the November 2 reboot -- it was corrupt in pretty much the same way and also had an initdb'd cluster under its mount.

So it looks like using an auto initdb startup script is a very bad idea when PGDATA is on an NFS mount. We left the under-mount structure in place and did "chown root:root" and "chmod 000" on it. And, as mentioned in an earlier post, we now rely on the DBA to start Postgres manually after a server restart.
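
If someone does want an automated start, the startup script needs at least a guard that refuses to do anything when the mount is missing, and that never runs initdb on its own. Below is a rough sketch of such a guard in Python. It is not what we run here; the function name and the injectable ismount parameter are made up for illustration. The only facts it relies on are that a cluster directory contains a PG_VERSION file and that os.path.ismount reports whether a path is a mount point.

```python
import os


def check_pgdata(pgdata, mount_point, ismount=os.path.ismount):
    """Return (ok, reason). Never creates or initializes anything."""
    # Check the mount first: the directory hidden under the mount point
    # may itself hold a valid-looking cluster from an earlier auto initdb.
    if not ismount(mount_point):
        return False, f"{mount_point} is not mounted; refusing to start"
    if not os.path.isfile(os.path.join(pgdata, "PG_VERSION")):
        return False, f"{pgdata} has no PG_VERSION; refusing to start"
    return True, "ok"
```

A wrapper would call this before launching the postmaster and exit with an error when ok is false, leaving the decision to a human. The order of the two checks matters, because the PG_VERSION test alone would pass against the stray cluster under the mount point.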

Joe

