We have Postgres 9.0.11 running as a hot standby. The master was restarted and the standby went into a segmentation fault loop. A hard stop/start of the standby fixed it. Here are the pertinent logs, with excess and identifying information removed:
2012-12-28 03:39:14 UTC [16850]: [2-1] FATAL: replication terminated by primary server
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such file or directory
2012-12-28 03:39:14 UTC [16801]: [21-1] LOG: record with zero length at 1A01/D5000078
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such file or directory
2012-12-28 03:39:14 UTC [16798]: [2-1] LOG: WAL receiver process (PID 16671) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:14 UTC [16798]: [3-1] LOG: terminating any other active server processes
2012-12-28 03:39:15 UTC [16798]: [4-1] LOG: all server processes terminated; reinitializing
2012-12-28 03:39:15 UTC [16673]: [1-1] LOG: database system was interrupted while in recovery at log time 2012-12-28 03:35:47 UTC
2012-12-28 03:39:15 UTC [16673]: [2-1] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
zcat: /mnt/dbmount/walarchive/00000004.history.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/00000003.history.gz: No such file or directory
2012-12-28 03:39:16 UTC [16673]: [3-1] LOG: entering standby mode
zcat: /mnt/dbmount/walarchive/0000000300001A0100000092.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007D.gz: No such file or directory
2012-12-28 03:39:16 UTC [16673]: [4-1] LOG: redo starts at 1A01/7D00C500
zcat: /mnt/dbmount/walarchive/0000000300001A010000007E.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007F.gz: No such file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C0.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C1.gz: No such file or directory
2012-12-28 03:39:24 UTC [16681]: [1-1] LOG: restartpoint starting: xlog
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C2.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C3.gz: No such file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D3.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D4.gz: No such file or directory
2012-12-28 03:39:28 UTC [16673]: [5-1] LOG: consistent recovery state reached at 1A01/D430F1A0
2012-12-28 03:39:28 UTC [16798]: [5-1] LOG: database system is ready to accept read only connections
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such file or directory
2012-12-28 03:39:28 UTC [16673]: [6-1] LOG: record with zero length at 1A01/D5000078
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such file or directory
2012-12-28 03:39:28 UTC [16798]: [6-1] LOG: WAL receiver process (PID 16870) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:28 UTC [16798]: [7-1] LOG: terminating any other active server processes
2012-12-28 03:39:28 UTC [16798]: [8-1] LOG: all server processes terminated; reinitializing
2012-12-28 03:39:30 UTC [16871]: [1-1] LOG: database system was interrupted while in recovery at log time 2012-12-28 03:35:47 UTC
2012-12-28 03:39:30 UTC [16871]: [2-1] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
zcat: /mnt/dbmount/walarchive/00000004.history.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/00000003.history.gz: No such file or directory
2012-12-28 03:39:30 UTC [16871]: [3-1] LOG: entering standby mode
zcat: /mnt/dbmount/walarchive/0000000300001A0100000092.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007D.gz: No such file or directory
2012-12-28 03:39:30 UTC [16871]: [4-1] LOG: redo starts at 1A01/7D00C500
zcat: /mnt/dbmount/walarchive/0000000300001A010000007E.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A010000007F.gz: No such file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C0.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C1.gz: No such file or directory
2012-12-28 03:39:38 UTC [16883]: [1-1] LOG: restartpoint starting: xlog
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C2.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000C3.gz: No such file or directory
...
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D3.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D4.gz: No such file or directory
2012-12-28 03:39:41 UTC [16871]: [5-1] LOG: consistent recovery state reached at 1A01/D430F1A0
2012-12-28 03:39:41 UTC [16798]: [9-1] LOG: database system is ready to accept read only connections
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such file or directory
2012-12-28 03:39:41 UTC [16871]: [6-1] LOG: record with zero length at 1A01/D5000078
zcat: /mnt/dbmount/walarchive/0000000300001A01000000D5.gz: No such file or directory
2012-12-28 03:39:41 UTC [16798]: [10-1] LOG: WAL receiver process (PID 17144) was terminated by signal 11: Segmentation fault
2012-12-28 03:39:41 UTC [16798]: [11-1] LOG: terminating any other active server processes
2012-12-28 03:39:42 UTC [16798]: [12-1] LOG: all server processes terminated; reinitializing

It basically kept doing that over and over until I stopped and started it:

2012-12-28 03:58:22 UTC [16798]: [161-1] LOG: received fast shutdown request
2012-12-28 03:58:22 UTC [983]: [1-1] LOG: shutting down
2012-12-28 03:58:22 UTC [983]: [2-1] LOG: database system is shut down
2012-12-28 03:58:48 UTC [1219]: [1-1] LOG: database system was shut down in recovery at 2012-12-28 03:58:22 UTC
zcat: /mnt/dbmount/walarchive/00000004.history.gz: No such file or directory
zcat: /mnt/dbmount/walarchive/00000003.history.gz: No such file or directory
2012-12-28 03:58:48 UTC [1219]: [2-1] LOG: entering standby mode
2012-12-28 03:58:48 UTC [1219]: [3-1] LOG: restored log file "0000000300001A01000000C1" from archive
2012-12-28 03:58:48 UTC [1219]: [4-1] LOG: restored log file "0000000300001A01000000AF" from archive
2012-12-28 03:58:48 UTC [1219]: [5-1] LOG: redo starts at 1A01/AF010A98
2012-12-28 03:58:48 UTC [1219]: [6-1] LOG: restored log file "0000000300001A01000000B0" from archive
2012-12-28 03:58:48 UTC [1219]: [7-1] LOG: restored log file "0000000300001A01000000B1" from archive
...
2012-12-28 03:59:10 UTC [1219]: [50-1] LOG: restored log file "0000000300001A01000000DC" from archive
2012-12-28 03:59:10 UTC [1219]: [51-1] LOG: restored log file "0000000300001A01000000DD" from archive
2012-12-28 03:59:10 UTC [1219]: [52-1] LOG: consistent recovery state reached at 1A01/DDED8528
2012-12-28 03:59:10 UTC [1215]: [1-1] LOG: database system is ready to accept read only connections
2012-12-28 03:59:10 UTC [1219]: [53-1] LOG: restored log file "0000000300001A01000000DE" from archive
zcat: /mnt/dbmount/walarchive/0000000300001A01000000DF.gz: No such file or directory
2012-12-28 03:59:10 UTC [1219]: [54-1] LOG: unexpected pageaddr 1A00/F4000000 in log file 6657, segment 223, offset 0
zcat: /mnt/dbmount/walarchive/0000000300001A01000000DF.gz: No such file or directory
2012-12-28 03:59:10 UTC [1700]: [1-1] LOG: streaming replication successfully connected to primary

I'll note that /mnt/dbmount is on NFS. That might be related to the problem, but I did nothing to NFS at any point to fix this. Also, during the segfault loop the standby never attempted to connect to the primary when it couldn't find the archived segment. If there is any more info I can provide, let me know. This is a production DB, so I won't be able to do any disruptive testing; based on what I have seen so far, I think it would be difficult to replicate anyway. A search turned up only this related report: http://archives.postgresql.org/pgsql-bugs/2010-04/msg00080.php
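For reference, the standby pulls archived WAL through a zcat-based restore_command on the NFS mount. The sketch below is paraphrased rather than copied (connection details are removed and the exact command line may differ slightly), but it is the general shape of the recovery.conf in use:

    # recovery.conf on the standby (sketch; actual connection details removed)
    standby_mode     = 'on'
    primary_conninfo = 'host=<primary> port=5432 user=<replication user>'
    restore_command  = 'zcat /mnt/dbmount/walarchive/%f.gz > %p'

As I understand it, the zcat "No such file or directory" lines are just that restore_command failing on segments that have not been archived yet, which should be harmless in standby mode; the segfault loop is the part that is new.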