Hi all, I have a new 9.1.1 hot standby machine (VM). It consistently starts up, replays the same WAL segments during recovery, and then gets a "Bus error":
2011-11-30 20:57:37 EST LOG: database system was interrupted while in recovery at log time 2011-11-30 09:37:11 EST
2011-11-30 20:57:37 EST HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2011-11-30 20:57:37 EST LOG: entering standby mode, pausing
2011-11-30 20:57:57 EST LOG: restored log file "000000010000005D00000045" from archive
2011-11-30 20:57:58 EST LOG: redo starts at 5D/4560B61C
2011-11-30 20:57:59 EST LOG: restored log file "000000010000005D00000046" from archive
2011-11-30 20:58:00 EST LOG: restored log file "000000010000005D00000047" from archive
2011-11-30 20:58:03 EST LOG: restored log file "000000010000005D00000048" from archive
2011-11-30 20:58:04 EST LOG: restored log file "000000010000005D00000049" from archive
2011-11-30 20:58:05 EST LOG: restored log file "000000010000005D0000004A" from archive
2011-11-30 20:58:06 EST LOG: restored log file "000000010000005D0000004B" from archive
2011-11-30 20:58:07 EST LOG: restored log file "000000010000005D0000004C" from archive
2011-11-30 20:58:07 EST LOG: startup process (PID 23167) was terminated by signal 7: Bus error
2011-11-30 20:58:07 EST LOG: terminating any other active server processes

After I saw the same problem a few times in a row, I stuck a sleep(20); call right after the "entering standby mode" message (hence the modified message "entering standby mode, pausing"), so I could have a few seconds to attach to the startup process with gdb and get a backtrace; I'm not sure why, but a core dump wasn't produced automatically, even though I think I've set ulimit appropriately. Anyway, here is the backtrace:

(gdb) continue
Continuing.

Program received signal SIGBUS, Bus error.
0x0825b329 in ReadBuffer_common (smgr=0xa0be758, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
    blockNum=98560, mode=RBM_ZERO, strategy=0x0, hit=0xbfa0403f "") at /usr/include/bits/string3.h:85
85        return __builtin___memset_chk (__dest, __ch, __len, __bos0 (__dest));
(gdb) bt
#0  0x0825b329 in ReadBuffer_common (smgr=0xa0be758, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
    blockNum=98560, mode=RBM_ZERO, strategy=0x0, hit=0xbfa0403f "") at /usr/include/bits/string3.h:85
#1  0x0825bcdf in ReadBufferWithoutRelcache (rnode=..., forkNum=MAIN_FORKNUM, blockNum=98560,
    mode=RBM_ZERO, strategy=0x0) at bufmgr.c:272
#2  0x080d4998 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=98560,
    mode=RBM_ZERO) at xlogutils.c:296
#3  0x080c6b85 in RestoreBkpBlocks (lsn=..., record=0xa0b53c8, cleanup=0 '\000') at xlog.c:3589
#4  0x080dd294 in gin_redo (lsn=..., record=0xa0b53c8) at ginxlog.c:676
#5  0x080d1759 in StartupXLOG () at xlog.c:6599
#6  0x080d42a5 in StartupProcessMain () at xlog.c:10028
#7  0x080eb637 in AuxiliaryProcessMain (argc=2, argv=<value optimized out>) at bootstrap.c:417
#8  0x0823e5ac in StartChildProcess (type=StartupProcess) at postmaster.c:4498
#9  0x08240afe in PostmasterMain (argc=3, argv=0xa083838) at postmaster.c:1111
#10 0x081e1a88 in main (argc=3, argv=0xa083838) at main.c:199
(gdb)

Some more details about my setup: The hot standby is an Ubuntu 11.04 32-bit OpenVZ VM. The master is a Debian Squeeze 32-bit VM. Both master and hot standby are running 9.1.1 compiled from source with:

    ./configure --prefix=/home/postgres/runtime/ --enable-debug --with-python

The master is archiving WAL files to the hot standby using rsync.
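Going back to the sleep(20) hack for a moment, in case anyone wants to reproduce it: it's nothing more elaborate than something like the following, right after the ereport() in StartupXLOG() (xlog.c) that prints "entering standby mode". This is a sketch from memory rather than the exact diff I applied:

    /* StartupXLOG(), xlog.c: message modified and a pause added so I can
     * attach gdb to the startup process before redo begins */
    ereport(LOG,
            (errmsg("entering standby mode, pausing")));
    sleep(20);      /* window for "gdb -p <startup pid>" */

That 20-second window is where the gdb transcript above comes from: attach to the startup process, then "continue" until the SIGBUS arrives.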
I initialized $PGDATA on the standby using a simple:

    rsync -avz postgres@master:/home/postgres/pgdata /home/postgres/

I'll attach my non-default pg_settings and recovery.conf file as well for completeness.

I notice gin_redo() in the backtrace; I am using a full-text index like this:

    "xxx_txtsearch" gin (to_tsvector('english'::regconfig, title || body))

on a large partitioned table.

Since the standby VM is new (I've used the same hosting provider before, but this is the first time I tried an Ubuntu VM), it's entirely possible I messed something up in the set-up. Or it could be some flakiness with the VM itself. But before I start blowing things away and trying to start from scratch, I figured I'd leave the machine alone for a bit in case this report is interesting and any more sleuthing would help.

Josh
pgdata# SELECT name, setting FROM pg_settings WHERE source != 'default';
            name             |                          setting
-----------------------------+------------------------------------------------------------
 application_name            | psql
 archive_command             | rsync -a %p postgres@standby:/home/postgres/wal_archive/%f
 archive_mode                | on
 checkpoint_segments         | 30
 client_encoding             | UTF8
 config_file                 | /home/postgres/pgdata/postgresql.conf
 data_directory              | /home/postgres/pgdata
 DateStyle                   | ISO, MDY
 default_text_search_config  | pg_catalog.english
 effective_cache_size        | 49152
 fsync                       | on
 hba_file                    | /home/postgres/pgdata/pg_hba.conf
 hot_standby                 | on
 ident_file                  | /home/postgres/pgdata/pg_ident.conf
 lc_collate                  | en_US.UTF-8
 lc_ctype                    | en_US.UTF-8
 lc_messages                 | en_US.UTF-8
 lc_monetary                 | en_US.UTF-8
 lc_numeric                  | en_US.UTF-8
 lc_time                     | en_US.UTF-8
 listen_addresses            | *
 log_line_prefix             | %t
 log_min_duration_statement  | 1000
 log_timezone                | America/New_York
 logging_collector           | on
 maintenance_work_mem        | 65536
 max_connections             | 15
 max_stack_depth             | 2048
 max_wal_senders             | 2
 server_encoding             | UTF8
 shared_buffers              | 16384
 synchronous_commit          | off
 TimeZone                    | America/New_York
 transaction_deferrable      | off
 transaction_isolation       | read committed
 transaction_read_only       | off
 wal_buffers                 | 512
 wal_keep_segments           | 32
 wal_level                   | hot_standby
 work_mem                    | 24576
(40 rows)
[Attachment: recovery.conf]