[GENERAL] Recovery failure

Klaus Ita Mon, 29 Jul 2013 23:34:21 -0700

Hi List!

On an originally designated hot_standby that had been not so hot for the
last 24h i tried to replay the wal_files left over from a (seemingly
corrupted) master server [missing clog... ].


I get this output:

"/etc/postgresql/9.1/main/postgresql.conf" 584L, 19942C written
root@pgstandby7:/var/lib/postgresql/9.1# /etc/init.d/postgresql restart
Restarting PostgreSQL 9.1 database server: main2013-07-30 06:18:20 UTC
LOG:  database system was interrupted while in recovery at log time
2013-07-29 09:39:03 UTC
2013-07-30 06:18:20 UTC HINT:  If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.
2013-07-30 06:18:20 UTC LOG:  entering standby mode
2013-07-30 06:18:20 UTC LOG:  incomplete startup packet
2013-07-30 06:18:20 UTC LOG:  restored log file "000000010000027A0000002B"
from archive
2013-07-30 06:18:20 UTC LOG:  redo starts at 27A/2B000078
2013-07-30 06:18:21 UTC FATAL:  the database system is starting up
2013-07-30 06:18:21 UTC FATAL:  the database system is starting up
2013-07-30 06:18:21 UTC LOG:  restored log file "000000010000027A0000002C"
from archive
2013-07-30 06:18:22 UTC FATAL:  the database system is starting up
2013-07-30 06:18:22 UTC LOG:  restored log file "000000010000027A0000002D"
from archive
2013-07-30 06:18:22 UTC FATAL:  the database system is starting up
2013-07-30 06:18:22 UTC LOG:  restored log file "000000010000027A0000002E"
from archive
2013-07-30 06:18:22 UTC LOG:  consistent recovery state reached at
27A/2E3F42E8
2013-07-30 06:18:22 UTC PANIC:  _bt_restore_page: cannot add item to page
2013-07-30 06:18:22 UTC CONTEXT:  xlog redo split_r: rel 1663/16405/797541
left 4743, right 18008, next 9681, level 0, firstright 194
2013-07-30 06:18:22 UTC LOG:  startup process (PID 11637) was terminated by
signal 6: Aborted
2013-07-30 06:18:22 UTC LOG:  terminating any other active server processes
The PostgreSQL server failed to start. Please check the log output:
2013-07-30 06:18:20 UTC LOG: database system was interrupted while in
recovery at log time 2013-07-29 09:39:03 UTC 2013-07-30 06:18:20 UTC HINT:
If this has occurred more than once some data might be corrupted and you
might need to choose an earlier recovery target. 2013-07-30 06:18:20 UTC
LOG: entering standby mode 2013-07-30 06:18:20 UTC LOG: incomplete startup
packet 2013-07-30 06:18:20 UTC LOG: restored log file
"000000010000027A0000002B" from archive 2013-07-30 06:18:20 UTC LOG: redo
starts at 27A/2B000078 2013-07-30 06:18:21 UTC FATAL: the database system
is starting up 2013-07-30 06:18:21 UTC FATAL: the database system is
starting up 2013-07-30 06:18:21 UTC LOG: restored log file
"000000010000027A0000002C" from archive 2013-07-30 06:18:22 UTC FATAL: the
database system is starting up 2013-07-30 06:18:22 UTC LOG: restored log
file "000000010000027A0000002D" from archive 2013-07-30 06:18:22 UTC FATAL:
the database system is starting up 2013-07-30 06:18:22 UTC LOG: restored
log file "000000010000027A0000002E" from archive 2013-07-30 06:18:22 UTC
LOG: consistent recovery state reached at 27A/2E3F42E8 2013-07-30 06:18:22
UTC PANIC: _bt_restore_page: cannot add item to page 2013-07-30 06:18:22
UTC CONTEXT: xlog redo split_r: rel 1663/16405/797541 left 4743, right
18008, next 9681, level 0, firstright 194 2013-07-30 06:18:22 UTC LOG:
startup process (PID 11637) was terminated by signal 6: Aborted 2013-07-30
06:18:22 UTC LOG: terminating any other active server processes ... failed!
 failed!
root@pgstandby7:/var/lib/postgresql/9.1# sha1sum
/home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002E.gz
5e9390c165b0885f165ed2bceafdd88692994b8a
/home/validad-pg-backups/pgmaster/wal_files/000000010000027A0000002E.gz





The wal_file itself '000000010000027A0000002E.gz' is the same on all my 3
backup destinations so if there was a corruption, it occured on the master
host. It there a way to 'skip' this wal file? Initially, I agreed with
'greg stark' on pgsql-bugs, that there must have been a filesystem /
whatever problem with my master cluster, but how can that propagate to a
hot/warm standby? shouldn't they write/delete their own files?





mandatory basic info:

#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

#custom_variable_classes = ''           # list of custom variable class
names

listen_addresses = '*'          # what IP address(es) to listen on;
max_connections = 25                    # (change requires restart)
timezone = 'Etc/UTC'

shared_buffers = 250MB                  # min 128kB
maintenance_work_mem = 20MB
checkpoint_completion_target = 0.9
effective_cache_size = 200MB

hot_standby = off                       # "on" allows queries during
recovery

max_standby_archive_delay = 90min       # max delay before canceling queries
                                        # when reading WAL from archive;
                                        # -1 allows indefinite delay
max_standby_streaming_delay = 50min     # max delay before canceling queries
                                        # when reading streaming WAL;
                                        # -1 allows indefinite delay





data_directory = '/var/lib/postgresql/9.1/not_main'             # use data
in another directory






root@pgstandby7:/var/lib/postgresql/9.1# cat not_main/recovery.conf
# Note that recovery.conf must be in $PGDATA directory.

# Specifies whether to start the server as a standby. In streaming
replication,
# this parameter must to be set to on.
standby_mode =
'on'

# Specifies a connection string which is used for the standby server to
connect
# with the primary.
primary_conninfo = 'host=pgmaster7.ipv6.oo.net port=5432 user=pgu_replicate
password=passwordpasswordpassword'

# Specifies a command to load archive segments from the WAL archive. If
# wal_keep_segments is a high enough number to retain the WAL segments
# required for the standby server, this may not be necessary. But
# a large workload can cause segments to be recycled before the standby
# is fully synchronized, requiring you to start again from a new base
backup.
restore_command = 'cp /home/validad-pg-backups/pgmaster/wal_files/%f.gz
/tmp/%f.gz && gunzip /tmp/%f.gz && mv /tmp/%f
%p'

# Specifies a trigger file whose presence should cause streaming
replication to
# end (i.e., failover).
trigger_file = '/var/lib/postgresql/9.1/main/stop_replication_trigger'


cannot currently to psql 'select version()' as the beast won't startup ;(,
but it's
ii  postgresql-9.1                   9.1.9-1
amd64        object-relational SQL database, version 9.1 server






thank you,

klaus

[GENERAL] Recovery failure

Reply via email to