On 11/6/13, 12:26 PM, Jeff Ross wrote:
On 11/6/13, 11:32 AM, Jeff Janes wrote:
On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jr...@wykids.org
<mailto:jr...@wykids.org>> wrote:
_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the standby
ssh dukkha.internal rm -rf /wal/*
#stop the standby server if it is running
ssh dukkha.internal sudo /usr/local/bin/svc -d
/service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
--copy-links \
--delete \
--exclude=backup_label \
Excluding backup_label is exactly the wrong thing to do. The only
reason backup_label is created in the first place is so that it can
be copied to the replica, where it is needed. It's existence on the
master is a nuisance.
--exclude=postgresql.conf \
--exclude=recovery.done \
-e ssh -avz /var/postgresql/data.93.5432/ \
dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
ssh dukkha.internal rm -f
/var/postgresql/data.93.5432/pg_xlog/archive_status/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
ssh dukkha.internal ln -s /var/postgresql/recovery.conf
/var/postgresql/data.93.5432/recovery.conf
psql -c "select pg_stop_backup();" template1
ssh dukkha.internal sudo /usr/local/bin/svc -u
/service/postgresql.5432
_postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh
+ date +%Y-%m-%d
+ backup_label=wykids_2013-11-06
+ ssh dukkha.internal rm -rf /wal/*
+ ssh dukkha.internal sudo /usr/local/bin/svc -d
/service/postgresql.5432
+ rsync -e ssh /wal/ dukkha.internal:/wal/
skipping directory .
Where is the above rsync coming from? It doesn't seem to be in the
shell script you showed.
Anyway, I think you need to copy the wal over after you call
pg_stop_backup, not before you call pg_start_backup.
Cheers,
Jeff
Hi Jeff,
Thanks for the reply. Oops, I copied one of the many changes to the
script, but not the one with the rsync to copy /wal from the primary
to the standby.
I should have mentioned that wal archiving is setup and working from
the primary to the standby. It saves wal both on the locally on the
primary and remotesly on the standby.
I moved the rsync line to copy wal from primary to secondary after
pg_stop_backup but I'm still getting the same panic on the standby.
Here's the real, honest version of the script I use to start the hot
standby:
_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the secondary
ssh dukkha.internal "rm -rf /wal/*"
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
--copy-links \
--delete \
--exclude=backup_label \
--exclude=postgresql.conf \
--exclude=recovery.done \
-e ssh -avz /var/postgresql/data.93.5432/ \
dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
ssh dukkha.internal "rm -f
/var/postgresql/data.93.5432/pg_xlog/archive_status/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
ssh dukkha.internal "ln -s /var/postgresql/recovery.conf
/var/postgresql/data.93.5432/recovery.conf"
psql -c "select pg_stop_backup();" template1
rsync -e ssh -avz /wal/ dukkha.internal:/wal/
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432
Here are the logs on the standby after running the above:
2013-11-06 11:56:30.792461500 <%> LOG: database system was
interrupted; last known up at 2013-11-06 11:52:22 MST
2013-11-06 11:56:30.800685500 <%> LOG: entering standby mode
2013-11-06 11:56:30.800891500 <%> LOG: invalid primary checkpoint record
2013-11-06 11:56:30.800930500 <%> LOG: invalid secondary checkpoint
record
2013-11-06 11:56:30.801004500 <%> PANIC: could not locate a valid
checkpoint record
Jeff
My apologies to Jeff--I'd missed his in-line comment above that I should
*not* exclude the backup label from the rsync of the primary to the
standby. As soon as I removed that exclusion and with his other
suggested change that I should copy the /wal from the primary to the
standby after pg_stop_backup, streaming replication started on the
standby exactly as it should.
Logs from the standby:
2013-11-07 09:21:15.273712500 <%> LOG: database system was interrupted;
last known up at 2013-11-07 09:16:05 MST
2013-11-07 09:21:15.286834500 <%> LOG: entering standby mode
2013-11-07 09:21:16.873654500 <%> LOG: restored log file
"000000010000000000000050" from archive
2013-11-07 09:21:16.936355500 <%> LOG: redo starts at 0/50000024
2013-11-07 09:21:17.129718500 <%> LOG: consistent recovery state
reached at 0/50036D6C
2013-11-07 09:21:17.131933500 <%> LOG: database system is ready to
accept read only connections
2013-11-07 09:21:17.136856500 cp: /wal/000000010000000000000051: No such
file or directory
2013-11-07 09:21:17.194811500 <%> LOG: started streaming WAL from
primary at 0/51000000 on timeline 1
Jeff Ross