Re: [GENERAL] After upgrade to 9.3, streaming replication fails to start--SOLVED

Jeff Ross Thu, 07 Nov 2013 08:30:57 -0800

On 11/6/13, 12:26 PM, Jeff Ross wrote:

On 11/6/13, 11:32 AM, Jeff Janes wrote:
On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jr...@wykids.org> wrote:
    _postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
    #!/bin/sh
    backup_label=wykids_`date +%Y-%m-%d`
    #remove any existing wal files on the standby
    ssh dukkha.internal rm -rf /wal/*
    #stop the standby server if it is running
    ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
    psql -c "select pg_start_backup('$backup_label');" template1
    rsync \
            --copy-links \
            --delete \
            --exclude=backup_label \
Excluding backup_label is exactly the wrong thing to do. The only reason backup_label is created in the first place is so that it can be copied to the replica, where it is needed. Its existence on the master is a nuisance.
            --exclude=postgresql.conf \
            --exclude=recovery.done \
            -e ssh -avz /var/postgresql/data.93.5432/ \
            dukkha.internal:/var/postgresql/data.93.5432/
    ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
    ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*
    ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
    ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
    ssh dukkha.internal ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf
    psql -c "select pg_stop_backup();" template1
    ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432


    _postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh
    + date +%Y-%m-%d
    + backup_label=wykids_2013-11-06
    + ssh dukkha.internal rm -rf /wal/*
    + ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
    + rsync -e ssh /wal/ dukkha.internal:/wal/
    skipping directory .
Where is the above rsync coming from? It doesn't seem to be in the shell script you showed.
Anyway, I think you need to copy the WAL over after you call pg_stop_backup, not before you call pg_start_backup.
Cheers,

Jeff
Hi Jeff,
Thanks for the reply. Oops, I copied one of the many changes to the script, but not the one with the rsync to copy /wal from the primary to the standby.
I should have mentioned that WAL archiving is set up and working from the primary to the standby. It saves WAL both locally on the primary and remotely on the standby.
I moved the rsync line that copies WAL from the primary to the secondary so that it runs after pg_stop_backup, but I'm still getting the same panic on the standby.
Here's the real, honest version of the script I use to start the hot standby:
_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the secondary
ssh dukkha.internal "rm -rf /wal/*"
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
        --copy-links \
        --delete \
        --exclude=backup_label \
        --exclude=postgresql.conf \
        --exclude=recovery.done \
        -e ssh -avz /var/postgresql/data.93.5432/ \
        dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
ssh dukkha.internal "ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf"
psql -c "select pg_stop_backup();" template1
rsync -e ssh -avz /wal/ dukkha.internal:/wal/
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432

Here are the logs on the standby after running the above:
2013-11-06 11:56:30.792461500 <%> LOG:  database system was interrupted; last known up at 2013-11-06 11:52:22 MST
2013-11-06 11:56:30.800685500 <%> LOG:  entering standby mode
2013-11-06 11:56:30.800891500 <%> LOG:  invalid primary checkpoint record
2013-11-06 11:56:30.800930500 <%> LOG:  invalid secondary checkpoint record
2013-11-06 11:56:30.801004500 <%> PANIC:  could not locate a valid checkpoint record
Jeff

My apologies to Jeff--I'd missed his in-line comment above that I should *not* exclude the backup label from the rsync of the primary to the standby. As soon as I removed that exclusion and made his other suggested change, copying /wal from the primary to the standby after pg_stop_backup, streaming replication started on the standby exactly as it should.
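
For anyone finding this later, here is a minimal sketch of just the two changes, using the hostname and paths from my script above. The DRY_RUN switch and the run() helper are not part of my real script; I've added them here so the command order can be printed without touching either server. The cleanup and service start/stop steps are left out.

```shell
#!/bin/sh
# Sketch of the corrected ordering only; not the full script.
# DRY_RUN and run() are hypothetical helpers added for illustration.
DRY_RUN=${DRY_RUN:-1}            # default: print the commands, do not run them
STANDBY=dukkha.internal
PGDATA=/var/postgresql/data.93.5432

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"       # show what would be executed
    else
        "$@"
    fi
}

sync_standby() {
    backup_label=wykids_$(date +%Y-%m-%d)
    run psql -c "select pg_start_backup('$backup_label');" template1
    # backup_label is deliberately NOT excluded: the standby needs it
    run rsync --copy-links --delete \
        --exclude=postgresql.conf --exclude=recovery.done \
        -e ssh -avz "$PGDATA/" "$STANDBY:$PGDATA/"
    run psql -c "select pg_stop_backup();" template1
    # copy the archived WAL only after pg_stop_backup has returned
    run rsync -e ssh -avz /wal/ "$STANDBY:/wal/"
}

sync_standby
```

With DRY_RUN left at its default the script only prints the four commands in order; setting DRY_RUN=0 would actually run them, so try that only on a test pair.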


Logs from the standby:

2013-11-07 09:21:15.273712500 <%> LOG:  database system was interrupted; last known up at 2013-11-07 09:16:05 MST
2013-11-07 09:21:15.286834500 <%> LOG:  entering standby mode
2013-11-07 09:21:16.873654500 <%> LOG:  restored log file "000000010000000000000050" from archive
2013-11-07 09:21:16.936355500 <%> LOG:  redo starts at 0/50000024
2013-11-07 09:21:17.129718500 <%> LOG:  consistent recovery state reached at 0/50036D6C
2013-11-07 09:21:17.131933500 <%> LOG:  database system is ready to accept read only connections
2013-11-07 09:21:17.136856500 cp: /wal/000000010000000000000051: No such file or directory
2013-11-07 09:21:17.194811500 <%> LOG:  started streaming WAL from primary at 0/51000000 on timeline 1
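
The two outcomes in this thread can be told apart by a single log line each: the PANIC about the checkpoint record in the failed run, and "started streaming WAL from primary" in the good one. The cp error just before the streaming line looks like the standby running out of archived segments and then switching to streaming, so I don't treat it as a failure. A rough sketch of a check follows; classify_standby_log is a made-up helper name, not something PostgreSQL provides.

```shell
#!/bin/sh
# classify_standby_log: hypothetical helper, reads a standby log on stdin
# and reports which of the two outcomes seen in this thread it contains.
classify_standby_log() {
    log=$(cat)
    case $log in
        *"could not locate a valid checkpoint record"*)
            echo "failed: no valid checkpoint" ;;
        *"started streaming WAL from primary"*)
            echo "streaming" ;;
        *)
            echo "unknown" ;;
    esac
}

# Example using two lines from the successful run above
printf '%s\n' \
    '2013-11-07 09:21:15.286834500 <%> LOG:  entering standby mode' \
    '2013-11-07 09:21:17.194811500 <%> LOG:  started streaming WAL from primary at 0/51000000 on timeline 1' \
    | classify_standby_log
```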


Jeff Ross
