Hi,

I am running streaming replication with the archive.

As you can see below that Master pg_xlog is at WAL:
000000010000013300000093. But archive_status shows way behind:
000000010000013300000088.done
What could be the reason behind this? How should I let the PostgreSQL
archive the WAL from 000000010000013300000089 to 000000010000013300000092.

pg_xlog/
----------------------------------
-rw------- 1 postgres postgres 16777216 Sep 27 23:30
000000010000013300000082
-rw------- 1 postgres postgres 16777216 Sep 27 23:31
000000010000013300000083
-rw------- 1 postgres postgres 16777216 Sep 27 23:31
000000010000013300000084
-rw------- 1 postgres postgres 16777216 Sep 27 23:31
000000010000013300000085
-rw------- 1 postgres postgres 16777216 Sep 27 23:51
000000010000013300000086
-rw------- 1 postgres postgres 16777216 Sep 28 00:47
000000010000013300000087
-rw------- 1 postgres postgres 16777216 Sep 28 01:55
000000010000013300000088
-rw------- 1 postgres postgres 16777216 Sep 28 02:47
000000010000013300000089
-rw------- 1 postgres postgres 16777216 Sep 27 18:04
00000001000001330000008A
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
00000001000001330000008B
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
00000001000001330000008C
-rw------- 1 postgres postgres 16777216 Sep 27 18:05
00000001000001330000008D
-rw------- 1 postgres postgres 16777216 Sep 27 18:03
00000001000001330000008E
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
00000001000001330000008F
-rw------- 1 postgres postgres 16777216 Sep 27 18:02
000000010000013300000090
-rw------- 1 postgres postgres 16777216 Sep 27 18:37
000000010000013300000091
-rw------- 1 postgres postgres 16777216 Sep 27 20:21
000000010000013300000092
-rw------- 1 postgres postgres 16777216 Sep 27 21:00
000000010000013300000093

pg_xlog/archive_status
----------------------------------
-rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000081.done
-rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000082.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000083.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000084.done
-rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000085.done
-rw------- 1 postgres postgres 0 Sep 27 23:51 000000010000013300000086.done
-rw------- 1 postgres postgres 0 Sep 28 00:47 000000010000013300000087.done
-rw------- 1 postgres postgres 0 Sep 28 01:55 000000010000013300000088.done

postgresql.conf
#----------------------------------
wal_level = hot_standby
max_wal_senders = 5
wal_keep_segments = 32
archive_mode = on
archive_command = '/opt/wal_archive.sh "%p" "%f" > archive_command.log 2>&1'

wal_archive.sh
------------------------------------
#!/bin/bash -xv
PG_XLOG="$1"
PG_XLOG_FILENAME="$2"
HOST=hostname
ARCHIVE_DIR="/pg_archive/master"
ARCHIVE_TO_KEEP="3"     #No of days of archive logs to keep
EMAIL="a...@b.com"
ERROR_COUNT_FILE="/tmp/replication_archive_error_count.txt"

OLD_COUNT=`cat ${ERROR_COUNT_FILE}`
DNOW=`date +%u`
hour=$(date +%H)
D=`date`

#Do the cleanup if the day is Monday or Thursday and time is between 11
p.m. UTC and 22 hrs UTC
if [ "$DNOW" -eq "1" -o "$DNOW" -eq "4" -a "$hour" -ge 11 -a "$hour" -lt 22
]; then
    find "${ARCHIVE_DIR}"/ -type f -mtime +"${ARCHIVE_TO_KEEP}" -exec rm -f
{} +
    if [ "$?" -eq "1" ]; then
        echo "The wal_archive script could not cleanup the archive
directory of $HOST" | mail -s "ERROR - WAL Archive for $HOST" "$EMAIL"
    fi
fi

if [ ! -f "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" ]; then
    cp "${PG_XLOG}" "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
    /usr/bin/rsync -W -az "${PG_XLOG}" postgres@standby
:"${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
    if [ "$?" -ne "0" ]; then
        #If rsync fails, then remove the copied file from master, increase
the error count, and retry.
        rm -rf "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}"
        NEW_COUNT=`expr $OLD_COUNT + 1`
        if [ "$NEW_COUNT" -ge "100" ]; then
            echo -e "${D}""\n""Rsync could not transfer the WAL file from
Master to slave." | mail -s "ALERT - WAL Archive for $HOST" "$EMAIL"
            echo "0" > $ERROR_COUNT_FILE
        else
            echo "$NEW_COUNT" > $ERROR_COUNT_FILE
        fi
        exit 1
    else
        echo "0" > $ERROR_COUNT_FILE
        exit 0
    fi
else
    exit 0
fi


John Britto, M.Sc.
DevOps Engineer
Sodexis, Inc.
www.sodexis.com
M: +91-0-8012186991
<j...@sodexis.com>


On Thu, Sep 28, 2017 at 1:05 AM, Scott Mead <sco...@openscg.com> wrote:

>
>
> On Wed, Sep 27, 2017 at 2:55 PM, Jerry Sievers <gsiever...@comcast.net>
> wrote:
>
>> John Britto <j...@sodexis.com> writes:
>>
>> > Hello,
>> >
>> > I have a streaming replication setup along with WAL archive.
>> >
>> > archive_command = ‘test ! -f /var/pg_archive/%f && cp %p <archive
>> > location>%f && scp %p postgres@192.168.0.123:<archive location>/%f'
>> >
>> > When the SCP command fails, the master repeatedly tries to send the
>> > archived WAL to standby. But during this time, the pg_xlog directly
>> > grew with newer WAL files.
>> >
>> > The streaming replication hasn't had the problem because on my check,
>> > the WAL write location on the primary was same with the last WAL
>> > location received/replayed in standby.
>> >
>> > Since the pg_xlog in the master had few newer WAL files, the master
>> > archive is lagging to pick the current pg_xlog WAL file.  When a new
>> > WAL occur in the pg_xlog, Master picks the old WAL file to send to
>> > the standby.
>>
>> Yup Pg is going to handle the unshipped WALs one at a time and it will
>> do them in order, oldest (lowest file name) first.
>>
>> > How should I force the PostgreSQL to batch copy the lagging WAL files
>> > to pg_archive and then send to standby.  Can I do this manually using
>> > rsync?  I wonder how PostgreSQL knows the changes because it
>> > maintains info in archive_status with extension as .ready and .done.
>>
>> I suggest you fix your basic archiving routine to complete and exit with
>> success to postgres.
>>
>
> +1
>
> scp %p host:/archive/%f
> if [ "$?" -ne 0 ]
> then
>     echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log
>     cp %p /localarchive/%f
>     exit 0
> fi
>
>   Essentially, always make sure that you are returning a 0 to postgres.
> If there is a failure, either log it or handle it separately.  This code
> snippet is *NOT COMPLETE, *there's a lot more to do in order to make it
> production ready and recoverable.  The biggest issue I've had with scp is
> that you have to set and enforce a timeout and trap the timeout.  Note, the
> above only works until your local machine (or the /localarchive partition)
> runs out of space.  It's *really* important that you have ultra solid
> logging and alerting around this.
>
>
>
>
>> And as for archive command scripts in general, simpler is better.
>>
>> If you want to manually ship them in bulk, you may do so but then will
>> need to remove the corresponding archive_status/$foo.ready file so that
>> postgres won't keep trying to ship the same one.
>>
>
> I'm a huge fan of this strategy, especially if you're sending to a remote
> datacenter.
>
> archive_command.sh:
>
> cp %p /localarchive/%f
> if [ "$?" -ne 0 ]
> then
>     echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log
>     exit 0
> fi
>
> send_archive_to_remote.sh
> rsync -avzP /localarchive/* host:/archive/
>
>
> Of course, now you have to deal with files that are removed from the slave
> and making sure they get removed from the master appropriately, but, this
> is fairly straightforward.
>
> --Scott
>
>
>
>
>> HTH
>>
>> > Please assist.
>> >
>> > Thanks,
>> >
>> > John Britto
>> >
>> >
>> >
>>
>> --
>> Jerry Sievers
>> Postgres DBA/Development Consulting
>> e: postgres.consult...@comcast.net
>> p: 312.241.7800
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
>
>
> --
> --
> Scott Mead
> Sr. Architect
> *OpenSCG <http://openscg.com>*
> http://openscg.com
>

Reply via email to