Hi, I am running streaming replication with the archive.
As you can see below that Master pg_xlog is at WAL: 000000010000013300000093. But archive_status shows way behind: 000000010000013300000088.done What could be the reason behind this? How should I let the PostgreSQL archive the WAL from 000000010000013300000089 to 000000010000013300000092. pg_xlog/ ---------------------------------- -rw------- 1 postgres postgres 16777216 Sep 27 23:30 000000010000013300000082 -rw------- 1 postgres postgres 16777216 Sep 27 23:31 000000010000013300000083 -rw------- 1 postgres postgres 16777216 Sep 27 23:31 000000010000013300000084 -rw------- 1 postgres postgres 16777216 Sep 27 23:31 000000010000013300000085 -rw------- 1 postgres postgres 16777216 Sep 27 23:51 000000010000013300000086 -rw------- 1 postgres postgres 16777216 Sep 28 00:47 000000010000013300000087 -rw------- 1 postgres postgres 16777216 Sep 28 01:55 000000010000013300000088 -rw------- 1 postgres postgres 16777216 Sep 28 02:47 000000010000013300000089 -rw------- 1 postgres postgres 16777216 Sep 27 18:04 00000001000001330000008A -rw------- 1 postgres postgres 16777216 Sep 27 18:02 00000001000001330000008B -rw------- 1 postgres postgres 16777216 Sep 27 18:02 00000001000001330000008C -rw------- 1 postgres postgres 16777216 Sep 27 18:05 00000001000001330000008D -rw------- 1 postgres postgres 16777216 Sep 27 18:03 00000001000001330000008E -rw------- 1 postgres postgres 16777216 Sep 27 18:02 00000001000001330000008F -rw------- 1 postgres postgres 16777216 Sep 27 18:02 000000010000013300000090 -rw------- 1 postgres postgres 16777216 Sep 27 18:37 000000010000013300000091 -rw------- 1 postgres postgres 16777216 Sep 27 20:21 000000010000013300000092 -rw------- 1 postgres postgres 16777216 Sep 27 21:00 000000010000013300000093 pg_xlog/archive_status ---------------------------------- -rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000081.done -rw------- 1 postgres postgres 0 Sep 27 23:30 000000010000013300000082.done -rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000083.done -rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000084.done -rw------- 1 postgres postgres 0 Sep 27 23:31 000000010000013300000085.done -rw------- 1 postgres postgres 0 Sep 27 23:51 000000010000013300000086.done -rw------- 1 postgres postgres 0 Sep 28 00:47 000000010000013300000087.done -rw------- 1 postgres postgres 0 Sep 28 01:55 000000010000013300000088.done postgresql.conf #---------------------------------- wal_level = hot_standby max_wal_senders = 5 wal_keep_segments = 32 archive_mode = on archive_command = '/opt/wal_archive.sh "%p" "%f" > archive_command.log 2>&1' wal_archive.sh ------------------------------------ #!/bin/bash -xv PG_XLOG="$1" PG_XLOG_FILENAME="$2" HOST=hostname ARCHIVE_DIR="/pg_archive/master" ARCHIVE_TO_KEEP="3" #No of days of archive logs to keep EMAIL="a...@b.com" ERROR_COUNT_FILE="/tmp/replication_archive_error_count.txt" OLD_COUNT=`cat ${ERROR_COUNT_FILE}` DNOW=`date +%u` hour=$(date +%H) D=`date` #Do the cleanup if the day is Monday or Thursday and time is between 11 p.m. UTC and 22 hrs UTC if [ "$DNOW" -eq "1" -o "$DNOW" -eq "4" -a "$hour" -ge 11 -a "$hour" -lt 22 ]; then find "${ARCHIVE_DIR}"/ -type f -mtime +"${ARCHIVE_TO_KEEP}" -exec rm -f {} + if [ "$?" -eq "1" ]; then echo "The wal_archive script could not cleanup the archive directory of $HOST" | mail -s "ERROR - WAL Archive for $HOST" "$EMAIL" fi fi if [ ! -f "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" ]; then cp "${PG_XLOG}" "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" /usr/bin/rsync -W -az "${PG_XLOG}" postgres@standby :"${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" if [ "$?" -ne "0" ]; then #If rsync fails, then remove the copied file from master, increase the error count, and retry. rm -rf "${ARCHIVE_DIR}"/"${PG_XLOG_FILENAME}" NEW_COUNT=`expr $OLD_COUNT + 1` if [ "$NEW_COUNT" -ge "100" ]; then echo -e "${D}""\n""Rsync could not transfer the WAL file from Master to slave." | mail -s "ALERT - WAL Archive for $HOST" "$EMAIL" echo "0" > $ERROR_COUNT_FILE else echo "$NEW_COUNT" > $ERROR_COUNT_FILE fi exit 1 else echo "0" > $ERROR_COUNT_FILE exit 0 fi else exit 0 fi John Britto, M.Sc. DevOps Engineer Sodexis, Inc. www.sodexis.com M: +91-0-8012186991 <j...@sodexis.com> On Thu, Sep 28, 2017 at 1:05 AM, Scott Mead <sco...@openscg.com> wrote: > > > On Wed, Sep 27, 2017 at 2:55 PM, Jerry Sievers <gsiever...@comcast.net> > wrote: > >> John Britto <j...@sodexis.com> writes: >> >> > Hello, >> > >> > I have a streaming replication setup along with WAL archive. >> > >> > archive_command = ‘test ! -f /var/pg_archive/%f && cp %p <archive >> > location>%f && scp %p postgres@192.168.0.123:<archive location>/%f' >> > >> > When the SCP command fails, the master repeatedly tries to send the >> > archived WAL to standby. But during this time, the pg_xlog directly >> > grew with newer WAL files. >> > >> > The streaming replication hasn't had the problem because on my check, >> > the WAL write location on the primary was same with the last WAL >> > location received/replayed in standby. >> > >> > Since the pg_xlog in the master had few newer WAL files, the master >> > archive is lagging to pick the current pg_xlog WAL file. When a new >> > WAL occur in the pg_xlog, Master picks the old WAL file to send to >> > the standby. >> >> Yup Pg is going to handle the unshipped WALs one at a time and it will >> do them in order, oldest (lowest file name) first. >> >> > How should I force the PostgreSQL to batch copy the lagging WAL files >> > to pg_archive and then send to standby. Can I do this manually using >> > rsync? I wonder how PostgreSQL knows the changes because it >> > maintains info in archive_status with extension as .ready and .done. >> >> I suggest you fix your basic archiving routine to complete and exit with >> success to postgres. >> > > +1 > > scp %p host:/archive/%f > if [ "$?" -ne 0 ] > then > echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log > cp %p /localarchive/%f > exit 0 > fi > > Essentially, always make sure that you are returning a 0 to postgres. > If there is a failure, either log it or handle it separately. This code > snippet is *NOT COMPLETE, *there's a lot more to do in order to make it > production ready and recoverable. The biggest issue I've had with scp is > that you have to set and enforce a timeout and trap the timeout. Note, the > above only works until your local machine (or the /localarchive partition) > runs out of space. It's *really* important that you have ultra solid > logging and alerting around this. > > > > >> And as for archive command scripts in general, simpler is better. >> >> If you want to manually ship them in bulk, you may do so but then will >> need to remove the corresponding archive_status/$foo.ready file so that >> postgres won't keep trying to ship the same one. >> > > I'm a huge fan of this strategy, especially if you're sending to a remote > datacenter. > > archive_command.sh: > > cp %p /localarchive/%f > if [ "$?" -ne 0 ] > then > echo "`date`:FAILED:%p" >> /var/log/failed_xlog.log > exit 0 > fi > > send_archive_to_remote.sh > rsync -avzP /localarchive/* host:/archive/ > > > Of course, now you have to deal with files that are removed from the slave > and making sure they get removed from the master appropriately, but, this > is fairly straightforward. > > --Scott > > > > >> HTH >> >> > Please assist. >> > >> > Thanks, >> > >> > John Britto >> > >> > >> > >> >> -- >> Jerry Sievers >> Postgres DBA/Development Consulting >> e: postgres.consult...@comcast.net >> p: 312.241.7800 >> >> >> -- >> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-general >> > > > > -- > -- > Scott Mead > Sr. Architect > *OpenSCG <http://openscg.com>* > http://openscg.com >