Re: PostgreSQL-12 replication failover, pg_rewind fails

Mariya Rampurawala Tue, 12 May 2020 00:25:00 -0700

Hello,
Can someone please help me with the below query?

Regards,
Mariya


From: Mariya Rampurawala <mariya.rampuraw...@veritas.com>
Date: Sunday, 10 May 2020 at 2:55 PM
To: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>, 
"pgsql-gene...@postgresql.org" <pgsql-gene...@postgresql.org>
Subject: PostgreSQL-12 replication failover, pg_rewind fails

Hi,

I am working on providing HA for replication, using automation scripts.
My set up consists on two nodes, Master and Slave. When master fails, The slave 
is promoted to master. But when I try to re-register the old master as slave, 
the pg_rewind command fails. Details below.

Non-default entries in Postgresql.conf

===================================
archive_mode = on
archive_command = 'scp %p root@10.209.56.88:/pg_backup/%f'
archive_cleanup_command = '/bin/ssh root@10.209.56.88 
"/usr/pgsql-12/bin/pg_archivecleanup /pg_backup %r"'

synchronous_standby_names = '*'

listen_addresses = '*'

wal_level = hot_standby

max_wal_senders = 32

max_replication_slots = 8

hot_standby = on

wal_log_hints = on

synchronous_commit = on

hot_standby_feedback = on

restart_after_crash = off

max_connections = 100

===================================



Postgresql.auto.conf at standby:

===================================
primary_conninfo = 'user=replicator host=10.209.57.16 port=5432 sslmode=prefer 
sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
restore_command = 'scp  root@10.209.56.88:/pg_backup/%f %p'
recovery_target_timeline='latest'

===================================





Reproduction steps:

  1.  Start master:

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/  start

waiting for server to start....2020-05-11 05:26:01.665 IST [23999] LOG:  
starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit

2020-05-11 05:26:01.665 IST [23999] LOG:  listening on IPv4 address "0.0.0.0", 
port 5432

2020-05-11 05:26:01.665 IST [23999] LOG:  listening on IPv6 address "::", port 
5432

2020-05-11 05:26:01.666 IST [23999] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"

2020-05-11 05:26:01.668 IST [23999] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"

2020-05-11 05:26:01.688 IST [23999] LOG:  redirecting log output to logging 
collector process

2020-05-11 05:26:01.688 IST [23999] HINT:  Future log output will appear in 
directory "log".

done

server started

  1.  Basebackup on slave:

/usr/pgsql-12/bin/pg_basebackup -D /pg_mnt/pg-12/data -X stream -R -c fast -U 
replicator --host=10.209.57.16 --port=5432

  1.  Copy postgresql.auto.conf to slave data directory and touch standby.signal
  2.  Start slave

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data start

waiting for server to start....2020-05-11 05:29:09.422 IST [22624] LOG:  
starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit

2020-05-11 05:29:09.422 IST [22624] LOG:  listening on IPv4 address "0.0.0.0", 
port 5432

2020-05-11 05:29:09.422 IST [22624] LOG:  listening on IPv6 address "::", port 
5432

2020-05-11 05:29:09.423 IST [22624] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"

2020-05-11 05:29:09.425 IST [22624] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"

2020-05-11 05:29:09.442 IST [22624] LOG:  redirecting log output to logging 
collector process

2020-05-11 05:29:09.442 IST [22624] HINT:  Future log output will appear in 
directory "log".

.. done

server started

  1.  Check replication status on master:

-bash-4.2$ psql

psql (12.2)

Type "help" for help.



postgres=# \x

Expanded display is on.

postgres=# select * from pg_stat_replication;

-[ RECORD 1 ]----+---------------------------------

pid              | 24988

usesysid         | 16385

usename          | replicator

application_name | walreceiver

client_addr      | 10.209.57.14

client_hostname  |

client_port      | 40802

backend_start    | 2020-05-11 05:29:14.752341+05:30

backend_xmin     | 588

state            | streaming

sent_lsn         | 5/A60000D8

write_lsn        | 5/A60000D8

flush_lsn        | 5/A60000D8

replay_lsn       | 5/A60000D8

write_lag        |

flush_lag        |

replay_lag       |

sync_priority    | 1

sync_state       | sync

reply_time       | 2020-05-11 05:29:54.895857+05:30



  1.  From a third node write large data on master
#psql -h 10.209.57.16 --user=postgres
psql (12.2)
Type "help" for help.

postgres=# \c vcs
You are now connected to database "vcs" as user "postgres".
vcs=# create table c1 as select generate_series(1,10000000000);


  1.  Kill postgres processes on master:
-bash-4.2$ ps -ef | grep post
root      1860 18990  0 May08 pts/1    00:00:00 vim postgresql.cluster.conf
root      1925 14243  0 Apr07 pts/0    00:00:00 su - postgres
postgres  1926  1925  0 Apr07 pts/0    00:00:00 -bash
root      7321     1  0 Mar14 ?        00:00:15 /usr/libexec/postfix/master -w
postfix   7350  7321  0 Mar14 ?        00:00:03 qmgr -l -t unix -u
postfix  10410  7321  0 04:47 ?        00:00:00 pickup -l -t unix -u
postgres 23999     1  0 05:26 ?        00:00:00 /usr/pgsql-12/bin/postgres -D 
/pg_mnt/pg-12/data
postgres 24000 23999  0 05:26 ?        00:00:00 postgres: logger
postgres 24002 23999  0 05:26 ?        00:00:00 postgres: checkpointer
postgres 24003 23999  0 05:26 ?        00:00:00 postgres: background writer
postgres 24004 23999  0 05:26 ?        00:00:01 postgres: walwriter
postgres 24005 23999  0 05:26 ?        00:00:00 postgres: autovacuum launcher
postgres 24006 23999  0 05:26 ?        00:00:00 postgres: archiver   archiving 
0000003500000005000000C7
postgres 24007 23999  0 05:26 ?        00:00:00 postgres: stats collector
postgres 24008 23999  0 05:26 ?        00:00:00 postgres: logical replication 
launcher
postgres 24988 23999  1 05:29 ?        00:00:03 postgres: walsender replicator 
10.209.57.14(40802) streaming 6/4A480000
postgres 25906 23999 76 05:31 ?        00:00:47 postgres: postgres vcs 
10.209.57.15(58642) CREATE TABLE AS
postgres 26240 24006  1 05:32 ?        00:00:00 scp 
pg_wal/0000003500000005000000C7 
root@10.209.56.88:/pg_backup/0000003500000005000000C7
postgres 26241 26240  0 05:32 ?        00:00:00 /usr/bin/ssh -x 
-oForwardAgent=no -oPermitLocalCommand=no -oClearAllForwardings=yes -l root -- 
10.209.56.88 scp -t /pg_backup/0000003500000005000000C7
postgres 26242 32277  0 05:32 pts/0    00:00:00 ps -ef
postgres 26243 32277  0 05:32 pts/0    00:00:00 grep --color=auto post
root     32276 24501  0 May10 pts/0    00:00:00 su - postgres
postgres 32277 32276  0 May10 pts/0    00:00:00 -bash
-bash-4.2$ kill -9 23999
-bash-4.2$ ps -ef | grep post
root      1860 18990  0 May08 pts/1    00:00:00 vim postgresql.cluster.conf
root      1925 14243  0 Apr07 pts/0    00:00:00 su - postgres
postgres  1926  1925  0 Apr07 pts/0    00:00:00 -bash
root      7321     1  0 Mar14 ?        00:00:15 /usr/libexec/postfix/master -w
postfix   7350  7321  0 Mar14 ?        00:00:03 qmgr -l -t unix -u
postfix  10410  7321  0 04:47 ?        00:00:00 pickup -l -t unix -u
postgres 24000     1  0 05:26 ?        00:00:00 postgres: logger
postgres 25906     1 81 05:31 ?        00:01:30 postgres: postgres vcs 
10.209.57.15(58642) CREATE TABLE AS
postgres 26514 32277  0 05:33 pts/0    00:00:00 ps -ef
postgres 26515 32277  0 05:33 pts/0    00:00:00 grep --color=auto post
root     32276 24501  0 May10 pts/0    00:00:00 su - postgres
postgres 32277 32276  0 May10 pts/0    00:00:00 -bash


  1.  Promote slave:

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data promote

waiting for server to promote..... done

server promoted

-bash-4.2$ psql

psql (12.2)

Type "help" for help.



postgres=# checkpoint;

CHECKPOINT

postgres=#

  1.  Re-register old master as standby

-bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data 
--source-server="host=10.209.57.17  port=5432 user=postgres dbname=postgres"

pg_rewind: fatal: target server must be shut down cleanly

  1.  Start and stop db to finish recovery (old master)

-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart start

pg_ctl: another server might be running; trying to start server anyway

waiting for server to start....2020-05-11 05:36:29.005 IST [27684] LOG:  
starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit

2020-05-11 05:36:29.006 IST [27684] LOG:  listening on IPv4 address "0.0.0.0", 
port 5432

2020-05-11 05:36:29.006 IST [27684] LOG:  listening on IPv6 address "::", port 
5432

2020-05-11 05:36:29.007 IST [27684] LOG:  listening on Unix socket 
"/var/run/postgresql/.s.PGSQL.5432"

2020-05-11 05:36:29.008 IST [27684] LOG:  listening on Unix socket 
"/tmp/.s.PGSQL.5432"

2020-05-11 05:36:29.027 IST [27684] LOG:  redirecting log output to logging 
collector process

2020-05-11 05:36:29.027 IST [27684] HINT:  Future log output will appear in 
directory "log".

........................................................... stopped waiting

pg_ctl: server did not start in time

 -bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ status

pg_ctl: server is running (PID: 27684)

/usr/pgsql-12/bin/postgres "-D" "/pg_mnt/pg-12/data"



-bash-4.2$ /usr/pgsql-12/bin/pg_ctl -D /pg_mnt/pg-12/data/ -m smart stop

waiting for server to shut down..... done

server stopped



  1.  Rewind again:
  2.  -bash-4.2$ /usr/pgsql-12/bin/pg_rewind -D /pg_mnt/pg-12/data 
--source-server="host=10.209.57.17  port=5432 user=postgres dbname=postgres"

pg_rewind: servers diverged at WAL location 6/B9FFFFD8 on timeline 53

pg_rewind: error: could not open file 
"/pg_mnt/pg-12/data/pg_wal/0000003500000006000000B9": No such file or directory

pg_rewind: fatal: could not find previous WAL record at 6/B9FFFFD8




I have tried this multiple times but always face the same error. Can someone 
help me resolve this?




Thanks and Regards,
Mariya

Re: PostgreSQL-12 replication failover, pg_rewind fails

Reply via email to