BUG? Slave doesn't reconnect to the master

Олег Самойлов Tue, 18 Aug 2020 03:49:42 -0700

Hi all.

I found some strange behaviour in PostgreSQL, which I believe is a bug. First 
of all, let me explain the situation.


I created a test bed for testing high availability clusters based on Pacemaker 
and PostgreSQL. The test bed consists of 12 virtual machines (on VirtualBox) 
running on a MacBook Pro, forming 4 HA clusters with different structures. All 
4 HA clusters are tested constantly in a loop: a failure of some kind is 
simulated, the test waits for failover, the failure is fixed, and so on. For 
simplicity I'll describe only one HA cluster. It consists of 3 virtual 
machines, with the master on one and a sync and an async slave on the others. 
The PostgreSQL service is provided through floating IPs that point to the 
working master and slaves. The slaves connect to the master's floating IP too. 
When Pacemaker detects a failure, for instance on the master, it promotes the 
other node with the least WAL lag to master and switches the floating IPs, so 
the third node continues as a sync slave. My company has decided to release 
this project as open source; I am now finishing the formalities.

Almost everything works fine, but occasionally, rather rarely, a slave doesn't 
reconnect to the new master after a failure. The first case is the 
PostgreSQL-STOP test, in which I use `kill` to send the STOP signal to postgres 
on the master to simulate a freeze. The slave doesn't reconnect to the new 
master, with these errors in its log:

18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10

What is strange is that the error about incorrect WAL is raised after the 
connection is terminated. Well, this can be worked around by turning off the 
WAL receiver timeout (wal_receiver_timeout). With that, PostgreSQL-STOP works 
fine, but the problem still exists in another test. ForkBomb simulates an 
out-of-memory situation. In this case a slave sometimes doesn't reconnect to 
the new master either, with these errors in its log:

10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got 0

By the way, the last error message (the last line in the log) was not always 
the same; it varied between occurrences.
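
Both failures leave the same kind of trace in the slave's log: a FATAL from the 
walreceiver, immediately followed by a LOG line complaining about a WAL record. 
A minimal Python sketch for spotting that pair in a test run could look like 
this. The function name and the regular expression are assumptions built only 
from the log excerpts quoted above; a different log_line_prefix would need a 
different pattern.

```python
import re

# Matches lines such as
#   "18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout".
# The prefix layout (time, [pid], level) is an assumption taken from the excerpts above.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) \[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)

# Message prefixes of the two walreceiver failures quoted in this report.
WALRECEIVER_FATAL = (
    "terminating walreceiver due to timeout",
    "could not receive data from WAL stream",
)


def find_disconnects(lines):
    """Return (fatal_msg, following_log_msg) pairs.

    following_log_msg is the message of the LOG line right after the FATAL,
    or None if the next parsed line is missing or is not a LOG line.
    """
    parsed = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # indented continuation lines do not match and are skipped
            parsed.append((m.group("level"), m.group("msg")))

    pairs = []
    for i, (level, msg) in enumerate(parsed):
        if level == "FATAL" and msg.startswith(WALRECEIVER_FATAL):
            nxt = parsed[i + 1] if i + 1 < len(parsed) else None
            follow = nxt[1] if nxt and nxt[0] == "LOG" else None
            pairs.append((msg, follow))
    return pairs
```

Because the second message varies, the sketch reports whatever LOG line follows 
the FATAL instead of matching a fixed text.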

What I expect as the correct behaviour: the PostgreSQL slave must reconnect to 
the master IP (the floating IP) after wal_retrieve_retry_interval.
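
In a test bed, this expectation can be written as a check with a deadline. The 
Python sketch below is only an illustration and says nothing about how 
PostgreSQL itself retries: `is_streaming` is a hypothetical callable supplied 
by the test (for example, one that looks at the pg_stat_wal_receiver view on 
the slave), and the deadline value is an assumption.

```python
import time


def wait_for_reconnect(is_streaming, retry_interval=5.0, deadline=60.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll is_streaming() every retry_interval seconds until it returns True.

    Returns the elapsed seconds on success, or None if another wait would
    exceed the deadline. The default interval of 5 s mirrors PostgreSQL's
    default wal_retrieve_retry_interval; the 60 s deadline is arbitrary.
    sleep and clock are injectable so the helper can be tested without waiting.
    """
    start = clock()
    while True:
        if is_streaming():
            return clock() - start
        if clock() - start + retry_interval > deadline:
            return None  # the slave never came back: the behaviour reported here
        sleep(retry_interval)
```

A test would call this right after failover and treat a result of None as the 
failure described in this report.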
