Hi, hackers,

I've recently encountered an issue where a standby crashes when
reconnecting to a new primary after a switchover under certain conditions.
Here's a procedure of the crash scenario:


1) We have three instances: one primary and two standbys (s1 and s2, both
using streaming replication).


2) The primary crashed when the standby’s `flushed_lsn` was at the
beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the
following:

   ```

   FATAL: could not connect to the primary server...

   LOG: waiting for WAL to become available at 0/220000ED

   ```


3) s1 was promoted to the new primary, s1 logged the following:

   ```

   LOG: received promote request

   LOG: redo done at 0/21FFFEE8

   LOG: selected new timeline ID: 2

   ```


4) s2's `primary_conninfo` was updated to point to s1, s2 logged the
following:

   ```

   LOG: received SIGHUP, reloading configuration files

   LOG: parameter "primary_conninfo" changed to ...

   ```


5) s2 began replication with s1 and attempted to fetch `0/22000000` on
timeline 2, s2 logged the following:

   ```

   LOG: fetching timeline history file for timeline 2 from primary server

   FATAL: could not start WAL streaming: ERROR: requested starting point
0/22000000 on timeline 1 is not this server's history

   DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.

   LOG: started streaming WAL from primary at 0/22000000 on timeline 2

   ```


6) WAL record mismatch caused the walreceiver process to terminate, s2
logged the following:

   ```

   LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0

   FATAL: terminating walreceiver process due to administrator command

   ```


7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the
startup process failed to open the WAL file before it was created, leading
to a crash:

   ```

   PANIC: could not open file "pg_wal/000000020000000000000021": No such
file or directory

   LOG: startup process was terminated by signal 6: Aborted

   ```


In this scenario, s2 attempts replication twice. First, it starts from
`0/22000000` on timeline 2, setting `walrcv->flushedUpto` to `0/22000000`.
But when a mismatch occurs, the walreceiver process is terminated. On the
second attempt, replication starts from `0/21000000` on timeline 2. The
startup process expects the WAL file to exist because WalRcvFlushRecPtr()
return `0/22000000`, but the file is not found, causing s2's startup
process to crash.

I think it should check recptr and walrcv->flushedUpto when Request
XLogStreming, so I create a patch for it.


Best regards,
pixian shi

Attachment: 0001-Fix-walreceiver-set-incorrect-flushedUpto-when-switc.patch
Description: Binary data

Reply via email to