Hi hackers,

I've recently encountered an issue where a standby crashes when reconnecting to a new primary after a switchover, under certain conditions. Here is the procedure that reproduces the crash:
1) We have three instances: one primary and two standbys (s1 and s2, both using streaming replication).

2) The primary crashed while the standbys' `flushed_lsn` was at the beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the following:
```
FATAL: could not connect to the primary server...
LOG: waiting for WAL to become available at 0/220000ED
```

3) s1 was promoted to the new primary; s1 logged the following:
```
LOG: received promote request
LOG: redo done at 0/21FFFEE8
LOG: selected new timeline ID: 2
```

4) s2's `primary_conninfo` was updated to point to s1; s2 logged the following:
```
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to ...
```

5) s2 began replication with s1 and attempted to fetch `0/22000000` on timeline 2; s2 logged the following:
```
LOG: fetching timeline history file for timeline 2 from primary server
FATAL: could not start WAL streaming: ERROR: requested starting point 0/22000000 on timeline 1 is not this server's history
DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.
LOG: started streaming WAL from primary at 0/22000000 on timeline 2
```

6) A WAL record mismatch caused the walreceiver process to terminate; s2 logged the following:
```
LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0
FATAL: terminating walreceiver process due to administrator command
```

7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the startup process tried to open the corresponding WAL file before it had been created, leading to a crash:
```
PANIC: could not open file "pg_wal/000000020000000000000021": No such file or directory
LOG: startup process was terminated by signal 6: Aborted
```

In this scenario, s2 attempts replication twice. The first attempt starts from `0/22000000` on timeline 2 and sets `walrcv->flushedUpto` to `0/22000000`; when the record mismatch occurs, the walreceiver process is terminated. The second attempt starts from `0/21000000` on timeline 2. The startup process expects the WAL file for that segment to already exist, because `WalRcvFlushRecPtr()` still returns `0/22000000`, but the file is not found, so s2's startup process crashes.

I think recptr should be checked against `walrcv->flushedUpto` in `RequestXLogStreaming()`, so I've created a patch for it; a rough sketch of the idea is shown below.

Best regards,
pixian shi
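To illustrate the idea only (this is a sketch, not the attached patch): `RequestXLogStreaming()` in src/backend/replication/walreceiverfuncs.c currently re-initializes `flushedUpto` only on the first walreceiver startup or on a timeline change. The extra `recptr < walrcv->flushedUpto` condition below is my assumption of the kind of check described above:
```
	/*
	 * Sketch: also re-initialize flushedUpto when the requested start
	 * point is behind the previously reported flush position.  Otherwise
	 * WalRcvFlushRecPtr() keeps returning an LSN (0/22000000 in the
	 * scenario above) for WAL that was never actually streamed on the new
	 * timeline, and the startup process PANICs on the missing segment.
	 */
	if (walrcv->receiveStart == 0 ||
		walrcv->receivedTLI != tli ||
		recptr < walrcv->flushedUpto)	/* hypothetical extra condition */
	{
		walrcv->flushedUpto = recptr;
		walrcv->receivedTLI = tli;
		walrcv->latestChunkStart = recptr;
	}
	walrcv->receiveStart = recptr;
	walrcv->receiveStartTLI = tli;
```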
0001-Fix-walreceiver-set-incorrect-flushedUpto-when-switc.patch