At Thu, 29 Feb 2024 14:05:15 +0900, Michael Paquier <mich...@paquier.xyz> wrote in
> On Wed, Feb 28, 2024 at 11:19:41AM +0100, Alexander Kukushkin wrote:
> > I spent some time debugging an issue with standby not being able to
> > continue streaming after failover.
> >
> > The problem happens when standbys received only the first part of the WAL
> > record that spans multiple pages.
> > In this case the promoted standby discards the first part of the WAL record
> > and writes END_OF_RECOVERY instead. If in addition to that someone will
> > call pg_switch_wal(), then there are chances that SWITCH record will also
> > fit to the page where the discarded part was settling, As a result the
> > other standby (that wasn't promoted) will infinitely try making attempts to
> > decode WAL record span on multiple pages by reading the next page, which is
> > filled with zero bytes. And, this next page will never be written, because
> > the new primary will be writing to the new WAL file after pg_switch_wal().
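
To make the reported scenario concrete, here is a toy model of a reader that tries to assemble a record spanning several pages. It is only a sketch and not PostgreSQL code: the page size, the function name try_read_record, and the layout are assumptions for illustration, and real WAL pages also carry headers that are ignored here.

```python
PAGE_SIZE = 16  # toy value chosen for readability, not the real WAL page size


def try_read_record(pages, page_no, offset, total_len):
    """Assemble a record of total_len bytes starting at (page_no, offset).

    Returns the record bytes, or None when a continuation page is missing
    or still zero-filled, in which case a real reader would wait and retry.
    """
    buf = b""
    while len(buf) < total_len:
        # A page that was never written is modeled as absent or all zeros.
        if page_no >= len(pages) or not any(pages[page_no]):
            return None
        remaining = total_len - len(buf)
        # Slicing clamps at the page end, so at most one page is consumed.
        buf += pages[page_no][offset:offset + remaining]
        page_no += 1
        offset = 0  # continuation data starts at the top of the next page
    return buf


# The standby holds only the first part of the record; the next page is zeros.
partial = [b"A" * PAGE_SIZE, bytes(PAGE_SIZE)]
# The same record once the continuation page has actually been written.
complete = [b"A" * PAGE_SIZE, b"B" * PAGE_SIZE]

stuck = try_read_record(partial, 0, 8, 20)
done = try_read_record(complete, 0, 8, 20)
```

In this model the first call can never succeed, however often it is retried, unless someone writes the second page. That is the situation described above when the new primary has already moved on to the next WAL file.
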
In the first place, it's important to note that we do not guarantee
that an async standby can always switch its replication connection to
the old primary or another sibling standby. This is due to the
variations in replication lag among standbys. pg_rewind is required to
adjust such discrepancies.

I might be overlooking something, but I don't understand how this
occurs without purposefully tweaking WAL files. The repro script
pushes an incomplete WAL file to the archive as a non-partial
segment. This shouldn't happen in the real world.

In the repro script, the replication connection of the second standby
is switched from the old primary to the first standby after its
promotion. After the switch, replication is expected to continue
from the beginning of the last replayed segment. But with the script,
the second standby copies the intentionally broken file, which differs
from the data that should be received via streaming.

A problem similar to this one was seen at segment boundaries before we
introduced the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag, which prevents
overwriting a WAL file that is already archived. However, in this
case, the second standby won't see the broken record because it cannot
be in a non-partial segment in the archive, and the new primary
streams END_OF_RECOVERY instead of the broken record.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center