Fix crash during recovery when redo segment is missing

Nitin Jadhav Fri, 21 Feb 2025 03:00:41 -0800

Hi,

In [1], Andres reported a bug where PostgreSQL crashes during recovery
if the segment containing the redo pointer does not exist. I have
attempted to fix this issue and am sharing a patch for it.

The problem occurs when the redo LSN and the checkpoint LSN are in
separate segments and the file containing the redo LSN is missing:
instead of raising a PANIC up front, PostgreSQL carries on and then
crashes. Andres has provided a detailed analysis of the behavior across
different settings and versions; please refer to [1] for more
information.
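To make the triggering condition concrete, here is a minimal standalone
C sketch of the segment arithmetic involved. It is an illustration only,
not server code: the helper names (lsn_to_segno, redo_in_other_segment,
wal_file_name) are hypothetical, and they merely mirror the arithmetic of
PostgreSQL's XLByteToSeg() and XLogFileName() macros under the
assumption of a power-of-two segment size such as the default 16MB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers, not PostgreSQL code. */
typedef uint64_t lsn_t;

/* Segment number containing an LSN (wal_seg_size in bytes). */
static uint64_t lsn_to_segno(lsn_t lsn, uint64_t wal_seg_size)
{
    return lsn / wal_seg_size;
}

/* True when the redo pointer and the checkpoint record fall in
 * different WAL segments, the precondition for the reported crash. */
static bool redo_in_other_segment(lsn_t redo, lsn_t checkpoint,
                                  uint64_t wal_seg_size)
{
    return lsn_to_segno(redo, wal_seg_size) !=
           lsn_to_segno(checkpoint, wal_seg_size);
}

/* Build the 24-hex-digit WAL file name: timeline, then the segment
 * number split into a high part and a low part. */
static void wal_file_name(char *buf, size_t len, uint32_t tli,
                          uint64_t segno, uint64_t wal_seg_size)
{
    /* Segment sizes are assumed to be a power of two. */
    assert(wal_seg_size != 0 && (wal_seg_size & (wal_seg_size - 1)) == 0);
    uint64_t segs_per_id = UINT64_C(0x100000000) / wal_seg_size;
    snprintf(buf, len, "%08X%08X%08X", tli,
             (uint32_t)(segno / segs_per_id),
             (uint32_t)(segno % segs_per_id));
}
```

For example, with 16MB segments a redo pointer at 0/2FFFF00 lives in
segment 2 (file 000000010000000000000002 on timeline 1), while a
checkpoint record at 0/3000028 lives in segment 3; if only the second
file is present, the checkpoint record can be read but the redo record
cannot.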

The issue is resolved by verifying that the REDO location exists once
we have successfully read the checkpoint record in InitWalRecovery().
This prevents control from reaching PerformWalRecovery() unless the WAL
file containing the redo record exists.

A new test script, 044_redo_segment_missing.pl, has been added to
validate this. To place the redo record in a different WAL file from
the checkpoint record, I wait for the checkpoint start message and then
issue pg_switch_wal(), which should take effect before the checkpoint
completes. I then crash the server; on restart, it should log an
appropriate error indicating that it could not find the redo location.
Please let me know if there is a better way to reproduce this behavior.

I have tested and verified this with the various scenarios Andres
pointed out in [1]. Please note that this patch does not address error
checking in StartupXLOG(), CreateCheckPoint(), etc., nor does it
attempt to clean up existing code.
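The fix itself lives in InitWalRecovery() and relies on the server's
own WAL reading code. The standalone C sketch below only models the test
scenario and the idea of an early check, under simplifying assumptions:
the set of available segments is an in-memory array rather than files in
pg_wal, pg_switch_wal() is modeled as moving the insert position to the
next segment boundary, and all names (next_segment_start,
check_redo_segment, segment_present) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;

/* Hypothetical model of a segment switch: WAL insertion continues at
 * the start of the next segment (no-op if already on a boundary). */
static lsn_t next_segment_start(lsn_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    if (lsn % seg_size == 0)
        return lsn;
    return (lsn / seg_size + 1) * seg_size;
}

typedef enum { REDO_OK, REDO_SEGMENT_MISSING } redo_check;

/* Is the given segment number among the available ones? */
static bool segment_present(uint64_t segno, const uint64_t *present,
                            size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (present[i] == segno)
            return true;
    return false;
}

/* Early check in the spirit of the fix: after the checkpoint record
 * has been read, confirm the segment holding the redo pointer is
 * available, so the caller can fail with a clear error instead of
 * entering redo. */
static redo_check check_redo_segment(lsn_t redo, uint64_t seg_size,
                                     const uint64_t *present, size_t n)
{
    return segment_present(redo / seg_size, present, n)
               ? REDO_OK
               : REDO_SEGMENT_MISSING;
}
```

In this model, the redo pointer is captured in segment 2 when the
checkpoint starts, the switch pushes the checkpoint record into segment
3, and removing segment 2 makes the check report REDO_SEGMENT_MISSING
before any redo is attempted.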

Attaching the patch. Please review and share your feedback. Thanks to
Andres for spotting the bug and providing the detailed report [1].

[1]:
https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

0001-Fix-crash-during-recovery-when-redo-segment-is-missi.patch