Hi,

In [1], Andres reported a bug where PostgreSQL crashes during recovery if the segment containing the redo pointer does not exist. I have attempted to address this issue and am sharing a patch for it.
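For readers less familiar with the WAL layout, here is a minimal sketch in Python (not code from the patch) of how an LSN maps to a segment file, and hence how the redo pointer can land in an earlier file than the checkpoint record. It assumes the default 16 MB wal_segment_size and timeline 1. The LSN values and the helper names (parse_lsn, segment_number, wal_file_name, missing_redo_segment) are made up for illustration.

```python
# Illustrative model only; names and LSN values are hypothetical.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumes the default wal_segment_size


def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segment_number(lsn, seg_size=WAL_SEGMENT_SIZE):
    """Return the number of the WAL segment that contains this LSN."""
    return lsn // seg_size


def wal_file_name(timeline, segno, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-hex-digit WAL file name for a timeline and segment."""
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def missing_redo_segment(redo_lsn, present_files, timeline=1):
    """Return the name of the redo segment if it is absent, else None."""
    name = wal_file_name(timeline, segment_number(parse_lsn(redo_lsn)))
    return None if name in present_files else name


# Hypothetical LSNs: the checkpoint started near the end of segment 1 and
# its record was written after a switch to segment 2.
redo_lsn = "0/1FFFF28"
checkpoint_lsn = "0/2000060"

# Only the file holding the checkpoint record is still on disk.
present = {wal_file_name(1, segment_number(parse_lsn(checkpoint_lsn)))}
print(missing_redo_segment(redo_lsn, present))
```

With these example values the redo pointer falls in segment 1 and the checkpoint record in segment 2, so the checkpoint record is readable while the file recovery must start from is not.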
The problem is that PostgreSQL does not PANIC up front when the redo LSN and the checkpoint LSN are in separate segments and the file containing the redo LSN is missing. Recovery proceeds anyway and later crashes. Andres has provided a detailed analysis of the behavior across different settings and versions; please refer to [1] for more information.

The patch fixes this by verifying that the REDO location exists once we have successfully read the checkpoint record in InitWalRecovery(). This prevents control from reaching PerformWalRecovery() unless the WAL file containing the redo record exists.

A new test script, 044_redo_segment_missing.pl, has been added to validate this. To get the redo record into a different WAL file than the checkpoint record, I wait for the checkpoint start message and then issue a pg_switch_wal(), which should occur before the checkpoint completes. Then I crash the server, and during the restart it should log an appropriate error indicating that it could not find the redo location. Please let me know if there is a better way to reproduce this behavior.

I have tested and verified this with the various scenarios Andres pointed out in [1]. Please note that this patch does not address error checking in StartupXLOG(), CreateCheckPoint(), etc., nor does it focus on cleaning up existing code.

Attaching the patch. Please review and share your feedback.

Thanks to Andres for spotting the bug and providing the detailed report [1].

[1]: https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
0001-Fix-crash-during-recovery-when-redo-segment-is-missi.patch