On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 4:35 PM vignesh C <vignes...@gmail.com> wrote: > > > > On further analysis, it was found that in the failing test, > > CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL > > page header present just before the CHECKPOINT_SHUTDOWN which was > > causing the failure. We could alternatively reproduce the issue by > > switching the WAL file before restarting the server like in the > > attached test change patch. > > There are a couple of ways to fix this issue a) one by switching the > > WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN > > does not get inserted in a new page as in the attached test_fix.patch > > b) by using pg_walinspect to check that the next WAL record is > > CHECKPOINT_SHUTDOWN. I have to try this approach. > > > > Thanks to Bharath and Kuroda-san for offline discussions and helping > > in getting to the root cause. > > IIUC, the problem the commit e0b2eed tries to solve is to ensure there > are no left-over decodable WAL records between confirmed_flush LSN and > a shutdown checkpoint, which is what it is expected from the > t/038_save_logical_slots_shutdown.pl. How about we have a PG function > returning true if there are any decodable WAL records between the > given start_lsn and end_lsn? >
But, we already test this in 003_logical_slot during a successful upgrade. Having an explicit test to do the same thing has some merits but not sure if it is worth it. The current test tries to ensure that during shutdown after we shutdown walsender and ensures that it sends all the wal records and receipts an ack for the same, there is no other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes the test robust enough that it shouldn't show spurious failures like the current one reported by you, so shall we try to proceed with that? -- With Regards, Amit Kapila.