On Fri, Jan 12, 2024 at 3:36 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 9:28 AM Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > On Thu, Jan 11, 2024 at 10:03 PM Bharath Rupireddy > > <bharath.rupireddyforpostg...@gmail.com> wrote: > > > > > > On Thu, Jan 11, 2024 at 4:35 PM vignesh C <vignes...@gmail.com> wrote: > > > > > > > > On further analysis, it was found that in the failing test, > > > > CHECKPOINT_SHUTDOWN was started in a new page, so there was the WAL > > > > page header present just before the CHECKPOINT_SHUTDOWN which was > > > > causing the failure. We could alternatively reproduce the issue by > > > > switching the WAL file before restarting the server like in the > > > > attached test change patch. > > > > There are a couple of ways to fix this issue a) one by switching the > > > > WAL before the insertion of records so that the CHECKPOINT_SHUTDOWN > > > > does not get inserted in a new page as in the attached test_fix.patch > > > > b) by using pg_walinspect to check that the next WAL record is > > > > CHECKPOINT_SHUTDOWN. I have to try this approach. > > > > > > > > Thanks to Bharath and Kuroda-san for offline discussions and helping > > > > in getting to the root cause. > > > > > > IIUC, the problem the commit e0b2eed tries to solve is to ensure there > > > are no left-over decodable WAL records between confirmed_flush LSN and > > > a shutdown checkpoint, which is what it is expected from the > > > t/038_save_logical_slots_shutdown.pl. How about we have a PG function > > > returning true if there are any decodable WAL records between the > > > given start_lsn and end_lsn? > > > > > > > But, we already test this in 003_logical_slot during a successful > > upgrade. Having an explicit test to do the same thing has some merits > > but not sure if it is worth it. > > If the code added by commit e0b2eed is covered by the new upgrade > test, why not remove 038_save_logical_slots_shutdown.pl altogether? >
This is a more strict check because it is possible that even if the latest confirmed_flush location is not persisted there is no meaningful decodable WAL between whatever the last confirmed_flush location saved on disk and the shutdown_checkpoint record. Kuroda-San/Vignesh, do you have any suggestion on this one? > > The current test tries to ensure that > > during shutdown after we shutdown walsender and ensures that it sends > > all the wal records and receipts an ack for the same, there is no > > other WAL except shutdown_checkpoint. Vignesh's suggestion (a) makes > > the test robust enough that it shouldn't show spurious failures like > > the current one reported by you, so shall we try to proceed with that? > > Do you mean something like [1]? It ensures the test passes unless any > writes are added (in future) before the publisher restarts in the test > which can again make the tests flaky. How do we ensure no one adds > anything in before the publisher restart > 038_save_logical_slots_shutdown.pl? A note before the restart perhaps? > I am fine with adding the note. -- With Regards, Amit Kapila.