On Wed, Jan 10, 2024 at 2:08 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > I've been observing a failure in t/038_save_logical_slots_shutdown.pl > of late on my developer system: > > t/038_save_logical_slots_shutdown.pl .. 1/? > # Failed test 'Check that the slot's confirmed_flush LSN is the same > as the latest_checkpoint location' > # at t/038_save_logical_slots_shutdown.pl line 35. > # Looks like you failed 1 test of 2. > t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1 > (wstat 256, 0x100) > Failed 1/2 subtests > > I did a quick analysis of the failure and commit > https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0 > that introduced this test. I think the issue is that the slot's > confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN > (0/1508018) are not the same: > > tmp_check/log/038_save_logical_slots_shutdown_pub.log: > > 2024-01-10 07:55:44.539 UTC [57621] sub LOG: starting logical > decoding for slot "sub" > 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL: Streaming > transactions committing after 0/1508000, reading WAL from 0/1507FC8. > 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT: START_REPLICATION > SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any', > publication_names '"pub"') > > ubuntu:~/postgres$ pg17/bin/pg_controldata -D > src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/ > Database cluster state: in production > pg_control last modified: Wed Jan 10 07:55:44 2024 > Latest checkpoint location: 0/1508018 > Latest checkpoint's REDO location: 0/1508018 > > But the tests added by t/038_save_logical_slots_shutdown.pl expects > both LSNs to be same: > > sub compare_confirmed_flush > { > # Is it same as the value read from log? > ok( $latest_checkpoint eq $confirmed_flush_from_log, > "Check that the slot's confirmed_flush LSN is the same as the > latest_checkpoint location" > ); > > I suspect that it's quite not right to expect the slot's > confirmed_flush and latest checkpoint location to be same in the test. >
As per my understanding, the reason we expect them to be the same is because we ensure that during shutdown, the walsender sends all the WAL just before shutdown_checkpoint and the confirm_flush also points to the end of WAL record before shutodwn_checkpoint. So, the next starting location should be of shutdown_checkpoint record which should ideally be the same. Do you see this failure consistently? -- With Regards, Amit Kapila.