Andres Freund <and...@anarazel.de> writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.
>> However, assuming it is real, I guess I wonder why we don't say
>> CHECKPOINT_FORCE in standby mode too.
> I guess it might partially be that restartpoints require a checkpoint to have
> happened on the primary. If we used FORCE, we'd have to wait till the next
> checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
> CHECKPOINT; before shutting the standby down.

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time.  skink's history allows
us to bound the failure introduction between 79b716cfb7 and d7ab2a9a3c,
which I think makes it just about certain that it was 5dc0418fab
(Prefetch data referenced by the WAL, take II), though I've not
bisected to be 100% sure.

Adding some debug printouts to ExecuteRecoveryCommand convinces me
that indeed the archive_cleanup_command is NOT getting called by the
problematic CHECKPOINT command.  I surmise based on Andres' comment
above that the standby isn't making a restartpoint for lack of an
available primary checkpoint, which looks to me like it could be a
pre-existing bug in the test case: it's sure not doing anything to
guarantee that the primary's checkpoint record has reached the standby.

I tried adjusting the test so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(.  However, I also
wonder why it is that this test case wasn't occasionally failing
already.

			regards, tom lane
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c8f5ffbaf0..1032d8a388 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -44,13 +44,14 @@ $node_standby->start;
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
 
 # Note the presence of this checkpoint for the archive_cleanup_command
 # check done below, before switching to a new segment.
 $node_primary->safe_psql('postgres', "CHECKPOINT");
 
+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
 # Force archiving of WAL file to make it present on primary
 $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");
 
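
For context, the point of moving the pg_current_wal_lsn() call is the
wait_for_catchup step the test already performs against that LSN further
down (not visible in the hunk above).  The sketch below assumes the stock
002_archiving.pl sequence and the $node_primary/$node_standby objects set
up earlier in the script; it is an illustration of the intended ordering,
not an exact excerpt.  With the LSN taken after the primary's CHECKPOINT,
waiting for the standby to replay up to it also guarantees the checkpoint
record itself has been replayed, so the standby's CHECKPOINT has something
to turn into a restartpoint and archive_cleanup_command should get run:

# Sketch only; assumes the existing wait_for_catchup() call in the test.
$node_primary->safe_psql('postgres', "CHECKPOINT");
my $current_lsn =
  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
# ... switch WAL, generate more content, etc., as the test already does ...

# Wait until the standby has replayed at least up to $current_lsn,
# which now includes the primary's checkpoint record.
$node_primary->wait_for_catchup($node_standby, 'replay', $current_lsn);

# This CHECKPOINT should now produce a restartpoint on the standby,
# and with it a call of archive_cleanup_command.
$node_standby->safe_psql('postgres', "CHECKPOINT");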