Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.
Adding
    $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
    $node_standby2->start;
makes the tests pass here.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
    has_restoring => 1);

It's created from the original primary, not from the first standby as the
description says...

Both nodes get promoted independently; in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 17:23:43.307443-07
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

except that standby2 can't choose tli 2 because it finds it already in use.

Sure looks like something is funky with that test.

But I think there's also something funky in the prefetching logic. It seems
to attempt restoring from the archive during prefetching, even though there's
code that appears to try to prevent that?
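For concreteness, here's a sketch of where I put that call, assuming the
surrounding code in src/test/recovery/t/002_archiving.pl (node and variable
names as in the existing test; the comments are my guess at why it helps,
not something the test currently documents):

# Excerpt-style sketch, not a full test: the existing test starts standby2
# with archive recovery enabled; the added line follows right after that.
$node_standby2->start;

# Added: force a WAL segment switch on the already-promoted first standby.
# My assumption is that this pushes the segment containing its
# end-of-recovery record out to the archive, so standby2 stops waiting and
# can be promoted.
$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");

# Separately, to match the in-test comment ("create a second standby based
# on the promoted one"), standby2 would have to be initialized from a backup
# of $node_standby rather than of $node_primary; the $backup_name2 below is
# hypothetical and does not exist in the current test:
# $node_standby2->init_from_backup($node_standby, $backup_name2,
#     has_restoring => 1);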
On standby2 I can see replay progress like the following:

2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""

Note that we appear to wait between replaying records, even though we
apparently already have the WAL for the next record!

And, interestingly, I'm not seeing the "switched WAL source from stream to
archive after failure" lines I'd expect.

Greetings,

Andres Freund