Hi,

On 2022-04-08 17:55:51 -0400, Tom Lane wrote:
> I tried adjusting the patch so it does guarantee that (as attached),
> and in two out of two tries it got past the archive_cleanup_command
> failure but then hung up waiting for standby2 to promote.
Adding
    $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");
just after
    $node_standby2->start;
makes the tests pass here.

What is that second test really testing?

# Check the presence of temporary files specifically generated during
# archive recovery. To ensure the presence of the temporary history
# file, switch to a timeline large enough to allow a standby to recover
# a history file from an archive. As this requires at least two timeline
# switches, promote the existing standby first. Then create a second
# standby based on the promoted one. Finally, the second standby is
# promoted.

Note "Then create a second standby based on the promoted one." - but that's
not actually what's happening:

$node_standby2->init_from_backup($node_primary, $backup_name,
    has_restoring => 1);

It's created from the original primary, not from the first standby as the
description says...

Both nodes get promoted independently; in a run without valgrind:

standby:
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 2; prev tli 1; time 2022-04-08 17:23:42.96686-07
2022-04-08 17:23:42.966 PDT [2463835][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

standby2:
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  INSERT @ 0/4000058:  - XLOG/END_OF_RECOVERY: tli 3; prev tli 1; time 2022-04-08 17:23:43.307443-07
2022-04-08 17:23:43.307 PDT [2463999][startup][1/0:0][] LOG:  xlog flush request 0/4000058; write 0/4000000; flush 0/4000000

except that standby2 can't choose tli 2 because it finds it already in use.

Sure looks like something is funky with that test.

But I think there's also something funky in the prefetching logic. It seems
to attempt restoring from the archive during prefetching, even though there's
code that appears to try to prevent that?
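For concreteness, here's a sketch of where I put that call, assuming the
surrounding code in src/test/recovery/t/002_archiving.pl (node and variable
names as in the existing test; the comments are my guess at why it helps,
not something the test currently documents):

# Excerpt-style sketch, not a full test: the existing test starts standby2
# with archive recovery enabled; the added line follows right after that.
$node_standby2->start;

# Added: force a WAL segment switch on the already-promoted first standby.
# My assumption is that this pushes the segment containing its
# end-of-recovery record out to the archive, so standby2 stops waiting and
# can be promoted.
$node_standby->safe_psql('postgres', "SELECT pg_switch_wal()");

# Separately, to match the in-test comment ("create a second standby based
# on the promoted one"), standby2 would have to be initialized from a backup
# of $node_standby rather than of $node_primary; the $backup_name2 below is
# hypothetical and does not exist in the current test:
# $node_standby2->init_from_backup($node_standby, $backup_name2,
#     has_restoring => 1);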
On standby2 I can see replay progress like the following:

2022-04-08 17:02:12.310 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024488; LSN 0/30244C8: prev 0/3024448; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 60 flags 0x00
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:12.311 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024488 for Heap/INSERT: off 60 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:12.312 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:13.855 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/30244C8; LSN 0/3024508: prev 0/3024488; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 61 flags 0x00
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:13.856 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/30244C8 for Heap/INSERT: off 61 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:13.857 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not restore file "000000010000000000000004" from archive: child process exited with exit code 1
2022-04-08 17:02:15.413 PDT [2441453][startup][1/0:0][] DEBUG:  could not open file "pg_wal/000000010000000000000004": No such file or directory
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] LOG:  REDO @ 0/3024508; LSN 0/3024548: prev 0/30244C8; xid 725; len 3; blkref #0: rel 1663/5/16384, blk 4 - Heap/INSERT: off 62 flags 0x00
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] DEBUG:  record known xact 725 latestObservedXid 725
2022-04-08 17:02:15.414 PDT [2441453][startup][1/0:0][] CONTEXT:  WAL redo at 0/3024508 for Heap/INSERT: off 62 flags 0x00; blkref #0: rel 1663/5/16384, blk 4
2022-04-08 17:02:15.415 PDT [2441453][startup][1/0:0][] DEBUG:  executing restore command "cp "/home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000004" "pg_wal/RECOVERYXLOG""

Note that we appear to wait between replaying records, even though we
apparently already have the WAL for the next record!

And, interestingly, I'm not seeing the "switched WAL source from stream to
archive after failure" lines I'd expect.

Greetings,

Andres Freund