Hi,

On CI [1] the new t/031_recovery_conflict.pl is failing occasionally. That's interesting, because I ran it there dozens if not hundreds of times before committing, with - I think - only cosmetic changes since.
I've reproduced it in a private branch, with more logging. And the results are sure interesting.

https://cirrus-ci.com/task/6448492666159104
https://api.cirrus-ci.com/v1/artifact/task/6448492666159104/log/src/test/recovery/tmp_check/log/031_recovery_conflict_standby.log

The primary is waiting for 0/343A000 to be applied, which requires a recovery conflict to be detected and resolved. On the standby there's the following sequence (some lines omitted).

Prerequisite for the recovery conflict:

2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG:  statement: BEGIN;
2022-04-09 04:05:31.292 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG:  statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1;
2022-04-09 04:05:31.293 UTC [35071][client backend] [031_recovery_conflict.pl][2/2:0] LOG:  statement: FETCH FORWARD FROM test_recovery_conflict_cursor;

Detecting the conflict:

2022-04-09 04:05:31.382 UTC [35038][startup] LOG:  recovery still waiting after 28.821 ms: recovery conflict on buffer pin
2022-04-09 04:05:31.382 UTC [35038][startup] CONTEXT:  WAL redo at 0/3432800 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0

And then nothing until the timeout:

2022-04-09 04:09:19.317 UTC [35035][postmaster] LOG:  received immediate shutdown request
2022-04-09 04:09:19.317 UTC [35035][postmaster] DEBUG:  sending signal 3 to process 35071
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  reaping dead processes
2022-04-09 04:09:19.320 UTC [35035][postmaster] DEBUG:  server process (PID 35071) exited with exit code 2

Afaics that has to mean something is broken around sending, receiving or processing of recovery conflict interrupts.

All the failures so far were on FreeBSD, from what I can see. There were other failures in other tests, but I think those were for things that have since been reverted or fixed.

Other than the fact that it didn't trigger while the shmstats patch was in development, it's hard to tell whether this is a regression or just a longstanding bug - we never had tests for recovery conflicts...

I don't really see how recovery prefetching could play a role here - clearly we've been trying to replay the record, so the problem has to be elsewhere...

Greetings,

Andres Freund

[1] https://cirrus-ci.com/github/postgres/postgres/master
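PS: For anyone poking at a stuck run interactively, a quick sanity check on the standby (just a sketch, using the standard stats views, nothing specific to this test) would be something like:

    -- how far replay has gotten vs. the 0/343A000 the primary is waiting for
    SELECT pg_last_wal_replay_lsn();
    -- whether the buffer pin conflict was ever resolved and counted
    SELECT datname, confl_bufferpin FROM pg_stat_database_conflicts;

Given the logs above I'd expect replay to be stuck at the PRUNE record and the conflict counter to stay at 0, since the backend never seems to act on the interrupt.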