Hi Alexander,

On Wed, Mar 4, 2026 at 3:00 AM Alexander Lakhin <[email protected]> wrote:
>
> Hello Heikki,
>
> 03.03.2026 17:39, Heikki Linnakangas wrote:
> > On 24/02/2026 10:00, Alexander Lakhin wrote:
> >> The "terminating process ..." message doesn't appear when the test
> >> passes successfully.
> >
> > Hmm, right, looks like something is wrong in signaling the recovery
> > conflict. I can't tell whether the signal is not being sent, or it is
> > not being processed correctly. Looking at the code, I don't see
> > anything wrong.
> >
> >> I've managed to reproduce this with multiple (20) test instances
> >> running in a loop (it failed within 10 iterations for me); `git bisect`
> >> for this anomaly pointed at 17f51ea81.
> >
> > I've been trying to reproduce this locally, but so far no success,
> > after thousands of iterations.
> >
> > If you can still reproduce this, can you try it with the code changes
> > from the attached recovery-conflict-fail-extra-logging.patch, which
> > adds some extra logging, and send over the logs, please?
> >
> > The recovery-conflict-fail-repro-attempt.patch contains very hacky
> > changes to the test, to run just the failing part 100 times in a loop.
> > That's just to show what I used to try to reproduce this, but no luck.
>
> Thank you for your attention to it!
>
> I can easily reproduce the failure with the attached script when using an
> SSD (the test didn't fail for me on tmpfs or HDD; probably some tuning is
> needed):
> ITERATION 1
> ...
> 19 t/035_standby_logical_decoding.pl .. ok
> 19 All tests successful.
> 19 Files=1, Tests=28, 20 wallclock secs ( 0.02 usr 0.01 sys + 0.35 cusr 0.80 csys = 1.18 CPU)
> 19 Result: PASS
> 2 # poll_query_until timed out executing this query:
> 2 # SELECT '0/0403F950' <= replay_lsn AND state = 'streaming'
> 2 # FROM pg_catalog.pg_stat_replication
> 2 # WHERE application_name IN ('standby', 'walreceiver')
> 2 # expecting this output:
> 2 # t
> 2 # last actual query output:
> 2 # f
> 2 # with stderr:
> ...
>
> 035_standby_logical_decoding_standby.log contains:
> 2026-03-03 20:34:14.198 EET startup[844699] LOG: invalidating obsolete replication slot "row_removal_inactiveslot"
> 2026-03-03 20:34:14.198 EET startup[844699] DETAIL: The slot conflicted with xid horizon 748.
> 2026-03-03 20:34:14.198 EET startup[844699] CONTEXT: WAL redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1, dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-03-03 20:34:14.198 EET startup[844699] LOG: terminating process 845647 to release replication slot "row_removal_activeslot"
> 2026-03-03 20:34:14.198 EET startup[844699] DETAIL: The slot conflicted with xid horizon 748.
> 2026-03-03 20:34:14.198 EET startup[844699] CONTEXT: WAL redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1, dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-03-03 20:34:14.198 EET startup[844699] LOG: XXX: SendProcSignal sending SIGUSR1 to pid 845647
> 2026-03-03 20:34:14.198 EET startup[844699] CONTEXT: WAL redo at 0/040214F0 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 748, isCatalogRel: T, nplans: 0, nredirected: 0, ndead: 3, nunused: 1, dead: [33, 34, 35], unused: [36]; blkref #0: rel 1663/16384/16418, blk 10
> 2026-03-03 20:37:26.827 EET walreceiver[844817] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
> 	This probably means the server terminated abnormally
> 	before or while processing the request.
>
> The full logs are attached too. I can add any other logging you want.
>
> Best regards,
> Alexander
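For reference, the loop I used for my attempt follows the recipe described above (several concurrent test instances, repeated until one fails). This is only my own sketch of that recipe, not the contents of Alexander's attached script; TEST_CMD, INSTANCES, and ITERATIONS are placeholders of mine:

```shell
# Rough sketch: run INSTANCES copies of a test command in parallel,
# ITERATIONS times, stopping at the first failing iteration.
# TEST_CMD is a placeholder (default "true" so the sketch runs as-is);
# point it at something like:
#   TEST_CMD="prove -I src/test/perl t/035_standby_logical_decoding.pl"
TEST_CMD="${TEST_CMD:-true}"
INSTANCES="${INSTANCES:-20}"
ITERATIONS="${ITERATIONS:-10}"

i=1
while [ "$i" -le "$ITERATIONS" ]; do
    echo "ITERATION $i"
    pids=""
    n=1
    while [ "$n" -le "$INSTANCES" ]; do
        $TEST_CMD &            # each instance runs in the background
        pids="$pids $!"
        n=$((n + 1))
    done
    failed=0
    for p in $pids; do         # collect every instance's exit status
        wait "$p" || failed=1
    done
    if [ "$failed" -ne 0 ]; then
        echo "failure in iteration $i"
        break
    fi
    i=$((i + 1))
done
```

With TEST_CMD pointed at a prove invocation of the TAP test, any nonzero exit stops the loop at the failing iteration so its data directories and logs can be inspected.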
I was unable to reproduce the issue on an x86_64 Linux machine using the
provided script; all test runs passed.

--
Best,
Xuneng
