[ starting a new thread cuz the shared-stats one is way too long ] Andres Freund <and...@anarazel.de> writes: > Add minimal tests for recovery conflict handling.
It's been kind of hidden by other buildfarm noise, but 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. Three of those failures look like [11:08:46.806](105.129s) ok 1 - buffer pin conflict: cursor with conflicting pin established Waiting for replication conn standby's replay_lsn to pass 0/33EF190 on primary [12:01:49.614](3182.807s) # poll_query_until timed out executing this query: # SELECT '0/33EF190' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('standby', 'walreceiver') # expecting this output: # t # last actual query output: # f # with stderr: timed out waiting for catchup at t/031_recovery_conflict.pl line 123. In each of these examples we can see in the standby's log that it detected the expected buffer pin conflict: 2022-04-27 11:08:46.353 UTC [1961604][client backend][2/2:0] LOG: statement: BEGIN; 2022-04-27 11:08:46.416 UTC [1961604][client backend][2/2:0] LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1; 2022-04-27 11:08:46.730 UTC [1961604][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor; 2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] LOG: recovery still waiting after 13.367 ms: recovery conflict on buffer pin 2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] CONTEXT: WAL redo at 0/33E6E80 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0 but then nothing happens until the script times out and kills the test. I think this is showing us a real bug, ie we sometimes fail to cancel the conflicting query. The other example [3] looks different: [01:02:43.582](2.357s) ok 1 - buffer pin conflict: cursor with conflicting pin established Waiting for replication conn standby's replay_lsn to pass 0/342C000 on primary done [01:02:43.747](0.165s) ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict [01:02:43.804](0.057s) not ok 3 - buffer pin conflict: stats show conflict on standby [01:02:43.805](0.000s) [01:02:43.805](0.000s) # Failed test 'buffer pin conflict: stats show conflict on standby' # at t/031_recovery_conflict.pl line 295. [01:02:43.805](0.000s) # got: '0' # expected: '1' Not sure what to make of that --- could there be a race condition in the reporting of the conflict? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-27%2007%3A16%3A51 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-04-21%2021%3A20%3A15 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-04-13%2022%3A45%3A30 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-11%2005%3A40%3A41