Hi, I just saw the following failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-04-05%2017%3A47%3A03 after a commit of mine. The symptoms look unrelated though.
[17:54:42.188](258.345s) # poll_query_until timed out executing this query:
# SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
# expecting this output:
# lost
# last actual query output:
# unreserved
# with stderr:
timed out waiting for slot to be lost at /home/bf/bf-build/mylodon/HEAD/pgsql/src/test/recovery/t/019_replslot_limit.pl line 400.

We're expecting "lost" but are getting "unreserved".

At first I thought this was just a race - it's not guaranteed that a checkpoint to remove the WAL files occurs anytime soon. But there might be something else going on - in this case a checkpoint started, but never finished:

2023-04-05 17:50:23.786 UTC [345177] 019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.787 UTC [342404] LOG: checkpoints are occurring too frequently (2 seconds apart)
2023-04-05 17:50:23.787 UTC [342404] HINT: Consider increasing the configuration parameter "max_wal_size".
2023-04-05 17:50:23.787 UTC [342404] LOG: checkpoint starting: wal
2023-04-05 17:50:23.837 UTC [345264] 019_replslot_limit.pl LOG: statement: CREATE TABLE t ();
2023-04-05 17:50:23.839 UTC [345264] 019_replslot_limit.pl LOG: statement: DROP TABLE t;
2023-04-05 17:50:23.840 UTC [345264] 019_replslot_limit.pl LOG: statement: SELECT pg_switch_wal();
2023-04-05 17:50:23.841 UTC [342404] LOG: terminating process 344783 to release replication slot "rep3"
2023-04-05 17:50:23.841 UTC [342404] DETAIL: The slot's restart_lsn 0/7000D8 exceeds the limit by 1048360 bytes.
2023-04-05 17:50:23.841 UTC [342404] HINT: You might need to increase max_slot_wal_keep_size.
2023-04-05 17:50:23.862 UTC [344783] standby_3 FATAL: terminating connection due to administrator command
2023-04-05 17:50:23.862 UTC [344783] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2023-04-05 17:50:23.893 UTC [345314] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

[many repetitions of the above, just differing in time and pid]

2023-04-05 17:54:42.084 UTC [491062] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
2023-04-05 17:54:42.200 UTC [342365] LOG: received immediate shutdown request
2023-04-05 17:54:42.229 UTC [342365] LOG: database system is shut down

Note that a checkpoint started at "17:50:23.787", but didn't finish before the database was shut down.

As far as I can tell, this cannot be caused by checkpoint_timeout, because by the time we get to invalidating replication slots, we already did CheckPointBuffers(), and that's the only thing that delays based on checkpoint_timeout.

ISTM that this indicates that the checkpointer got stuck after signalling 344783. Do you see any other explanation?

Greetings,

Andres Freund
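
PS: For reference, the failing check is a poll_query_until() loop. A minimal sketch, reconstructed from the failure output above (the node variable name is a guess, and the exact code around line 400 of 019_replslot_limit.pl may differ):

	# Wait for the checkpointer to invalidate the slot; poll_query_until()
	# re-runs the SELECT until the output matches 'lost' or it hits its
	# timeout, which is what happened here.
	$node_primary3->poll_query_until(
		'postgres',
		"SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'",
		'lost')
	  or die "timed out waiting for slot to be lost";

So if the checkpointer never gets as far as invalidating the slot, the poll can only ever see "unreserved" until the test times out.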