On Mon, Sep 06, 2021 at 12:03:32PM -0400, Tom Lane wrote:
> I scraped the buildfarm logs looking for similar failures, and didn't
> find any.  (019_replslot_limit.pl hasn't failed at all in the farm
> since the last fix it received, in late July.)

The interesting bits are in 019_replslot_limit_primary3.log.  In a
failed run, I can see that we immediately get a process termination,
as follows:

2021-09-07 07:52:53.402 JST [22890] LOG: terminating process 23082 to release replication slot "rep3"
2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 07:52:53.442 JST [23082] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

In a successful run, the pattern is different:

2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 09:27:39.832 JST [57114] standby_3 STATEMENT: START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1
2021-09-07 09:27:39.832 JST [57092] LOG: invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
2021-09-07 09:27:39.833 JST [57092] LOG: checkpoint complete: wrote 19 buffers (14.8%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.025 s, sync=0.001 s, total=0.030 s; sync files=0, longest=0.000 s, average=0.000 s; distance=1024 kB, estimate=1024 kB
2021-09-07 09:27:39.833 JST [57092] LOG: checkpoints are occurring too frequently (0 seconds apart)
2021-09-07 09:27:39.833 JST [57092] HINT: Consider increasing the configuration parameter "max_wal_size".
2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'

So the slot invalidation is forgotten in the failed run because we
don't complete a checkpoint that does the work it should do, no?  In
the successful run there is a completed checkpoint before we query
pg_replication_slots, and the buildfarm shows the same thing.

> I wonder if Michael's setup had any unusual settings.

The way I use configure and build options has caught bugs with code
ordering in the past, but this one looks like just a timing issue with
the test itself.  I can only see this on Big Sur 11.5.2, and I just
got fresh logs this morning with a new failure; see the attached.
--
Michael
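
For anyone comparing their own primary3 logs against the two patterns
above, here is a minimal sketch of the check in Python.  It only looks
for the message fragments quoted in this mail.  The function name
invalidated_before_query and the two sample strings are hypothetical
helpers for illustration, not part of the TAP test, and the sketch
assumes the first wal_status query on the slot is the one of interest.

```python
import re

# Message fragments taken from the log excerpts quoted in this mail.
INVALIDATED = re.compile(r'LOG:\s+invalidating slot "(?P<slot>[^"]+)"')
STATUS_QUERY = "SELECT wal_status FROM pg_replication_slots"


def invalidated_before_query(log_text, slot="rep3"):
    """Return True if the invalidation of `slot` is logged before the
    first wal_status query on that slot, False otherwise."""
    slot_filter = "slot_name = '%s'" % slot
    for line in log_text.splitlines():
        match = INVALIDATED.search(line)
        if match and match.group("slot") == slot:
            return True
        if STATUS_QUERY in line and slot_filter in line:
            # The test asked for wal_status before any invalidation
            # message showed up: this is the failed-run pattern.
            return False
    return False


# Abbreviated samples copied from the excerpts above.
FAILED_RUN = """\
2021-09-07 07:52:53.402 JST [22890] LOG: terminating process 23082 to release replication slot "rep3"
2021-09-07 07:52:53.442 JST [23082] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 07:52:53.452 JST [23133] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
"""

SUCCESSFUL_RUN = """\
2021-09-07 09:27:39.832 JST [57114] standby_3 FATAL: terminating connection due to administrator command
2021-09-07 09:27:39.832 JST [57092] LOG: invalidating slot "rep3" because its restart_lsn 0/7000D8 exceeds max_slot_wal_keep_size
2021-09-07 09:27:39.851 JST [57126] 019_replslot_limit.pl LOG: statement: SELECT wal_status FROM pg_replication_slots WHERE slot_name = 'rep3'
"""

if __name__ == "__main__":
    print("failed run:", invalidated_before_query(FAILED_RUN))
    print("successful run:", invalidated_before_query(SUCCESSFUL_RUN))
```

This says nothing about why the invalidation is missing; it only
separates the two log shapes so that a batch of runs can be sorted
quickly.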
replslot_inval.tar.gz
Description: application/tar-gz