Hi, It looks like there's missing ConditionVariableCancelSleep() in InvalidatePossiblyObsoleteSlot() after waiting for the replication slot to be released by another process. Although prepare to sleep cancels the sleep if the previous one wasn't canceled, the calling process still remains in the CV's wait queue for the last cycle after the replication slot holding process releases the slot. I'm wondering why there's no issue detected if the calling process isn't removed from the CV's wait queue. It may be due to the cancel sleep call in the sigsetjmp() path for almost every process. IMO, it's a clean practice to cancel the sleep immediately after the sleep ends like elsewhere.
Please find the attached patch to fix the issue. Alternatively, we can just add ConditionVariableCancelSleep() outside of the for loop to cancel the sleep of the last cycle if any. This also looks correct because every prepare to sleep does ensure the previous sleep is canceled, and if no sleep at all, the cacnel sleep call exits. PS: I've encountered an assertion failure [1] some time back sporadically in t/001_rep_changes.pl which I've reported in https://www.postgresql.org/message-id/CALj2ACXO6TJ4rhc3Uz3MWJGob9e4H1C71qufH-DGKJh8k4QGZA%40mail.gmail.com. I'm not so sure if this patch fixes the assertion failure. I ran the tests for 100 times [2] on my dev system, I didn't see any issue with the patch. [1] t/001_rep_changes.pl 2024-01-31 12:24:38.474 UTC [840166] pg_16435_sync_16393_7330237333761601891 STATEMENT: DROP_REPLICATION_SLOT pg_16435_sync_16393_7330237333761601891 WAIT TRAP: failed Assert("list->head != INVALID_PGPROCNO"), File: "../../../../src/include/storage/proclist.h", Line: 101, PID: 840166 postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(ExceptionalCondition+0xbb)[0x55c8edf6b8f9] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(+0x6637de)[0x55c8edd517de] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(ConditionVariablePrepareToSleep+0x85)[0x55c8edd51b91] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(ReplicationSlotAcquire+0x142)[0x55c8edcead6b] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(ReplicationSlotDrop+0x51)[0x55c8edceb47f] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(+0x60da71)[0x55c8edcfba71] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(exec_replication_command+0x47e)[0x55c8edcfc96a] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(PostgresMain+0x7df)[0x55c8edd7d644] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(+0x5ab50c)[0x55c8edc9950c] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(+0x5aab21)[0x55c8edc98b21] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(+0x5a70de)[0x55c8edc950de] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(PostmasterMain+0x1534)[0x55c8edc949db] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(+0x459c47)[0x55c8edb47c47] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f19fe629d90] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f19fe629e40] postgres: publisher: walsender ubuntu postgres [local] DROP_REPLICATION_SLOT(_start+0x25)[0x55c8ed7c4565] 2024-01-31 12:24:38.476 UTC [840168] pg_16435_sync_16390_7330237333761601891 LOG: statement: SELECT a.attnum, a.attname, a.atttypid, a.attnum = ANY(i.indkey) FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_index i ON (i.indexrelid = pg_get_replica_identity_index(16391)) WHERE a.attnum > 0::pg_catalog.int2 AND NOT a.attisdropped AND a.attgenerated = '' AND a.attrelid = 16391 ORDER BY a.attnum [2] for i in {1..100}; do make check PROVE_TESTS="t/001_rep_changes.pl"; if [ $? -ne 0 ]; then echo "The command failed on iteration $i"; break; fi; done -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
v1-0001-Add-missing-ConditionVariableCancelSleep-in-slot..patch
Description: Binary data