On 2025/05/24 5:41, Kevin K Biju wrote:
Hi,

While creating a logical replication slot, we wait for older transactions to complete to 
reach a "consistent point", which can take a while on busy databases. If we're 
creating a slot on a primary instance, it's pretty clear that we're waiting on a 
transaction:


postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=804;
 pid | wait_event_type |  wait_event   | state  |                             query
-----+-----------------+---------------+--------+----------------------------------------------------------------
 804 | Lock            | transactionid | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');
(1 row)

postgres=# SELECT locktype,transactionid,pid,mode FROM pg_locks WHERE pid=804 AND granted='f';
   locktype    | transactionid | pid |   mode
---------------+---------------+-----+-----------
 transactionid |          6077 | 804 | ShareLock

However, creating a slot on a hot standby behaves very differently when blocked
on a transaction. Querying pg_stat_activity gives us no indication of what the
problem is:


postgres=# SELECT pg_is_in_recovery();
  pg_is_in_recovery
-------------------
  t
(1 row)
postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=5074;
 pid  | wait_event_type | wait_event | state  |                             query
------+-----------------+------------+--------+----------------------------------------------------------------
 5074 |                 |            | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');
And more importantly, a backend in this state cannot be terminated via the 
usual methods and sometimes requires a server restart.


postgres=# SELECT pg_cancel_backend(5074), pg_terminate_backend(5074);
  pg_cancel_backend | pg_terminate_backend
-------------------+----------------------
  t                 | t
(1 row)
postgres=# SELECT pid,wait_event_type,wait_event,state,query FROM pg_stat_activity WHERE pid=5074;
 pid  | wait_event_type | wait_event | state  |                             query
------+-----------------+------------+--------+----------------------------------------------------------------
 5074 |                 |            | active | SELECT pg_create_logical_replication_slot('wow1', 'pgoutput');

I've taken a look at the code that waits on transactions; the function of
interest is XactLockTableWait in lmgr.c. On a primary, it acquires a ShareLock
on the xid, and the lock manager makes this wait both efficient and externally
visible. On a standby, however, it cannot wait on the xid, since the standby is
not running the transaction. It does know from KnownAssignedXids that the
transaction is still running, and so it ends up on this codepath:

	/*
	 * Some uses of this function don't involve tuple visibility -- such
	 * as when building snapshots for logical decoding.  It is possible to
	 * see a transaction in ProcArray before it registers itself in the
	 * locktable.  The topmost transaction in that case is the same xid,
	 * so we try again after a short sleep.  (Don't sleep the first time
	 * through, to avoid slowing down the normal case.)
	 */
	if (!first)
		pg_usleep(1000L);

The quoted comment suggests that this sleep was only meant to be hit a few
times while we wait for the lock to appear so we can wait on it. In the hot
standby case, however, this becomes a polling loop, since the lock will never
appear. There is no interrupt processing in this loop either, so the only way
out is to wait for the transaction on the primary to complete.

I agree with your analysis. It makes sense to me.


Attached is a patch that adds CHECK_FOR_INTERRUPTS before the sleep in 
XactLockTableWait and ConditionalXactLockTableWait. This allows backends 
waiting on a transaction during slot creation on a standby to be interrupted.

+1

It would also be nice if there was a way for users to tell what we're waiting 
on (maybe a different wait event?) but I'd like input on that.

Just an idea: how about calling pgstat_report_wait_start() and _end() around
pg_usleep(), similar to what WaitExceedsMaxStandbyDelay() does?
Since this is more of an improvement than a bug fix, I think
it would be better to make it a separate patch from the CFI addition.
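Concretely, that idea might look something like the following (an illustrative
fragment only, not a compilable patch; the wait event name
WAIT_EVENT_XACT_LOCK_TABLE_WAIT is invented here and a real patch would need
to register a proper wait event):

```c
	if (!first)
	{
		/* Make the sleep visible in pg_stat_activity, as
		 * WaitExceedsMaxStandbyDelay() does around its pg_usleep(). */
		pgstat_report_wait_start(WAIT_EVENT_XACT_LOCK_TABLE_WAIT);
		pg_usleep(1000L);
		pgstat_report_wait_end();
	}
```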


Also, would waiting only 1ms per loop cycle be too aggressive? If so, maybe
we could increase the sleep time gradually during recovery (but not beyond
1 second), again similar to WaitExceedsMaxStandbyDelay().

Regards,

--
Fujii Masao
NTT DATA Japan Corporation


