On 2025/06/24 1:32, Xuneng Zhou wrote:
Hi,


Here's patch version 4.


1. The problem

I conducted further investigation on this issue. The 1ms sleep in XactLockTableWait that 
falls back to polling was not problematic in certain scenarios prior to v16. It became an 
potential issue after the "Allow logical decoding on standbys" feature was 
introduced [1]. After that commit, we can now create logical replication slots on hot 
standby servers and perform many other logical decoding operations. [2]



2. The analysis

Is this issue limited to calling sites related to logical decoding? I believe 
so. As we've discussed, the core issue is that primary transactions are not 
running locally on the standby, which breaks the assumption of 
XactLockTableWait—that there is an xid lock to wait on. Other call stacks of 
this function, except those from logical decoding, are unlikely to encounter 
this problem because they are unlikely to be invoked from standby servers.


Analysis of XactLockTableWait calling sites:


Problematic Case (Hot Standby):

- Logical decoding operations in snapbuild.c during snapshot building

- This is thecase that can run on hot standby and experience the issue


Non-problematic Cases (Primary Only):

- Heap operations (heapam.c): DML operations not possible on read-only standby

- B-tree operations (nbtinsert.c): Index modifications impossible on standby

- Logical apply workers (execReplication.c): Cannot start during recovery 
(BgWorkerStart_RecoveryFinished)

- All other cases: Require write operations unavailable on standby



3. The proposed solution

If the above analysis is sound, one potential fix would be to add separate branching for standby in XactLockTableWait. However, this seems inconsistent with the function's definition—there's simply no lock entry in the lock table for waiting. We could implement a new function for this logic,

To be honest, I'm fine with v3, since it only increases the sleep time
after 5000 loop iterations, which has negligible performance impact.
But if these functions aren't intended to be used during recovery and
the loop shouldn't reach that many iterations, I'm also okay with
the v4 approach of introducing a separate function specifically for recovery.

But this amakes me wonder if we should add something like
Assert(!RecoveryInProgress()) to those two functions, to prevent them
from being used during recovery in the future.

but it's hard to find a proper place for it to fit well: lmgr.c is for locks, 
while standby.c or procarray.c are not that ideal either. Therefore, I placed 
the special handling for standby in SnapBuildWaitSnapshot.

Since the purpose and logic of the new function are similar to
XactLockTableWait(), I think it would be better to place it nearby
in lmgr.c, even though it doesn't handle a lock directly. That would
help keep the related logic together and improve readability.

Regards,

--
Fujii Masao
NTT DATA Japan Corporation



Reply via email to