On Mon, Dec 8, 2025 at 3:54 PM Zhijie Hou (Fujitsu) <[email protected]> wrote: > > On Monday, December 8, 2025 5:47 PM Amit Kapila <[email protected]> > wrote: > > > > On Mon, Dec 8, 2025 at 12:53 PM Masahiko Sawada > > <[email protected]> wrote: > > > > > > On Fri, Dec 5, 2025 at 4:10 AM Amit Kapila <[email protected]> > > wrote: > > > > > > > > On Thu, Dec 4, 2025 at 12:12 PM Zhijie Hou (Fujitsu) > > > > <[email protected]> wrote: > > > > > > > > > > Here are the updated patches for HEAD and 18. I did not add tests > > > > > since, after applying the patch and resolving the issue, the only > > > > > observable behavior is that the checkpoint will wait for another > > > > > backend to create a slot due to the lwlock lock, so it seems not > > > > > worth to test solely lwlock wait event (I could not find similar > > > > > tests). > > > > > > > > > > > > > Fair enough. The patch looks mostly good to me, attached are minor > > > > comment improvements atop the HEAD patch. I'll do some more testing > > > > before push. > > > > > > > > Sawada-san/Vitaly, do you have any opinion on patch or the direction > > > > to fix? The idea is to get this fixed for HEAD and 18, then continue > > > > discussion for other bank-branches and the remaining patches. > > > > > > +1 > > > > > > > Thanks, Pushed. I'll continue thinking on how to fix it in branches prior > > to 18 > > and other problems reported in this thread. > > Thanks for pushing. I thought about whether it's possible to apply a similar > fix > to back-branches and one approach could be to take > ReplicationSlotAllocationLock > at two places. E.g., acquire an exclusive lock WAL reservation, and a shared > lock during the minimum LSN calculation at checkpoints to serialize the > process. >
+ * + * Additionally, acquiring the Allocation lock to serialize the minimum LSN + * calculation with concurrent slot WAL reservation. This ensures that the + * WAL position being reserved is either included in the miminum LSN or is + * beyond or equal to the redo pointer of the current checkpoint (See + * ReplicationSlotReserveWal for details). */ + LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED); slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN(); + LWLockRelease(ReplicationSlotAllocationLock); Yeah, this will fix the reported issue but doesn't it look odd to take an unrelated lock here? I mean it appears that if someone has to call XLogGetReplicationSlotMinimumLSN(), they should acquire ReplicationSlotAllocationLock in LW_SHARED mode? If we want to go in this direction and don't have better ideas to fix then we should add comments suggesting this is a special case and shouldn't be used as an example for other places. The other idea to fix this problem is suggested by Alexander in his email [1] which is to introduce a new ReplicationSlotReserveWALLock for this purpose. I think introducing LWLock in back branches could be questionable. Did you evaluate the pros and cons of using that approach? Yet, another possibility is that we don't fix this in back branches prior to 18 but not sure how frequently it can impact users. Suyu, can you please tell how you found this problem in the first place? Is it via code-review or did you hit this in the production or while doing some related tests? BTW, I have asked a question regarding commit 2090edc6f32f652a2c in email [2]. Did you get a chance to look at that? [1] - https://www.postgresql.org/message-id/CAPpHfduZY7_pRCrbLdsLty4zP5x2EDmwk4CYiofiyjdt1iK%2BzA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1%2BwrNSee6PKQ0%2BDtUu_W0GdvewskpAEK5EiX6r3E%2B2Sxw%40mail.gmail.com -- With Regards, Amit Kapila.
