> On Mar 4, 2026, at 07:37, Masahiko Sawada <[email protected]> wrote:
> 
> On Tue, Mar 3, 2026 at 2:49 PM Chao Li <[email protected]> wrote:
>> 
>> 
>> 
>>> On Mar 4, 2026, at 04:48, Masahiko Sawada <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> On Wed, Feb 11, 2026 at 5:07 PM Chao Li <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Feb 11, 2026, at 11:19, Tom Lane <[email protected]> wrote:
>>>>> 
>>>>> Chao Li <[email protected]> writes:
>>>>>> As $SUBJECT says, my understanding is no.
>>>>> 
>>>>> It's not a great idea, but I'm not sure it's fatal.  There are places
>>>>> that hold LWLocks for awhile.
>>>>> 
>>>>>> I think LWLocks are generally only held for a very short duration,
>>>>>> like a few CPU cycles to read or modify some shared data,
>>>>> 
>>>>> Spinlocks are treated that way, but we're willing to hold LWLocks
>>>>> longer.  The main thing I'd be concerned about is that there is no
>>>>> deadlock-detection infrastructure for LWLocks, so you'd better be
>>>>> darn certain there is no possibility of deadlock.  That usually
>>>>> means you want to limit the extent of code that could run while
>>>>> you're holding the lock.
>>>>> 
>>>>> In your specific example, the thing I'd be afraid of is that an
>>>>> errcontext callback might do something you're not expecting.
>>>>> We don't forbid errcontext callbacks from doing catalog lookups,
>>>>> for instance.  So on the whole I agree with this patch, with
>>>>> or without any concrete example that fails.
>>>>> 
>>>>> regards, tom lane
>>>> 
>>>> As Tom has agreed with this patch, added to CF for tracking: 
>>>> https://commitfest.postgresql.org/patch/6477/
>>>> 
>>> 
>>> While I agree with the concern Tom pointed out, I found that there
>>> are other places where we do ereport() while holding an LWLock. For
>>> instance:
>>> 
>>> src/backend/access/transam/multixact.c:
>>>   else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
>>>   {
>>>       ereport(LOG,
>>>               (errmsg("cannot truncate up to MultiXact %u because it does not exist on disk, skipping truncation",
>>>                       newOldestMulti)));
>>>       LWLockRelease(MultiXactTruncationLock);
>>>       return;
>>>   }
>>> 
>>> src/backend/commands/vacuum.c:
>>>   if (frozenAlreadyWrapped)
>>>   {
>>>       ereport(WARNING,
>>>               (errmsg("some databases have not been vacuumed in over 2 billion transactions"),
>>>                errdetail("You might have already suffered transaction-wraparound data loss.")));
>>>       LWLockRelease(WrapLimitsVacuumLock);
>>>       return;
>>>   }
>>> 
>>> src/backend/postmaster/autovacuum.c:
>>>               LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
>>> 
>>>               /*
>>>                * No other process can put a worker in starting mode, so if
>>>                * startingWorker is still INVALID after exchanging our lock,
>>>                * we assume it's the same one we saw above (so we don't
>>>                * recheck the launch time).
>>>                */
>>>               if (AutoVacuumShmem->av_startingWorker != NULL)
>>>               {
>>>                   worker = AutoVacuumShmem->av_startingWorker;
>>>                   worker->wi_dboid = InvalidOid;
>>>                   worker->wi_tableoid = InvalidOid;
>>>                   worker->wi_sharedrel = false;
>>>                   worker->wi_proc = NULL;
>>>                   worker->wi_launchtime = 0;
>>>                   dclist_push_head(&AutoVacuumShmem->av_freeWorkers,
>>>                                    &worker->wi_links);
>>>                   AutoVacuumShmem->av_startingWorker = NULL;
>>>                   ereport(WARNING,
>>>                           errmsg("autovacuum worker took too long to start; canceled"));
>>>               }
>>> 
>>> Should we fix these places as well?
>> 
>> Maybe. If needed, I can take a look at them one by one.
>> 
>>> 
>>> Also, if we reverse the ereport() and LWLockRelease() in the specific
>>> example in logicalctl.c, a concurrent logical decoding activation
>>> could write the log "logical decoding is enabled upon creating a new
>>> logical replication slot" before the deactivation log "logical
>>> decoding is disabled because there are no valid logical replication
>>> slots", confusing users since logical decoding would actually be
>>> active even though the last log message says it is disabled.
>>> 
>> 
>> That sounds reasonable. However, looking at the current code, the “enabling” 
>> log is printed after releasing the lock:
>> ```
>>        LWLockRelease(LogicalDecodingControlLock);
>> 
>>        END_CRIT_SECTION();
>> 
>>        if (!in_recovery)
>>                ereport(LOG,
>>                                errmsg("logical decoding is enabled upon 
>> creating a new logical replication slot"));
>> ```
>> 
>> So the log order is not currently protected by the lock. If we really want 
>> to ensure the ordering between these two messages, we might instead need to 
>> move the “enabling” log before releasing the lock.
> 
> No, IIUC activation happens while a valid logical replication slot is
> held. Since deactivation requires that no slots are present, there's
> no risk of a concurrent deactivation occurring between the
> LWLockRelease() and the ereport().
> 
>> I wonder what you think would be the best way to proceed here.
> 
> Thinking more about this: while the fix might be applicable elsewhere,
> it seems unnecessary here. The deactivation process is restricted to
> the startup and checkpointer processes. Given that these processes
> don't access system catalogs, for example, I think there is little to
> no risk of a deadlock occurring even if we do something in an
> errcontext callback.
> 
> Regards,
> 
> -- 
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

Thanks for the explanations. In that case, I will withdraw this patch.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/