Hi,

On 2023-02-18 18:00:00 +0300, Alexander Lakhin wrote:
> 18.02.2023 04:06, Andres Freund wrote:
> > On 2023-02-18 13:27:04 +1300, Thomas Munro wrote:
> > How can a process that we did notify crashing, that has already executed
> > SQL statements, end up in MarkPostmasterChildActive()?
>
> Maybe it's just the backend started for the money test has got
> the same PID (5948) that the backend for the name test had?

I somehow mashed name and money into one test in my head... So forget what I
wrote.

That doesn't really explain the assertion though.

It's too bad that we didn't use log_connections/log_disconnections. If nothing
else, it makes it a lot easier to identify problems like that. We actually do
try to configure it for CI, but it currently doesn't work for pg_regress style
tests with meson. Need to fix that. Starting a thread.

One thing that made me very suspicious when reading related code is this
remark:

bool
ReleasePostmasterChildSlot(int slot)
...
    /*
     * Note: the slot state might already be unused, because the logic in
     * postmaster.c is such that this might get called twice when a child
     * crashes.  So we don't try to Assert anything about the state.
     */

That seems fragile, and potentially racy. What if we somehow end up starting
another backend in between the two ReleasePostmasterChildSlot() calls? Then we
can end up marking a slot that, newly, has a process associated with it, as
inactive. Once the slot has been released the first time, it can be assigned
again.

ISTM that it's not a good idea that we use PM_CHILD_UNUSED to signal both that
a slot has not been used yet and that it's not in use anymore. I think that
makes it quite a bit harder to find state management issues.

> A simple script that I've found [1] shows that the pids reused rather often
> (for me, approximately each 300 process starts in Windows 10 H2), but maybe
> under some circumstances (many concurrent processes?) PIDs can coincide even
> so often to trigger that behavior.

It's definitely very aggressive in reusing pids - and it seems to
intentionally do work to keep pids small.

I wonder if it'd be worth trying to exercise this path aggressively by
configuring a very low max pid on linux, in an EXEC_BACKEND build.

Greetings,

Andres Freund
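
PS: To make the sequence I'm worried about concrete, here is a minimal
standalone sketch. It is a toy model, not the pmsignal.c code: the SlotState
enum, the two-slot array and all function names are made up for illustration,
and it is single-threaded, replaying one suspected interleaving by hand rather
than demonstrating an actual race.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the slot states; not the real definitions. */
typedef enum { SLOT_UNUSED, SLOT_ASSIGNED, SLOT_ACTIVE } SlotState;

#define NUM_SLOTS 2
static SlotState slots[NUM_SLOTS];

/* Hand out the first unused slot, or -1 if none is free. */
static int
assign_slot(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
    {
        if (slots[i] == SLOT_UNUSED)
        {
            slots[i] = SLOT_ASSIGNED;
            return i;
        }
    }
    return -1;
}

/* A child marks its slot active; only valid from the assigned state. */
static bool
mark_active(int slot)
{
    if (slots[slot] != SLOT_ASSIGNED)
        return false;
    slots[slot] = SLOT_ACTIVE;
    return true;
}

/*
 * Lenient release: checks nothing about the current state, so it can be
 * called twice for the same crash. Returns whether the slot was active.
 */
static bool
release_slot(int slot)
{
    bool was_active = (slots[slot] == SLOT_ACTIVE);

    slots[slot] = SLOT_UNUSED;
    return was_active;
}

/*
 * Replay: old child crashes, first release, slot is reused by a new child,
 * then the second release for the old child arrives. Returns true if the
 * new child can no longer move its slot from assigned to active.
 */
bool
second_release_clobbers_new_child(void)
{
    int old_child;
    int new_child;

    for (int i = 0; i < NUM_SLOTS; i++)
        slots[i] = SLOT_UNUSED;

    old_child = assign_slot();
    assert(old_child >= 0);
    mark_active(old_child);

    release_slot(old_child);    /* first release after the crash */
    new_child = assign_slot();  /* slot is handed out again right away */
    release_slot(old_child);    /* second release for the same crash */

    return new_child == old_child && !mark_active(new_child);
}
```

In this model second_release_clobbers_new_child() returns true: the second
release resets the reused slot to unused, so the new child no longer finds the
assigned state it expects when it tries to mark itself active.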