Robert Haas <robertmh...@gmail.com> writes:
> On Wed, May 5, 2021 at 3:46 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Admittedly, it seems unlikely that the difference could exceed
>> MAX_PARALLEL_WORKER_LIMIT = 1024 in a regression test run where
>> the limit on number of parallel workers is only 8. What I think is
>> more likely, given that these counters are unsigned, is that the
>> difference was actually negative. Which could be a bug, or it could
>> be an expectable race condition, or it could just be some flakiness
>> on lorikeet's part (that machine has had a lot of issues lately).
> I think that assertion was added by me, and I think the thought
> process was that the value shouldn't go negative and that if it does
> it's probably a bug which we might want to fix. But since the values
> are unsigned I could hardly check for < 0, so I did it this way
> instead.

> But since there's no memory barrier between the two loads, I guess
> there's no guarantee that they have the expected relationship, even if
> there is a memory barrier on the store side. I wonder if it's worth
> trying to tighten that up so that the assertion is more meaningful, or
> just give up and rip it out. I'm afraid that if we do have (or
> develop) bugs in this area, someone will discover that the effective
> max_parallel_workers value on their system slowly drifts up or down
> from the configured value, and we'll have no clue where things are
> going wrong. The assertion was intended to give us a chance of
> noticing that sort of problem in the buildfarm or on a developer's
> machine before the code gets out into the real world.

I follow your concern, but I'm not convinced that this assertion is a
useful aid: first, the asynchrony involved makes the edge cases rather
squishy; second, allowing 1024 bogus increments before complaining
likely means that developer test runs won't last long enough to trigger
the assertion; and third, if it does fire, it's too far removed from
the perpetrator to be much help in figuring out what went wrong, or
even whether anything *is* wrong.

I've not tried to trace the code, but I'm now a bit suspicious that
there is indeed a design bug here. I gather from the comments that
parallel_register_count is incremented by the worker processes, which
of course implies that a worker that fails to reattach to shared memory
won't do that. But parallel_terminate_count is incremented by the
postmaster. If the postmaster will do that even in the case of a worker
that failed at startup, then lorikeet's symptoms are neatly explained.

I'd be more comfortable with this code if the increments and decrements
were handled by the same process.

			regards, tom lane
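PS: For anyone who wants to see the wraparound concretely, here is a
standalone sketch. It is not the actual bgworker.c code; the counter
names and the 1024 limit are simply borrowed from the discussion above,
and the snapshot values are made up to model a worker that died before
registering itself while the postmaster still counted its exit.

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PARALLEL_WORKER_LIMIT 1024

int
main(void)
{
	/* Hypothetical snapshot: one more termination than registration. */
	uint32_t	parallel_register_count = 7;
	uint32_t	parallel_terminate_count = 8;

	/* Unsigned subtraction wraps around: 7 - 8 == UINT32_MAX, not -1. */
	uint32_t	active = parallel_register_count - parallel_terminate_count;

	printf("apparent active workers: %" PRIu32 "\n", active);

	/*
	 * So a check of this shape trips on a deficit of just one worker,
	 * even though the nominal allowance is 1024 "extra" registrations.
	 */
	assert(active <= MAX_PARALLEL_WORKER_LIMIT);
	return 0;
}

With assertions enabled, the printf reports 4294967295 and the assert
fires, which is the "difference was actually negative" scenario rather
than anything close to 1024 genuinely leaked workers.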