On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
> <jonat...@titanous.com> wrote:
> > This issue happened again in production, here are the stack traces from
> > three we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you. These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock. My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself. After
> emitting that error you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom. While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again. That's a horrible failure mode, as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked. I will have another go at trying to reproduce
> it here today.
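
If I'm following the failure mode correctly, the trap is raising an error while still holding the very lock that the error-cleanup path itself needs. Here is a minimal standalone sketch of that pattern -- illustrative C only, with a pthread mutex standing in for DynamicSharedMemoryControlLock and made-up function names (cleanup_path, error_while_holding_lock); none of this is the actual dsm.c code:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t control_lock;    /* stands in for DynamicSharedMemoryControlLock */

/* Stands in for dsm_detach() running from the shmem_exit() cleanup path. */
static void
cleanup_path(void)
{
    fprintf(stderr, "cleanup: reacquiring control lock...\n");
    pthread_mutex_lock(&control_lock);  /* hangs forever: this thread already holds it */
    /* ... detach bookkeeping would happen here ... */
    pthread_mutex_unlock(&control_lock);
}

/* Stands in for the code path that raised "cannot unpin a segment ...". */
static void
error_while_holding_lock(void)
{
    pthread_mutex_lock(&control_lock);
    fprintf(stderr, "FATAL: cannot unpin a segment that is not pinned\n");
    /* The error path bails out without releasing the lock ... */
    cleanup_path();                     /* ... so its own cleanup self-deadlocks. */
}

int
main(void)
{
    pthread_mutexattr_t attr;

    /* A NORMAL (non-recursive, non-errorcheck) mutex is guaranteed to deadlock on relock. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
    pthread_mutex_init(&control_lock, &attr);

    error_while_holding_lock();
    return EXIT_SUCCESS;                /* never reached */
}

Compiled with cc -pthread, it prints the FATAL line and then hangs inside cleanup_path(), which matches what we saw here: every other backend piles up behind the stuck lock holder, which is why the hung sessions above are all waiting in LWLockAcquire() called from dsm_create(). Releasing the lock before raising the error, as you describe, would at least keep that cleanup path from self-deadlocking.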
Thanks for the update! We have turned off parallel queries for now by setting max_parallel_workers_per_gather = 0, since the production impact of this bug is unfortunately severe.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan