On Mon, May 25, 2015 at 10:05 PM, Andres Freund <and...@anarazel.de> wrote:
> Hm. So we have a *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant. Suppose some kind of barrier operation is in progress,
and we've acquired the dummy spinlock but not yet released it. Just
then, we receive a signal. Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it. Oops.

> Robert: IIRC there was some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rh...@postgresql.org>
Date:   Sat Oct 4 21:25:41 2014 -0400

    Eliminate one background-worker-related flag variable.

    Teach sigusr1_handler() to use the same test for whether a worker
    might need to be started as ServerLoop().  Aside from being perhaps
    a bit simpler, this prevents a potentially-unbounded delay when
    starting a background worker.  On some platforms, select() doesn't
    return when interrupted by a signal, but is instead restarted,
    including a reset of the timeout to the originally-requested value.
    If signals arrive often enough, but no connection requests arrive,
    sigusr1_handler() will be executed repeatedly, but the body of
    ServerLoop() won't be reached.  This change ensures that, even in
    that case, background workers will eventually get launched.

    This is far from a perfect fix; really, we need select() to return
    control to ServerLoop() after an interrupt, either via the self-pipe
    trick or some other mechanism.  But that's going to require more
    work and discussion, so let's do this for now to at least mitigate
    the damage.

    Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the postmaster.
To really make this work properly, we need to be able to use latches in
the postmaster, and we need to generalize WaitLatchOrSocket so that it
can wait for a latch or any of n sockets. Then ServerLoop can use that
instead of calling select directly. This will probably look a lot like
what you did to get rid of ImmediateInterruptOK. But all of that seems
unrelated to the current problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company