Re: [HACKERS] problems on Solaris

Robert Haas Wed, 27 May 2015 12:45:48 -0700

On Mon, May 25, 2015 at 10:05 PM, Andres Freund <and...@anarazel.de> wrote:
> Hm. So we have an *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.


I think the problem might be that the spinlock-based memory barrier is
not re-entrant.  Suppose some kind of barrier operation is in progress,
and we've acquired the dummy spinlock but not yet released it.  Just
then, we receive a signal.  Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it.  Oops.
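
To make the sequence concrete, here is a minimal standalone sketch of
that interleaving.  It is not PostgreSQL code: dummy_spinlock,
try_acquire, release_lock, sigusr1_handler_sketch and
demo_reentrant_barrier are all hypothetical names, and the handler uses
a try-lock so that the demo can report the conflict instead of spinning
forever the way a real spinlock acquisition would.

```c
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the dummy spinlock behind the barrier. */
static atomic_flag dummy_spinlock = ATOMIC_FLAG_INIT;

/* Set by the handler if it finds the lock already held. */
static volatile sig_atomic_t handler_found_lock_held = 0;

/* Try-lock: returns 1 if we took the lock, 0 if it was already held. */
static int try_acquire(void)
{
    return !atomic_flag_test_and_set(&dummy_spinlock);
}

static void release_lock(void)
{
    atomic_flag_clear(&dummy_spinlock);
}

/* Stands in for a handler whose SetLatch-like path needs a barrier. */
static void sigusr1_handler_sketch(int signo)
{
    (void) signo;
    if (try_acquire())
        release_lock();              /* barrier completed normally */
    else
        handler_found_lock_held = 1; /* a real spinlock would spin here */
}

/* Returns 1 if the handler ran while the interrupted code held the lock. */
static int demo_reentrant_barrier(void)
{
    int got;

    handler_found_lock_held = 0;
    signal(SIGUSR1, sigusr1_handler_sketch);

    got = try_acquire();    /* barrier in progress: lock acquired ... */
    assert(got);
    (void) got;
    raise(SIGUSR1);         /* ... and the signal arrives before release */
    release_lock();

    return handler_found_lock_held;
}
```

Calling demo_reentrant_barrier() returns 1 here, because raise() runs
the handler before returning, i.e. while the lock is still held.  With a
blocking acquire in the handler, that same spot would never make
progress.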

> Robert: IIRC there were some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rh...@postgresql.org>
Date:   Sat Oct 4 21:25:41 2014 -0400

    Eliminate one background-worker-related flag variable.

    Teach sigusr1_handler() to use the same test for whether a worker
    might need to be started as ServerLoop().  Aside from being perhaps
    a bit simpler, this prevents a potentially-unbounded delay when
    starting a background worker.  On some platforms, select() doesn't
    return when interrupted by a signal, but is instead restarted,
    including a reset of the timeout to the originally-requested value.
    If signals arrive often enough, but no connection requests arrive,
    sigusr1_handler() will be executed repeatedly, but the body of
    ServerLoop() won't be reached.  This change ensures that, even in
    that case, background workers will eventually get launched.

    This is far from a perfect fix; really, we need select() to return
    control to ServerLoop() after an interrupt, either via the self-pipe
    trick or some other mechanism.  But that's going to require more
    work and discussion, so let's do this for now to at least mitigate
    the damage.

    Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the
postmaster.  To really make this work properly, we need to be able to
use latches in the postmaster, and we need to generalize
WaitLatchOrSocket so that it can wait for a latch or any of n sockets.
Then ServerLoop can use that instead of calling select directly.  This
will probably look a lot like what you did to get rid of
ImmediateInterruptOK.
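
As a rough illustration of the shape I have in mind, here is a
standalone sketch that waits for either a latch-like wakeup or any of n
descriptors, using the self-pipe trick.  None of this is existing
PostgreSQL API: latch_init, latch_set, wait_latch_or_sockets, MAX_SOCKS
and the WOKE_* codes are hypothetical names, and the sketch uses poll()
rather than select() purely for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

#define MAX_SOCKS     16     /* arbitrary limit for this sketch */
#define WOKE_LATCH    (-1)
#define WOKE_TIMEOUT  (-2)
#define WOKE_ERROR    (-3)

static int selfpipe[2] = {-1, -1};

/* Create the self-pipe; both ends non-blocking. Returns 0 on success. */
static int latch_init(void)
{
    if (pipe(selfpipe) != 0)
        return -1;
    for (int i = 0; i < 2; i++)
        if (fcntl(selfpipe[i], F_SETFL, O_NONBLOCK) != 0)
            return -1;
    return 0;
}

/* Safe to call from a signal handler: write() is async-signal-safe. */
static void latch_set(void)
{
    int save_errno = errno;

    if (write(selfpipe[1], "x", 1) < 0)
    {
        /* Pipe full: a wakeup is already pending, nothing to do. */
    }
    errno = save_errno;
}

/*
 * Wait until the latch is set or one of socks[0..nsocks-1] is readable.
 * Returns WOKE_LATCH, the index of a ready descriptor, WOKE_TIMEOUT,
 * or WOKE_ERROR.
 */
static int wait_latch_or_sockets(const int *socks, int nsocks, int timeout_ms)
{
    struct pollfd fds[MAX_SOCKS + 1];
    char buf[16];
    int rc;

    assert(selfpipe[0] >= 0);   /* latch_init() must have succeeded */
    if (nsocks < 0 || nsocks > MAX_SOCKS)
        return WOKE_ERROR;

    fds[0].fd = selfpipe[0];
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    for (int i = 0; i < nsocks; i++)
    {
        fds[i + 1].fd = socks[i];
        fds[i + 1].events = POLLIN;
        fds[i + 1].revents = 0;
    }

    /*
     * Retrying on EINTR restarts the full timeout, but a handler that
     * calls latch_set() leaves a byte in the pipe, so the retry returns
     * immediately instead of sleeping again.
     */
    do
        rc = poll(fds, (nfds_t) (nsocks + 1), timeout_ms);
    while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return WOKE_ERROR;
    if (rc == 0)
        return WOKE_TIMEOUT;

    if (fds[0].revents & POLLIN)
    {
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;                   /* drain so the latch is reset */
        return WOKE_LATCH;
    }
    for (int i = 0; i < nsocks; i++)
        if (fds[i + 1].revents & (POLLIN | POLLHUP | POLLERR))
            return i;
    return WOKE_ERROR;
}
```

The point of the design is that a signal handler only has to call
latch_set(); the wait loop then sees the wakeup even on platforms where
the underlying call is restarted after a signal.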

But all of that seems unrelated to the current problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


