Excerpts from Tom Lane's message of Thu May 10 02:27:32 -0400 2012:
> Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> > I noticed while doing some tests that the checkpointer process does not
> > recover very nicely after a backend crashes under postmaster -T (after
> > all processes have been kill -CONTd, of course, and postmaster told to
> > shut down via Ctrl-C on its console).  For some reason it seems to get
> > stuck in a loop doing sleep(0.5s).  In another case I caught it trying
> > to do a checkpoint, but it was progressing a single page at a time and
> > then sleeping.  In that condition, the checkpoint took a very long time
> > to finish.
> 
> Is this still a problem as of HEAD?  I think I've fixed some issues in
> the checkpointer's outer loop logic, but not sure if what you saw is
> still there.

Yep, it's still there as far as I can tell.  A backtrace from the
checkpointer shows it's waiting on the latch.

It seems to me that the bug is in the postmaster state machine rather
than in the checkpointer itself.  After a few false starts, this seems
to fix it:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
                    signal_child(WalWriterPID, SIGTERM);
                if (BgWriterPID != 0)
                    signal_child(BgWriterPID, SIGTERM);
+               if (FatalError && CheckpointerPID != 0)
+                   signal_child(CheckpointerPID, SIGUSR2);
 
                /*
                 * If we're in recovery, we can't kill the startup process
@@ -2178,6 +2180,8 @@ pmdie(SIGNAL_ARGS)
                signal_child(WalReceiverPID, SIGTERM);
            if (BgWriterPID != 0)
                signal_child(BgWriterPID, SIGTERM);
+           if (FatalError && CheckpointerPID != 0)
+               signal_child(CheckpointerPID, SIGUSR2);
            if (pmState == PM_RECOVERY)
            {
                /* only checkpointer is active in this state */


Note that the checkpointer can still be running after we enter
FatalError only when the -T (send SIGSTOP instead of SIGQUIT) switch is
used, so this bug doesn't seem to affect normal usage.  (I'm not sure
SIGUSR2 is the most appropriate signal to send at this point; since
we're in FatalError, SIGQUIT is probably better suited.)

One good thing is that when I patched postmaster in a different way
(which I later realized to be bogus), I caused it to die with an
assertion failure while the checkpointer was still running; the debug
output let me know that the checkpointer went away immediately.

-- 
Álvaro Herrera <alvhe...@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
