Of late, by far the majority of the random-noise failures we see in the buildfarm have come from failure to shut down the postmaster in a reasonable timeframe. An example is this current failure on hornet:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down

=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots shows that pg_ctl was honoring that, so apparently the postmaster took more than two minutes to shut down.  Looking at other recent successful runs, the stopdb-C-1 step usually takes 30 to 40 seconds on this machine.  So it's possible the machine was simply so overloaded that this run took three times that long, but I'm starting to think there may be more to it than that.  We've seen variants on this theme on half a dozen machines just in the past week --- and it seems to mostly happen in 9.5 and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut down" indicates that the server was in ShutdownXLOG().  Normally the only component of that that takes much time is buffer flushing, but could something else be happening?  Or has the checkpoint code somehow not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only patch that makes ShutdownXLOG() and its subroutines much chattier about how far they've gotten and what time it is, and that also makes pg_ctl print out the current time if it gives up waiting.  A few failed runs with that in place should at least let us confirm or deny whether the shutdown checkpoint is sometimes just really slow, or whether there's a bug lurking.  Rough sketches of the sort of thing I have in mind are below.

Any objections?  Anybody have another idea for data to collect?

			regards, tom lane
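For illustration only (untested, and not meant as the actual patch), the backend side could be roughly a small helper in xlog.c that ShutdownXLOG() and its subroutines call at each step.  log_shutdown_progress() and shutdown_start_time are made-up names; elog(), GetCurrentTimestamp(), and TimestampDifference() are the existing backend facilities it would lean on.

    #include "postgres.h"

    #include "utils/timestamp.h"

    static TimestampTz shutdown_start_time;

    /*
     * Log how far the shutdown sequence has gotten and how much time has
     * elapsed since it began.
     */
    static void
    log_shutdown_progress(const char *phase)
    {
        long        secs;
        int         usecs;

        TimestampDifference(shutdown_start_time, GetCurrentTimestamp(),
                            &secs, &usecs);
        elog(LOG, "shutdown progress: %s, %ld.%03d s elapsed",
             phase, secs, usecs / 1000);
    }

    /*
     * Calls would be sprinkled through ShutdownXLOG() and its subroutines,
     * for instance:
     *
     *      shutdown_start_time = GetCurrentTimestamp();
     *      log_shutdown_progress("beginning shutdown checkpoint");
     *      ...
     *      log_shutdown_progress("shutdown checkpoint buffer sync done");
     */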
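And on the pg_ctl side, something like this (again just a sketch using plain libc, hooked in where pg_ctl currently prints "server does not shut down"; report_give_up_time() is a made-up name), so the client-side log can be correlated with the postmaster's own timestamps:

    #include <stdio.h>
    #include <time.h>

    /*
     * Print the current time along with the give-up message.
     */
    static void
    report_give_up_time(void)
    {
        time_t      now = time(NULL);
        char        timestr[64];

        strftime(timestr, sizeof(timestr), "%Y-%m-%d %H:%M:%S %Z",
                 localtime(&now));
        fprintf(stderr, "pg_ctl: server does not shut down (gave up at %s)\n",
                timestr);
    }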