Of late, by far the majority of the random-noise failures we see in the buildfarm have come from failure to shut down the postmaster in a reasonable timeframe. An example is this current failure on hornet:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down

=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots shows that pg_ctl was honoring that, so apparently the postmaster took more than two minutes to shut down.  Looking at other recent successful runs, the stopdb-C-1 step usually takes 30 to 40 seconds on this machine.  So it's possible the machine was simply so overloaded that this run took three times that long, but I'm starting to think there may be more to it than that.  We've seen variants on this theme on half a dozen machines just in the past week --- and it seems to mostly happen in 9.5 and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut down" indicates that the server was in ShutdownXLOG().  Normally the only component of that that takes much time is buffer flushing, but could something else be happening?  Or has the checkpoint code somehow not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only patch that makes ShutdownXLOG() and its subroutines much chattier about how far they've gotten and what time it is, and that also makes pg_ctl print out the current time if it gives up waiting.  A few failed runs with that in place should at least let us confirm or deny whether the shutdown checkpoint is sometimes just really slow, or whether there's a bug lurking.  Rough sketches of the sort of thing I have in mind are below.

Any objections?  Anybody have another idea for data to collect?

			regards, tom lane
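For illustration only (untested, and not meant as the actual patch), the backend side could be roughly a small helper in xlog.c that ShutdownXLOG() and its subroutines call at each step.  log_shutdown_progress() and shutdown_start_time are made-up names; elog(), GetCurrentTimestamp(), and TimestampDifference() are the existing backend facilities it would lean on.

    #include "postgres.h"

    #include "utils/timestamp.h"

    static TimestampTz shutdown_start_time;

    /*
     * Log how far the shutdown sequence has gotten and how much time has
     * elapsed since it began.
     */
    static void
    log_shutdown_progress(const char *phase)
    {
        long        secs;
        int         usecs;

        TimestampDifference(shutdown_start_time, GetCurrentTimestamp(),
                            &secs, &usecs);
        elog(LOG, "shutdown progress: %s, %ld.%03d s elapsed",
             phase, secs, usecs / 1000);
    }

    /*
     * Calls would be sprinkled through ShutdownXLOG() and its subroutines,
     * for instance:
     *
     *      shutdown_start_time = GetCurrentTimestamp();
     *      log_shutdown_progress("beginning shutdown checkpoint");
     *      ...
     *      log_shutdown_progress("shutdown checkpoint buffer sync done");
     */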
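And on the pg_ctl side, something like this (again just a sketch using plain libc, hooked in where pg_ctl currently prints "server does not shut down"; report_give_up_time() is a made-up name), so the client-side log can be correlated with the postmaster's own timestamps:

    #include <stdio.h>
    #include <time.h>

    /*
     * Print the current time along with the give-up message.
     */
    static void
    report_give_up_time(void)
    {
        time_t      now = time(NULL);
        char        timestr[64];

        strftime(timestr, sizeof(timestr), "%Y-%m-%d %H:%M:%S %Z",
                 localtime(&now));
        fprintf(stderr, "pg_ctl: server does not shut down (gave up at %s)\n",
                timestr);
    }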