I wrote: > Noah Misch <n...@leadboat.com> writes: >> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote: >>> We've seen variants >>> on this theme on half a dozen machines just in the past week --- and it >>> seems to mostly happen in 9.5 and HEAD, which is fishy.
>> It has been affecting only the four AIX animals, which do share hardware. >> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.) > Certainly your AIX critters have shown this a bunch, but here's another > current example: > http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23 While we've not seen an actual test failure since I installed the tracing code, we do have reports from the four AIX critters, and they are pretty instructive. See for example http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2 which is Snapshot: 2016-02-09 16:27:05 waiting for server to shut down at 2016-02-09 18:04:09.159 UTC................................................................................ done server stopped =========== db log file ========== 2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG: received fast shutdown request at 2016-02-09 18:04:09.159 UTC 2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG: aborting any active transactions 2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG: autovacuum launcher shutting down at 2016-02-09 18:04:09.160 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG: shutting down at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG: CheckPointGuts starting at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG: CheckPointCLOG() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG: CheckPointCommitTs() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG: CheckPointSUBTRANS() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG: CheckPointMultiXact() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG: CheckPointPredicate() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG: CheckPointRelationMap() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG: CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258 UTC 2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG: CheckPointSnapBuild() done at 2016-02-09 18:04:09.259 UTC 2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG: CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259 UTC 2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG: BufferSync(5) beginning to write 632 buffers at 2016-02-09 18:04:09.259 UTC 2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG: BufferSync(5) done, wrote 632/632 buffers at 2016-02-09 18:04:09.966 UTC 2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG: CheckPointBuffers() done at 2016-02-09 18:04:09.966 UTC 2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG: CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966 UTC 2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG: CheckPointGuts done at 2016-02-09 18:04:09.966 UTC 2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG: checkpoint WAL record flushed at 2016-02-09 18:04:09.967 UTC 2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG: pg_control updated at 2016-02-09 18:04:09.967 UTC 2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG: in mdpostckpt, 3072 unlinks remain to do at 2016-02-09 18:04:29.375 UTC 2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG: in mdpostckpt, 2048 unlinks remain to do at 2016-02-09 18:04:48.207 UTC 2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG: in mdpostckpt, 1024 unlinks remain to do at 2016-02-09 18:05:07.381 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG: smgrpostckpt() done at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG: RemoveOldXlogFiles() done at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG: TruncateSUBTRANS() done at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG: shutdown checkpoint complete at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG: ShutdownCLOG() complete at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG: ShutdownCommitTs() complete at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG: ShutdownSUBTRANS() complete at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG: database system is shut down at 2016-02-09 18:05:25.764 UTC 2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG: lock files all released at 2016-02-09 18:05:26.165 UTC This says that the bulk of the shutdown time is being spent in mdpostckpt, where we are finally unlinking files belonging to relations that were dropped since the last checkpoint. That took 75 seconds here, while the parts before and after it took less than a second each. There were somewhere between 3000 and 4000 files to be unlinked, so it's not astonishing that it took some time, but our other critters seem to get through this a lot quicker. (All four AIX critters reported pretty similar results, BTW.) So my diagnosis at this point is (1) Slow file system, specifically slow unlink, is the core of the problem. (I wonder if the AIX critters are using an NFS filesystem?) (2) The apparent increase in failure rate in more recent branches is probably real, but the likely cause is simply more test cases = more stuff to be cleaned up at the end = mdpostckpt takes longer. I'm not sure whether there's anything to be gained by leaving the tracing code in there till we see actual buildfarm fails. There might be another slowdown mechanism somewhere, but I rather doubt it. Thoughts? In any case, we should proceed with fixing things so that buildfarm owners can specify a higher shutdown timeout for especially slow critters. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers