On Mon, Feb 08, 2016 at 10:55:24PM -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> >> We've seen variants
> >> on this theme on half a dozen machines just in the past week --- and it
> >> seems to mostly happen in 9.5 and HEAD, which is fishy.
>
> > It has been affecting only the four AIX animals, which do share hardware.
> > (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
>
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

Oops; I did not consider Monday's results before asserting that.

> > That's reasonable.  If you would like higher-fidelity data, I can run loops of
> > "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
> > that for HEAD and 9.2 simultaneously.  A day of logs from that should show
> > clearly if HEAD is systematically worse than 9.2.
>
> That sounds like a fine plan, please do it.

Log files:

HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
92: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

While I didn't study those logs in detail, a few things jumped out.  Since
9.2, we've raised the default shared_buffers from 32MB to 128MB, and we've
replaced checkpoint_segments=3 with max_wal_size=1GB.  Both changes encourage
bulkier checkpoints.  The 9.2 test runs get one xlog-driven checkpoint before
the shutdown checkpoint, while HEAD gets one time-driven checkpoint.  Also,
the HEAD suite just tests more things.  Here's pg_stat_bgwriter afterward:

HEAD:
checkpoints_timed     | 156
checkpoints_req       | 799
checkpoint_write_time | 16035847
checkpoint_sync_time  | 6555396
buffers_checkpoint    | 744131
buffers_clean         | 0
maxwritten_clean      | 0
buffers_backend       | 3023444
buffers_backend_fsync | 0
buffers_alloc         | 1777010
stats_reset           | 2016-02-08 21:04:24.499607-08

9.2:
checkpoints_timed     | 39
checkpoints_req       | 1369
checkpoint_write_time | 14875776
checkpoint_sync_time  | 8397536
buffers_checkpoint    | 396272
buffers_clean         | 466392
maxwritten_clean      | 1336
buffers_backend       | 1961531
buffers_backend_fsync | 0
buffers_alloc         | 1681324
stats_reset           | 2016-02-08 21:09:21.925487-08

Most notable there is the lack of bgwriter help in HEAD.  The clusters had
initdb-default configuration apart from these additions:

listen_addresses=''
log_line_prefix = '%p %m '
logging_collector = on
log_autovacuum_min_duration = 0
log_checkpoints = on
log_lock_waits = on
log_temp_files = 128kB

I should have added fsync=off, too.
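
To put rough numbers on the pg_stat_bgwriter comparison above, here is a small sketch that derives two ratios from the quoted counters: average buffers written per checkpoint, and the share of buffer writes done by the checkpointer, the bgwriter, and the backends.  The counters are copied from the output above; the STATS and summarize names are mine, for illustration only.  Note that the per-checkpoint figure is an average over every checkpoint in the run, shutdown checkpoints included, so treat it as indicative rather than exact.

```python
# Counters copied from the pg_stat_bgwriter output quoted above.
STATS = {
    "HEAD": {
        "checkpoints_timed": 156,
        "checkpoints_req": 799,
        "buffers_checkpoint": 744131,
        "buffers_clean": 0,
        "buffers_backend": 3023444,
    },
    "9.2": {
        "checkpoints_timed": 39,
        "checkpoints_req": 1369,
        "buffers_checkpoint": 396272,
        "buffers_clean": 466392,
        "buffers_backend": 1961531,
    },
}


def summarize(s):
    """Derive per-checkpoint and per-writer ratios from one set of counters."""
    checkpoints = s["checkpoints_timed"] + s["checkpoints_req"]
    # Total buffers written, by whichever process ended up writing them.
    written = s["buffers_checkpoint"] + s["buffers_clean"] + s["buffers_backend"]
    return {
        "checkpoints": checkpoints,
        "buffers_per_checkpoint": s["buffers_checkpoint"] / checkpoints,
        "checkpoint_share": s["buffers_checkpoint"] / written,
        "bgwriter_share": s["buffers_clean"] / written,
        "backend_share": s["buffers_backend"] / written,
    }


if __name__ == "__main__":
    for branch, counters in STATS.items():
        r = summarize(counters)
        print(
            f"{branch}: {r['checkpoints']} checkpoints, "
            f"{r['buffers_per_checkpoint']:.0f} buffers/checkpoint, "
            f"checkpointer {r['checkpoint_share']:.1%}, "
            f"bgwriter {r['bgwriter_share']:.1%}, "
            f"backends {r['backend_share']:.1%}"
        )
```

By this measure HEAD writes well over twice as many buffers per checkpoint as 9.2, and its bgwriter share is zero, which is consistent with both observations above.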

Notice how the AIX animals failed left and right today, likely thanks to
contention from these runs.

On Tue, Feb 09, 2016 at 02:10:50PM -0500, Tom Lane wrote:
> (1) Slow file system, specifically slow unlink, is the core of the
> problem.  (I wonder if the AIX critters are using an NFS filesystem?)

The buildfarm files lie on 600 GiB SAS disks.  I suspect metadata operations,
like unlink(), bottleneck on the jfs2 journal.

On Tue, Feb 09, 2016 at 03:05:01PM -0500, Tom Lane wrote:
> I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
> *removing* the two existing hacks in run_build.pl that try to force -t 120.
>
> The only real argument I can see against that approach is that we'd have
> to back-patch the PGCTLTIMEOUT patch to all active branches if we want
> to stop the buildfarm failures.  We don't usually back-patch feature
> additions.  On the other hand, this wouldn't be the first time we've
> back-patched something on grounds of helping the buildfarm, so I find
> that argument pretty weak.

If I were a purist against back-patching features, I might name the variable
PGINTERNAL_TEST_PGCTLTIMEOUT and not document it.  Meh to that.  I'll plan to
commit the original tomorrow.