On Wed, Sep 14, 2016 at 12:06 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> At -j 10 -c 10, all else the same, I get 84928 TPS on HEAD and 90357
>> with the patch, so about 6% better.
>
> And at -j 1 -c 1, I get 22390 and 24040 TPS, or about 7% better with
> the patch. So what I am seeing on OS X isn't contention of any sort,
> but just a straight speedup that's independent of the number of clients
> (at least up to 10). Probably this represents less setup/teardown cost
> for kqueue() waits than poll() waits.

Thanks for running all these tests. I hadn't considered OS X performance.
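To make that setup/teardown point concrete, here is a rough standalone sketch -- not the actual WaitEventSet code, just an illustration of the two API shapes with error handling omitted: poll() is handed the whole pollfd array on every call, so the kernel re-establishes and tears down its wait state each time, whereas with kqueue() the fd is registered once with EV_ADD and every later wait is a bare kevent() call with an empty changelist.

/*
 * Hand-wavy illustration only: compare per-wait work for poll() vs. a
 * persistent kqueue on a single pipe fd (builds on OS X and FreeBSD).
 */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return 1;
    (void) write(fds[1], "x", 1);   /* make the read end ready */

    /* poll() style: the fd set is passed in and scanned on every wait */
    for (int i = 0; i < 3; i++)
    {
        struct pollfd pfd = {.fd = fds[0], .events = POLLIN};

        if (poll(&pfd, 1, 1000) > 0)
            printf("poll: pipe readable\n");
    }

    /* kqueue style: register the fd once ... */
    int kq = kqueue();
    struct kevent ev;

    EV_SET(&ev, fds[0], EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &ev, 1, NULL, 0, NULL);

    /* ... then each wait is just a kevent() call, no changelist */
    for (int i = 0; i < 3; i++)
    {
        struct kevent hit;
        struct timespec timeout = {1, 0};

        if (kevent(kq, NULL, 0, &hit, 1, &timeout) > 0)
            printf("kevent: pipe readable\n");
    }

    close(kq);
    close(fds[0]);
    close(fds[1]);
    return 0;
}

(The real cost difference is inside the kernel, of course; this just shows why a kqueue wait has less to re-establish on each call.)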
> So you could spin this as "FreeBSD's poll() implementation is better than
> OS X's", or as "FreeBSD's kqueue() implementation is worse than OS X's",
> but either way I do not think we're seeing the same issue that was
> originally reported against Linux, where there was no visible problem at
> all till you got to a couple dozen clients, cf
>
> https://www.postgresql.org/message-id/CAB-SwXbPmfpgL6N4Ro4BbGyqXEqqzx56intHHBCfvpbFUx1DNA%40mail.gmail.com
>
> I'm inclined to think the kqueue patch is worth applying just on the
> grounds that it makes things better on OS X and doesn't seem to hurt
> on FreeBSD. Whether anyone would ever get to the point of seeing
> intra-kernel contention on these platforms is hard to predict, but
> we'd be ahead of the curve if so.

I was originally thinking of this as simply the obvious missing implementation of Andres's WaitEventSet API, one that would surely pay off later as we do more with that API (asynchronous execution with many remote nodes for sharding, built-in connection pooling/admission control for large numbers of sockets?, ...). I wasn't really expecting it to show performance increases in simple one- or two-pipe/socket cases on small core count machines, so it's interesting that it clearly does on OS X.

> It would be good for someone else to reproduce my results though.
> For one thing, 5%-ish is not that far above the noise level; maybe
> what I'm measuring here is just good luck from relocation of critical
> loops into more cache-line-friendly locations.

I see similar results here on a 4-core 2.2GHz Core i7 MacBook Pro running OS X 10.11.5. With default settings except fsync = off, I ran pgbench -i -s 100 and then took the median of three runs of pgbench -T 60 -j 4 -c 4 -M prepared -S. I used two different compilers, in case that helps separate out random instruction cache effects, and got the following numbers:

Apple clang 703.0.31:    51654 TPS -> 55739 TPS = 7.9% improvement
GCC 6.1.0 from MacPorts: 52552 TPS -> 55143 TPS = 4.9% improvement

I also reran the tests under FreeBSD 10.3 on a 4-core laptop and again saw absolutely no measurable difference at 1, 4 or 24 clients. Maybe a big enough server could be made to contend on the postmaster pipe's selinfo->si_mtx, in selrecord() as called from pipe_poll() -- perhaps that would be directly equivalent to what happened on multi-socket Linux with poll(), but I don't know.

--
Thomas Munro
http://www.enterprisedb.com