On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>>> only got 8 cores) but other parameters matching Mithun's example,
>>> I just got
>
>> It's *really* common to have more actual clients than cpus for oltp
>> workloads, so I don't think it's insane to test with more clients.
>
> I finished a set of runs using similar parameters to Mithun's test except
> for using 8 clients, and another set using 72 clients (but, being
> impatient, 5-minute runtime) just to verify that the results wouldn't
> be markedly different.  I got TPS numbers like this:
>
>                         8 clients       72 clients
>
> unmodified HEAD         16112           16284
> with padding patch      16096           16283
> with SysV semas         15926           16064
> with padding+SysV       15949           16085
>
> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
> 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
> effects, although no doubt less strongly than Mithun's machine.
>
> I would like to see some other results with a newer kernel.  I tried to
> repeat this test on a laptop running Fedora 28, but soon concluded that
> anything beyond very short runs was mainly going to tell me about thermal
> throttling :-(.  I could possibly get repeatable numbers from, say,
> 1-minute SELECT-only runs, but that would be a different test scenario,
> likely one with a lot less lock contention.
I did some testing on 2-node, 4-node and 8-node systems running Linux
3.10.something (slightly newer but still ancient).  Only the 8-node box
(= same one Mithun used) shows the large effect (the 2-node box may be a
tiny bit faster patched, but I'm calling that noise for now... it's not
slower, anyway).

On the problematic box, I also tried some different strides
(char padding[N - sizeof(sem_t)]; see the sketch at the end of this
message) and was surprised by the result:

Unpatched        = ~35k TPS
64 byte stride   = ~35k TPS
128 byte stride  = ~42k TPS
4096 byte stride = ~47k TPS

Huh.  PG_CACHE_LINE_SIZE is 128, but the true cache line size on this
system is 64 bytes.  That exaggeration turned out to do something useful,
though I can't explain it.  While looking for discussion of 128 byte
cache effects I came across the Intel "L2 adjacent cache line
prefetcher"[1].  Maybe this, or some of the other prefetchers (enabled in
the BIOS) or related stuff could be at work here.  It could be
microarchitecture-dependent (this is an old Westmere box), though I found
a fairly recent discussion about a similar effect[2] that mentions more
recent hardware.  The spatial prefetcher reference can be found in the
Optimization Manual[3].

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
[2] https://groups.google.com/forum/#!msg/mechanical-sympathy/i3-M2uCYTJE/P7vyoOTIAgAJ
[3] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

--
Thomas Munro
http://www.enterprisedb.com
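For anyone who wants to picture the layout being benchmarked, here is a
minimal sketch of what "stride N" means above.  It is not PostgreSQL's
actual semaphore code; the names PaddedSem, SEM_STRIDE and NUM_SEMS are
made up for illustration, and it assumes SEM_STRIDE > sizeof(sem_t):

    /*
     * Sketch only: give each POSIX semaphore its own SEM_STRIDE-byte slot so
     * that neighbouring semaphores don't share a cache line (or an
     * adjacent-line prefetch pair).  SEM_STRIDE stands in for the N values
     * tested above (64, 128, 4096).
     */
    #include <semaphore.h>

    #define SEM_STRIDE 128      /* hypothetical stride under test */
    #define NUM_SEMS   128      /* hypothetical count, e.g. one per backend */

    typedef struct PaddedSem
    {
        sem_t       sem;
        char        padding[SEM_STRIDE - sizeof(sem_t)];
    } PaddedSem;

    /* An array of these, e.g. placed in shared memory. */
    static PaddedSem sems[NUM_SEMS];

With SEM_STRIDE = 64 each semaphore gets a cache line to itself on this
box; 128 additionally keeps adjacent-line prefetch pairs separate, and
4096 puts each one on its own page.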