On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>>> only got 8 cores) but other parameters matching Mithun's example,
>>> I just got
>
>> It's *really* common to have more actual clients than cpus for oltp
>> workloads, so I don't think it's insane to test with more clients.
>
> I finished a set of runs using similar parameters to Mithun's test except
> for using 8 clients, and another set using 72 clients (but, being
> impatient, 5-minute runtime) just to verify that the results wouldn't
> be markedly different.  I got TPS numbers like this:
>
>                                 8 clients       72 clients
>
> unmodified HEAD                 16112           16284
> with padding patch              16096           16283
> with SysV semas                 15926           16064
> with padding+SysV               15949           16085
>
> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
> 4-core Intel E5-2609 (Sandy Bridge era).  This hardware does show NUMA
> effects, although no doubt less strongly than Mithun's machine.
>
> I would like to see some other results with a newer kernel.  I tried to
> repeat this test on a laptop running Fedora 28, but soon concluded that
> anything beyond very short runs was mainly going to tell me about thermal
> throttling :-(.  I could possibly get repeatable numbers from, say,
> 1-minute SELECT-only runs, but that would be a different test scenario,
> likely one with a lot less lock contention.

I did some testing on 2-node, 4-node and 8-node systems running Linux
3.10.something (slightly newer but still ancient).  Only the 8-node
box (= same one Mithun used) shows the large effect (the 2-node box
may be a tiny bit faster patched but I'm calling that noise for now...
it's not slower, anyway).

On the problematic box, I also tried some different strides (char
padding[N - sizeof(sem_t)]) and was surprised by the result:

Unpatched = ~35k TPS
64 byte stride = ~35k TPS
128 byte stride = ~42k TPS
4096 byte stride = ~47k TPS

Huh.  PG_CACHE_LINE_SIZE is 128, but the true cache line size on this
system is 64 bytes.  That exaggeration turned out to do something
useful, though I can't explain it.
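
To spell out what I mean by "stride", the experiment looks roughly
like this (a sketch only; PaddedSem and SEM_STRIDE are illustrative
names, not what's actually in the tree):

#include <semaphore.h>

#define SEM_STRIDE 128          /* tried 64, 128, 4096 */

typedef struct PaddedSem
{
    sem_t   sem;                /* the real POSIX semaphore */
    char    padding[SEM_STRIDE - sizeof(sem_t)];    /* pad out to stride */
} PaddedSem;

/*
 * The shared array then holds PaddedSem elements, so backend i touches
 * &sems[i].sem and neighbouring semaphores never share a cache line
 * (or, at 128+ bytes, an adjacent pair of lines).  The array base also
 * has to be aligned to SEM_STRIDE for the larger strides to mean much,
 * and of course SEM_STRIDE must exceed sizeof(sem_t) on the platform.
 */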

While looking for discussion of 128-byte cache effects I came across
the Intel "L2 adjacent cache line prefetcher"[1].  Maybe this, or some
of the other prefetchers (enabled in the BIOS) or related stuff could
be at work here.  It could be microarchitecture-dependent (this is an
old Westmere box), though I found a fairly recent discussion about a
similar effect[2] that mentions more recent hardware.  The spatial
prefetcher reference can be found in the Optimization Manual[3].
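
If anyone wants to check whether that prefetcher is actually active on
their machine, something like the following should work, assuming the
MSR 0x1A4 bit layout described in [1] applies to the CPU in question
(it's documented for Nehalem through Broadwell) and the msr kernel
module is loaded.  Untested sketch:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    uint64_t    val;
    int         fd = open("/dev/cpu/0/msr", O_RDONLY);

    if (fd < 0 || pread(fd, &val, sizeof(val), 0x1a4) != sizeof(val))
    {
        perror("read MSR 0x1A4 (run as root, modprobe msr first)");
        return 1;
    }

    /*
     * Per [1]: bit 0 = L2 HW prefetcher, bit 1 = L2 adjacent cache line
     * prefetcher, bit 2 = DCU prefetcher, bit 3 = DCU IP prefetcher.
     * A set bit means that prefetcher is disabled.
     */
    printf("L2 adjacent cache line prefetcher: %s\n",
           (val & (1 << 1)) ? "disabled" : "enabled");
    close(fd);
    return 0;
}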

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
[2] https://groups.google.com/forum/#!msg/mechanical-sympathy/i3-M2uCYTJE/P7vyoOTIAgAJ
[3] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

-- 
Thomas Munro
http://www.enterprisedb.com
