Hi,

On 09/19/2016 09:10 PM, Robert Haas wrote:
> It's possible that the effect of this patch depends on the number of sockets. EDB test machine cthulhu has 8 sockets, and power2 has 4 sockets. I assume Dilip's tests were run on one of those two, although he doesn't seem to have mentioned which one. Your system is probably 2 or 4 sockets, which might make a difference. Results might also depend on CPU architecture; power2 is, unsurprisingly, a POWER system, whereas I assume you are testing x86. Maybe somebody who has access should test on hydra.pg.osuosl.org, which is a community POWER resource. (Send me a private email if you are a known community member who wants access for benchmarking purposes.)
Yes, I'm using x86 machines:

1) large but slightly old
   - 4 sockets, e5-4620 (so a bit old CPU, 32 cores in total)
   - kernel 3.2.80

2) smaller but fresh
   - 2 sockets, e5-2620 v4 (newest type of Xeons, 16 cores in total)
   - kernel 4.8.0
> Personally, I find the results so far posted on this thread thoroughly unimpressive. I acknowledge that Dilip's results appear to show that in a best-case scenario these patches produce a rather large gain. However, that gain seems to happen in a completely contrived scenario: astronomical client counts, unlogged tables, and a test script that maximizes pressure on CLogControlLock. If you have to work that hard to find a big win, and tests under more reasonable conditions show no benefit, it's not clear to me that it's really worth the time we're all spending benchmarking and reviewing this, or the risk of bugs, or the damage to the SLRU abstraction layer. I think there's a very good chance that we're better off moving on to projects that have a better chance of helping in the real world.
I'm posting results from two types of workloads - traditional r/w pgbench and Dilip's transaction - with synchronous_commit both on and off.
Full results (including the script driving the benchmark) are available here, if needed:

https://bitbucket.org/tvondra/group-clog-benchmark/src

It'd be good if someone could try to reproduce this on a comparable machine, to rule out my stupidity.
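In case it helps with reproduction, the runs boil down to fairly plain pgbench invocations roughly along these lines (just a rough sketch - the scale, duration and the name of Dilip's custom transaction script are assumptions here; the actual driver script and exact parameters are in the repository above, and dilip.sql is only a placeholder file name):

    # initialize with unlogged tables (all results below are on unlogged tables)
    pgbench -i -s 300 --unlogged-tables bench

    # toggle durability between runs
    psql -c "ALTER SYSTEM SET synchronous_commit = off" bench
    psql -c "SELECT pg_reload_conf()" bench

    # regular read/write pgbench, e.g. 64 clients for 5 minutes
    pgbench -c 64 -j 64 -T 300 -M prepared bench

    # the same, but driven by Dilip's custom transaction (placeholder file name)
    pgbench -c 64 -j 64 -T 300 -M prepared -f dilip.sql bench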
2 x e5-2620 v4 (16 cores, 32 with HT)
=====================================

On the "smaller" machine the results look like this - I have only tested up to 64 clients, as higher values seem rather uninteresting on a machine with only 16 physical cores.
These are averages of 5 runs, where the min/max for each group are within ~5% in most cases (see the "spread" sheet). The "e5-2620" sheet also shows the numbers as % compared to master.
dilip / sync=off         1       4       8      16      32      64
------------------------------------------------------------------
master                4756   17672   35542   57303   74596   82138
granular-locking      4745   17728   35078   56105   72983   77858
no-content-lock       4646   17650   34887   55794   73273   79000
group-update          4582   17757   35383   56974   74387   81794

dilip / sync=on          1       4       8      16      32      64
------------------------------------------------------------------
master                4819   17583   35636   57437   74620   82036
granular-locking      4568   17816   35122   56168   73192   78462
no-content-lock       4540   17662   34747   55560   73508   79320
group-update          4495   17612   35474   57095   74409   81874

pgbench / sync=off       1       4       8      16      32      64
------------------------------------------------------------------
master                3791   14368   27806   43369   54472   62956
granular-locking      3822   14462   27597   43173   56391   64669
no-content-lock       3725   14212   27471   43041   55431   63589
group-update          3895   14453   27574   43405   56783   62406

pgbench / sync=on        1       4       8      16      32      64
------------------------------------------------------------------
master                3907   14289   27802   43717   56902   62916
granular-locking      3770   14503   27636   44107   55205   63903
no-content-lock       3772   14111   27388   43054   56424   64386
group-update          3844   14334   27452   43621   55896   62498

There's pretty much no improvement at all - most of the results are within 1-2% of master, in both directions. Hardly a win.
Actually, with 1 client there seems to be a ~5% regression (e.g. granular-locking at 4568 tps vs. master at 4819 tps with sync=on), but it might also be noise and verifying it would require further testing.
4 x e5-4620 v1 (32 cores, 64 with HT)
=====================================

These are averages of 10 runs, and there are a few strange things here.

Firstly, for Dilip's workload the results get much (much) worse between 64 and 128 clients, for some reason. I suspect this might be due to the fairly old kernel (3.2.80), so I'll reboot the machine with a 4.5.x kernel and try again.
Secondly, the min/max differences get much larger than the ~5% on the smaller machine - with 128 clients, the (max-min)/average is often >100%. See the "spread" or "spread2" sheets in the attached file.
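For clarity, the spread figure is simply (max - min) / average over the per-run tps of one configuration; a quick way to compute it (with made-up per-run numbers, purely for illustration) looks like this:

    tps="14300 14150 14441 13980 30100"   # hypothetical per-run tps, one outlier run
    echo $tps | tr ' ' '\n' | \
      awk 'NR==1 {min=$1; max=$1} {s+=$1; if ($1<min) min=$1; if ($1>max) max=$1}
           END {printf "(max-min)/avg = %.0f%%\n", 100*(max-min)/(s/NR)}'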
But for some reason this only affects Dilip's workload, and apparently the patches make it measurably worse (master is at ~75%, the patches at ~120%). If you look at the tps of individual runs, there are usually nine runs with almost the same performance and then one or two much faster runs. Again, pgbench does not seem to have this issue.
I have no idea what's causing this - it might be related to the kernel, but I'm not sure why it should affect the patches differently. Let's see how the new kernel affects this.
dilip / sync=off        16      32      64     128     192
------------------------------------------------------------
master               26198   37901   37211   14441    8315
granular-locking     25829   38395   40626   14299    8160
no-content-lock      25872   38994   41053   14058    8169
group-update         26503   38911   42993   19474    8325

dilip / sync=on         16      32      64     128     192
------------------------------------------------------------
master               26138   37790   38492   13653    8337
granular-locking     25661   38586   40692   14535    8311
no-content-lock      25653   39059   41169   14370    8373
group-update         26472   39170   42126   18923    8366

pgbench / sync=off      16      32      64     128     192
------------------------------------------------------------
master               23001   35762   41202   31789    8005
granular-locking     23218   36130   42535   45850    8701
no-content-lock      23322   36553   42772   47394    8204
group-update         23129   36177   41788   46419    8163

pgbench / sync=on       16      32      64     128     192
------------------------------------------------------------
master               22904   36077   41295   35574    8297
granular-locking     23323   36254   42446   43909    8959
no-content-lock      23304   36670   42606   48440    8813
group-update         23127   36696   41859   46693    8345

So there is some improvement due to the patches at 128 clients (+30% in some cases), but it's rather useless, as 64 clients give you either comparable performance (pgbench workload) or a way better one (Dilip's workload).
Also, there's pretty much no difference between synchronous_commit on and off, probably thanks to running on unlogged tables.
I'll repeat the test on the 4-socket machine with a newer kernel, but that's probably the last benchmark I'll do for this patch for now. I agree with Robert that the cases the patch is supposed to improve are a bit contrived because of the very high client counts.
IMHO to continue with the patch (or even with testing it), we really need a credible / practical example of a real-world workload that benefits from the patches. The closest we have to that is Amit's suggestion that someone hit the commit lock when running HammerDB, but we have absolutely no idea what parameters they were using, except that they were running with synchronous_commit=off. Pgbench shows no such improvements (at least for me), at least with reasonable parameters.
regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: results.ods (application/vnd.oasis.opendocument.spreadsheet)