On Thu, Oct 13, 2016 at 7:53 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 10/12/2016 08:55 PM, Robert Haas wrote:
>> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbal...@gmail.com> wrote:
>>> I think at higher client counts, from 96 onwards, contention on
>>> CLogControlLock is clearly visible, and it is completely solved with
>>> the group lock patch.
>>>
>>> And at lower client counts (32, 64), contention on CLogControlLock is
>>> not significant, hence we cannot see any gain with the group lock
>>> patch (though we can see that contention on CLogControlLock is
>>> somewhat reduced at 64 clients).
>>
>> I agree with these conclusions.  I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86.  In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>>
>> 2. Repeat this test with a mixed read-write workload, like -b
>> tpcb-like@1 -b select-only@9
>>
>
> FWIW, I'm already running similar benchmarks on an x86 machine with 72
> cores (144 with HT). It's "just" a 4-socket system, but the results I
> got so far seem quite interesting. The tooling and results (pushed
> incrementally) are available here:
>
> https://bitbucket.org/tvondra/hp05-results/overview
>
> The tooling is completely automated, and it also collects various
> stats, like for example the wait events. So perhaps we could simply run
> it on cthulhu and get comparable results, and also more thorough data
> sets than just snippets posted to the list?
>
> There's also a bunch of reports for the 5 already completed runs:
>
> - dilip-300-logged-sync
> - dilip-300-unlogged-sync
> - pgbench-300-logged-sync-skip
> - pgbench-300-unlogged-sync-noskip
> - pgbench-300-unlogged-sync-skip
>
> The name identifies the workload type, the scale, and whether the
> tables are WAL-logged (for pgbench, "skip" means "-N" while "noskip"
> means regular pgbench).
>
> For example, "reports/wait-events-count-patches.txt" compares the wait
> event stats with the different patches applied (and master):
>
> https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default
>
> and average tps (from 3 runs, 5 minutes each):
>
> https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default
>
> There are certainly interesting bits. For example, while the "logged"
> case is dominated by WALWriteLock for most client counts, for large
> client counts that's no longer true.
>
> Consider for example the dilip-300-logged-sync results with 216 clients:
>
>  wait_event          |  master | gran_lock | no_cont_lock | group_upd
> ---------------------+---------+-----------+--------------+-----------
>  CLogControlLock     |  624566 |    474261 |       458599 |    225338
>  WALWriteLock        |  431106 |    623142 |       619596 |    699224
>                      |  331542 |    358220 |       371393 |    537076
>  buffer_content      |  261308 |    134764 |       138664 |    102057
>  ClientRead          |   59826 |    100883 |       103609 |    118379
>  transactionid       |   26966 |     23155 |        23815 |     31700
>  ProcArrayLock       |    3967 |      3852 |         4070 |      4576
>  wal_insert          |    3948 |     10430 |         9513 |     12079
>  clog                |    1710 |      4006 |         2443 |       925
>  XidGenLock          |    1689 |      3785 |         4229 |      3539
>  tuple               |     965 |       617 |          655 |       840
>  lock_manager        |     300 |       571 |          619 |       802
>  WALBufMappingLock   |     168 |       140 |          158 |       147
>  SubtransControlLock |      60 |       115 |          124 |       105
>
> Clearly, CLOG is an issue here, and it's (slightly) improved by all the
> patches (group_update performing the best). And with 288 clients (which
> is 2x the number of virtual cores in the machine, so not entirely
> crazy) you get this:
>
>  wait_event          |  master | gran_lock | no_cont_lock | group_upd
> ---------------------+---------+-----------+--------------+-----------
>  CLogControlLock     |  901670 |    736822 |       728823 |    398111
>  buffer_content      |  492637 |    318129 |       319251 |    270416
>  WALWriteLock        |  414371 |    593804 |       589809 |    656613
>                      |  380344 |    452936 |       470178 |    745790
>  ClientRead          |   60261 |    111367 |       111391 |    126151
>  transactionid       |   43627 |     34585 |        35464 |     48679
>  wal_insert          |    5423 |     29323 |        25898 |     30191
>  ProcArrayLock       |    4379 |      3918 |         4006 |      4582
>  clog                |    2952 |      9135 |         5304 |      2514
>  XidGenLock          |    2182 |      9488 |         8894 |      8595
>  tuple               |    2176 |      1288 |         1409 |      1821
>  lock_manager        |     323 |       797 |          827 |      1006
>  WALBufMappingLock   |     124 |       124 |          146 |       206
>  SubtransControlLock |      85 |       146 |          170 |       120
>
> So even buffer_content gets ahead of the WALWriteLock. I wonder whether
> this might be because of only having 128 buffers for clog pages,
> causing contention on this system (surely, systems with 144 cores were
> not that common when the 128 limit was introduced).
>
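For reference, the 128-page cap you mention is hard-coded in the CLOG
shared-memory sizing.  If I remember the 9.6 code correctly, clog.c
computes the number of clog buffers roughly as:

Size
CLOGShmemBuffers(void)
{
	return Min(128, Max(4, NBuffers / 512));
}

so any shared_buffers setting above roughly 512MB is already pinned at
the 128-buffer ceiling, regardless of how many cores the machine has.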
I am not sure that the 128-buffer limit is the culprit here, but I have
checked that increasing clog buffers beyond 128 causes a dip in
performance for read-write workloads in some cases.  Apart from that,
the above results make it quite clear that the patches significantly
reduce the CLogControlLock contention, with the group-update patch
consistently the best, probably because this workload is contended
mostly on writing the transaction status.

> So the patch has positive impact even with WAL, as illustrated by tps
> improvements (for large client counts):
>
>  clients | master | gran_locking | no_content_lock | group_update
> ---------+--------+--------------+-----------------+--------------
>       36 |  39725 |        39627 |           41203 |        39763
>       72 |  70533 |        65795 |           65602 |        66195
>      108 |  81664 |        87415 |           86896 |        87199
>      144 |  68950 |        98054 |           98266 |       102834
>      180 | 105741 |       109827 |          109201 |       113911
>      216 |  62789 |        92193 |           90586 |        98995
>      252 |  94243 |       102368 |          100663 |       107515
>      288 |  57895 |        83608 |           82556 |        91738
>
> I find the tps fluctuation intriguing, and I'd like to see that fixed
> before committing any of the patches.
>

I have checked the wait event results for the runs where there is more
fluctuation:

          test           | clients | wait_event_type |   wait_event    | master | granular_locking | no_content_lock | group_update
-------------------------+---------+-----------------+-----------------+--------+------------------+-----------------+--------------
 dilip-300-unlogged-sync |     108 | LWLockNamed     | CLogControlLock | 343526 |           502127 |          479937 |       301381
 dilip-300-unlogged-sync |     180 | LWLockNamed     | CLogControlLock | 557639 |           835567 |          795403 |       512707

So, if I read the above results correctly, group-update has helped only
slightly in reducing the contention here.  One probable reason could be
that this workload has to update the transaction status on different
clog pages more frequently, and may need to perform disk reads for clog
pages as well, so the benefit of grouping will certainly be smaller:
the page read requests get serialized, and the leader backend alone has
to perform all of them (see the small sketch further below).  Robert
pointed out a somewhat similar case upthread [1], and I had modified
the patch to use multiple slots (groups) for the group transaction
status update [2], but we didn't pursue it because it showed no benefit
on the pgbench workload.  However, it might show some benefit here; if
we can make the above results reproducible and you think the above
theory sounds reasonable, then I can again modify the patch based on
that idea.

Now, the story with the granular_locking and no_content_lock patches
seems to be worse, because they appear to increase the contention on
CLogControlLock rather than reduce it.  One probable reason, for both
approaches, is that a backend frequently has to release the
CLogControlLock it acquired in shared mode and reacquire it in
exclusive mode when the clog page to modify is not in a buffer (the
update targets a different clog page than the one currently buffered),
and then it has to release CLogControlLock once again to read the clog
page from disk and acquire it in exclusive mode yet again.  This
frequent release and reacquire of CLogControlLock in different modes
could lead to a significant increase in contention.  It is slightly
worse for the granular_locking patch, as it needs one additional lock
(buffer_content_lock) in exclusive mode after acquiring
CLogControlLock.  Offhand, I could not see a way to reduce the
contention with the granular_locking and no_content_lock patches.
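To make the leader idea above a bit more concrete, here is a minimal
standalone sketch of the pattern (plain C11 atomics and pthreads; all
names are invented for illustration, nothing here is taken from the
actual patch): whoever pushes its request onto an empty pending list
becomes the leader, takes the lock once, and applies the whole batch on
behalf of the group.

/* group_update_demo.c -- standalone illustration only, not patch code.
 * Build with: cc -O2 -pthread group_update_demo.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 8
#define NXACTS   100000                 /* "transactions" per thread */

typedef struct Request
{
	int             xid;                /* which slot to update */
	int             status;             /* value to store there */
	struct Request *next;               /* next member of the group */
	atomic_int      done;               /* set by the leader once applied */
} Request;

static int clog_status[NTHREADS * NXACTS];      /* stand-in for the clog pages */
static pthread_mutex_t clog_lock = PTHREAD_MUTEX_INITIALIZER;   /* "CLogControlLock" */
static _Atomic(Request *) group_head;           /* pending-request list */

static void
group_update(Request *req)
{
	/* lock-free push of our request onto the pending list */
	Request *head = atomic_load(&group_head);

	do
	{
		req->next = head;
	} while (!atomic_compare_exchange_weak(&group_head, &head, req));

	if (head != NULL)
	{
		/* someone was queued before us, so a leader already exists;
		 * just wait until it has applied our update (the real patch
		 * would sleep instead of spinning) */
		while (!atomic_load(&req->done))
			;
		return;
	}

	/* we pushed onto an empty list: we are the leader of this group */
	pthread_mutex_lock(&clog_lock);

	/* claim the whole list; later arrivals will start a new group */
	Request *member = atomic_exchange(&group_head, (Request *) NULL);

	/* apply every member's update under this single lock acquisition */
	while (member != NULL)
	{
		Request *next = member->next;   /* read before waking the member */

		clog_status[member->xid] = member->status;
		if (member != req)
			atomic_store(&member->done, 1);
		member = next;
	}

	pthread_mutex_unlock(&clog_lock);
}

static void *
worker(void *arg)
{
	int base = (int) (intptr_t) arg * NXACTS;

	for (int i = 0; i < NXACTS; i++)
	{
		Request req = {.xid = base + i, .status = 1};

		atomic_init(&req.done, 0);
		group_update(&req);
	}
	return NULL;
}

int
main(void)
{
	pthread_t tids[NTHREADS];
	long      applied = 0;

	for (intptr_t i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, (void *) i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);

	for (int i = 0; i < NTHREADS * NXACTS; i++)
		applied += clog_status[i];
	printf("applied %ld of %d updates\n", applied, NTHREADS * NXACTS);
	return 0;
}

The real patch of course works on PGPROC entries and does not busy-wait
like this sketch does, but the basic shape is the same: one exclusive
lock acquisition amortized over many transaction-status updates, with
the leader doing all of the work (including any page reads) on behalf
of the group.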
So, the crux is that we are seeing more variability in some of the
results because of frequent accesses to different clog pages, which is
not easy to predict, but I think it is quite possible at ~100,000 tps.

>
> There's certainly much more interesting stuff in the results, but I
> don't have time for more thorough analysis now - I only intended to do
> some "quick benchmarking" on the patch, and I've already spent days on
> this, and I have other things to do.
>

Thanks a ton for doing such detailed testing.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoahCx6XgprR%3Dp5%3D%3DcF0g9uhSHsJxVdWdUEHN9H2Mv0gkw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1%2BSoW3FBrdZV%2B3m34uCByK3DMPy_9QQs34yvN8spByzyA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers