On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, k.jami...@fujitsu.com wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> <tomas.von...@2ndquadrant.com> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> > you've only done tests / benchmark for the case this optimizes. I
> >> > think we need to see a benchmark for workload that allocates and
> >> > invalidates lot of buffers. A pgbench with a workload that fits into
> >> > RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> >> workloads. Not only do you have the overhead of the hash table
> >> operations, but you also have locking overhead around that. A whole
> >> new set of LWLocks where you have to take and release one of them
> >> every time you allocate or invalidate a buffer seems likely to cause
> >> a pretty substantial contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c
> >  since most of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now.
> >
> >Thanks for the advice on benchmark test. Please refer below for test and results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers]
> >pgbench -i -s 3125 cachetest
> >pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >scaling factor: 3125
> >query mode: simple
> >number of clients: 16
> >number of threads: 8
> >duration: 600 s
> >number of transactions actually processed: 8815123
> >latency average = 1.089 ms
> >tps = 14691.436343 (including connections establishing)
> >tps = 14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >...
> >number of transactions actually processed: 8852327
> >latency average = 1.084 ms
> >tps = 14753.814648 (including connections establishing)
> >tps = 14753.861589 (excluding connections establishing)
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
>   # 16 clients
>   pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
>   # 1 client
>   pgbench -n -S -c 1 -M prepared -T 60 test
>
> and average from 30 runs of each looks like this:
>
>   # clients      master       patched         %
>   ---------------------------------------------------------
>   1              29690        27833           93.7%
>   16             300935       283383          94.1%
>
> That's quite significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).
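For context on the locking concern above: the existing buffer mapping table already
takes one partitioned LWLock per lookup in the allocation path, roughly as in the
simplified fragment below (abbreviated from the BufferAlloc() pattern in bufmgr.c,
not verbatim backend code). The new hash table adds a second lookup and lock of the
same kind on top of this, which is where the extra overhead comes from.

    /*
     * Simplified sketch (not verbatim) of the buffer-mapping lookup that
     * already happens in BufferAlloc(): hash the buffer tag, pick one of
     * the mapping partition LWLocks, and take/release it for the lookup.
     */
    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/smgr.h"

    static int
    lookup_buffer_sketch(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum)
    {
        BufferTag   tag;            /* identity of the requested block */
        uint32      hash;           /* hash of the buffer tag */
        LWLock     *partitionLock;  /* mapping partition covering that hash */
        int         buf_id;

        INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);

        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        /* every buffer lookup takes and releases one partition LWLock */
        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        return buf_id;              /* -1 if the block is not in shared buffers */
    }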
I updated the patch and reduced the lock contention of the new LWLock: the number
of partitions is now tunable via definitions in the code, and instead of using only
the rnode as the hash key, I also added the modulo of the block number (a rough
sketch of this partitioning idea is appended at the end of this mail).

    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL   4    /* block-level */
    #define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

I executed the read-only benchmark again; the regression now sits at 3.10%
(reduced from v3's 6%).

Average of 10 runs, 16 clients, read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76

I also checked the wait event statistics (non-impactful events omitted) and got
the results below. I reset the stats before running the pgbench script, then
captured them right after the run.

[Master]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |    25116 | 49552452
 IO              | DataFileRead          | 14467109 | 92113056
 LWLock          | buffer_mapping        |   204618 |  1364779

[Patch V4]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |   111393 | 68773946
 IO              | DataFileRead          | 14186773 | 90399833
 LWLock          | buffer_mapping        |   463844 |  4025198
 LWLock          | cached_buf_tranche_id |    83390 |   336080

The total wait time on the buffer_mapping LWLock is roughly 3x higher with the
patch (with about twice as many calls). Still, I'd like to continue working on
this patch in the next commitfest and further reduce its impact on read-only
workloads.

Regards,
Kirk Jamison
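To make the partitioning idea above concrete, here is a rough sketch of how a
partition index could be derived from a relation-level hash and the block number,
using the definitions quoted above. This is illustrative only; the helper name
CachedBufGetPartition and the exact hashing are my shorthand, not the actual patch
code, and the usual backend typedefs (uint32, BlockNumber) are assumed.

    /*
     * Illustrative sketch only -- not the actual patch code.  Shows how the
     * relation-level and block-level partition counts could combine into a
     * single partition index.
     */
    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL   4    /* block-level */
    #define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

    /* hypothetical helper: pick a partition for (relation hash, block number) */
    static inline int
    CachedBufGetPartition(uint32 relHash, BlockNumber blockNum)
    {
        int     rel_part = relHash % NUM_MAP_PARTITIONS_FOR_REL;
        int     blk_part = blockNum % NUM_MAP_PARTITIONS_IN_REL;

        return rel_part * NUM_MAP_PARTITIONS_IN_REL + blk_part;
    }

The intent, as I understand it, is that all buffers of one relation fall into at
most NUM_MAP_PARTITIONS_IN_REL of the NUM_MAP_PARTITIONS partitions, so dropping a
relation's buffers only needs to visit a few partitions, while concurrent lookups
on the same relation are spread across more than one LWLock instead of all
contending on a single one.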
v4-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description: v4-Optimize-dropping-of-relation-buffers-using-dlist.patch