On Wed, Nov 13, 2019 4:20AM (GMT +9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, k.jami...@fujitsu.com wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> <tomas.von...@2ndquadrant.com> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> > you've only done tests / benchmark for the case this optimizes. I
> >> > think we need to see a benchmark for workload that allocates and
> >> > invalidates lot of buffers. A pgbench with a workload that fits into
> >> > RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> >> workloads. Not only do you have the overhead of the hash table
> >> operations, but you also have locking overhead around that. A whole
> >> new set of LWLocks where you have to take and release one of them
> >> every time you allocate or invalidate a buffer seems likely to cause
> >> a pretty substantial contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c
> >  since most of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now.
> >
> >Thanks for the advice on benchmark test. Please refer below for test and results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers]
> >pgbench -i -s 3125 cachetest
> >pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >scaling factor: 3125
> >query mode: simple
> >number of clients: 16
> >number of threads: 8
> >duration: 600 s
> >number of transactions actually processed: 8815123
> >latency average = 1.089 ms
> >tps = 14691.436343 (including connections establishing)
> >tps = 14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >...
> >number of transactions actually processed: 8852327
> >latency average = 1.084 ms
> >tps = 14753.814648 (including connections establishing)
> >tps = 14753.861589 (excluding connections establishing)
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
>   # 16 clients
>   pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
>   # 1 client
>   pgbench -n -S -c 1 -M prepared -T 60 test
>
> and average from 30 runs of each looks like this:
>
>   # clients      master       patched         %
>   ---------------------------------------------------------
>   1              29690        27833           93.7%
>   16             300935       283383          94.1%
>
> That's quite significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).
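For context on the locking concern above: the existing buffer mapping table already
takes one partitioned LWLock per lookup in the allocation path, roughly as in the
simplified fragment below (abbreviated from the BufferAlloc() pattern in bufmgr.c,
not verbatim backend code). The new hash table adds a second lookup and lock of the
same kind on top of this, which is where the extra overhead comes from.

    /*
     * Simplified sketch (not verbatim) of the buffer-mapping lookup that
     * already happens in BufferAlloc(): hash the buffer tag, pick one of
     * the mapping partition LWLocks, and take/release it for the lookup.
     */
    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/smgr.h"

    static int
    lookup_buffer_sketch(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum)
    {
        BufferTag   tag;            /* identity of the requested block */
        uint32      hash;           /* hash of the buffer tag */
        LWLock     *partitionLock;  /* mapping partition covering that hash */
        int         buf_id;

        INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);

        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        /* every buffer lookup takes and releases one partition LWLock */
        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        return buf_id;              /* -1 if the block is not in shared buffers */
    }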
I updated the patch and reduced the lock contention of the new LWLock: the number
of partitions is now tunable via definitions in the code, and instead of using only
the rnode as the hash key, I also added the modulo of the block number (a rough
sketch of this partitioning idea is appended at the end of this mail).

    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL   4    /* block-level */
    #define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

I executed the read-only benchmark again; the regression now sits at 3.10%
(reduced from v3's 6%).

Average of 10 runs, 16 clients, read-only, prepared query mode

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76

I also checked the wait event statistics (non-impactful events omitted) and got
the results below. I reset the stats before running the pgbench script, then
captured them right after the run.

[Master]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |    25116 | 49552452
 IO              | DataFileRead          | 14467109 | 92113056
 LWLock          | buffer_mapping        |   204618 |  1364779

[Patch V4]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |   111393 | 68773946
 IO              | DataFileRead          | 14186773 | 90399833
 LWLock          | buffer_mapping        |   463844 |  4025198
 LWLock          | cached_buf_tranche_id |    83390 |   336080

The total wait time on the buffer_mapping LWLock is roughly 3x higher with the
patch (with about twice as many calls). Still, I'd like to continue working on
this patch in the next commitfest and further reduce its impact on read-only
workloads.

Regards,
Kirk Jamison
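To make the partitioning idea above concrete, here is a rough sketch of how a
partition index could be derived from a relation-level hash and the block number,
using the definitions quoted above. This is illustrative only; the helper name
CachedBufGetPartition and the exact hashing are my shorthand, not the actual patch
code, and the usual backend typedefs (uint32, BlockNumber) are assumed.

    /*
     * Illustrative sketch only -- not the actual patch code.  Shows how the
     * relation-level and block-level partition counts could combine into a
     * single partition index.
     */
    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL   4    /* block-level */
    #define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

    /* hypothetical helper: pick a partition for (relation hash, block number) */
    static inline int
    CachedBufGetPartition(uint32 relHash, BlockNumber blockNum)
    {
        int     rel_part = relHash % NUM_MAP_PARTITIONS_FOR_REL;
        int     blk_part = blockNum % NUM_MAP_PARTITIONS_IN_REL;

        return rel_part * NUM_MAP_PARTITIONS_IN_REL + blk_part;
    }

The intent, as I understand it, is that all buffers of one relation fall into at
most NUM_MAP_PARTITIONS_IN_REL of the NUM_MAP_PARTITIONS partitions, so dropping a
relation's buffers only needs to visit a few partitions, while concurrent lookups
on the same relation are spread across more than one LWLock instead of all
contending on a single one.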
v4-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description: v4-Optimize-dropping-of-relation-buffers-using-dlist.patch