As mentioned previously, I have been working on improving shared buffer eviction, in particular by reducing contention around BufFreelistLock, and I would like to share my progress on the same.
The test used for this work is mainly the case where all the data doesn't fit in shared buffers but does fit in memory. It is based on the comparison Robert did previously for a similar workload:
http://rhaas.blogspot.in/2012/03/performance-and-scalability-on-ibm.html

To start with, I have taken an LWLOCK_STATS report to confirm the contention around BufFreelistLock; the data for HEAD is as follows:

M/c details
IBM POWER-7
16 cores, 64 hardware threads
RAM - 64GB

Test
scale factor = 3000
shared_buffers = 8GB
number_of_threads = 64
duration = 5 mins
./pgbench -c 64 -j 64 -T 300 -S postgres

LWLOCK_STATS data for BufFreeListLock
PID 11762 lwlock main 0: shacq 0 exacq 253988 blk 29023

Here the high *blk* count for scale factor 3000 clearly shows that, when the data doesn't fit in shared buffers, backends have to wait to find a usable buffer.

To address this, I have implemented a patch which ensures that there are always enough buffers on the freelist so that backends rarely need to run the clock sweep themselves. The implementation idea is more or less the same as discussed previously in the thread below; I explain it at the end of this mail.
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kap...@huawei.com

LWLOCK_STATS data after the patch (same test as used for HEAD):

BufFreeListLock
PID 7257 lwlock main 0: shacq 0 exacq 165 blk 18 spindelay 0

Here the low *exacq* and *blk* counts show that the need for backends to run the clock sweep has reduced significantly.

Performance Data
-------------------------------
shared_buffers = 8GB
number of threads = 64
sc - scale factor

           sc     tps
Head     3000   45569
Patch    3000   46457
Head     1000   93037
Patch    1000   92711

The above data shows no significant change in performance or scalability, even though contention around BufFreelistLock is reduced significantly. I have analyzed the patch with both perf record and LWLOCK_STATS; both indicate heavy contention around the BufMappingLocks.

Data with perf record -a -g
-----------------------------------------
+  10.14%     swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   7.77%    postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
+   6.88%    postgres  [kernel.kallsyms]  [k] .function_trace_call
+   4.15%     pgbench  [kernel.kallsyms]  [k] .try_to_wake_up
+   3.20%     swapper  [kernel.kallsyms]  [k] .function_trace_call
+   2.99%     pgbench  [kernel.kallsyms]  [k] .function_trace_call
+   2.41%    postgres  postgres           [.] AllocSetAlloc
+   2.38%    postgres  [kernel.kallsyms]  [k] .try_to_wake_up
+   2.27%     pgbench  [kernel.kallsyms]  [k] ._raw_spin_lock
+   1.49%    postgres  [kernel.kallsyms]  [k] ._raw_spin_lock_irq
+   1.36%    postgres  postgres           [.] AllocSetFreeIndex
+   1.09%     swapper  [kernel.kallsyms]  [k] ._raw_spin_lock
+   0.91%    postgres  postgres           [.] GetSnapshotData
+   0.90%    postgres  postgres           [.] MemoryContextAllocZeroAligned

Expanded graph
------------------------------
-  10.14%     swapper  [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
   - .pseries_dedicated_idle_sleep
      - 10.13% .pseries_dedicated_idle_sleep
         - 10.13% .cpu_idle
            - 10.00% .start_secondary
                 .start_secondary_prolog
-   7.77%    postgres  [kernel.kallsyms]  [k] ._raw_spin_lock
   - ._raw_spin_lock
      - 6.63% ._raw_spin_lock
         - 5.95% .double_rq_lock
              .load_balance
            - 5.95% .__schedule
                 .schedule
               - 3.27% .SyS_semtimedop
                    .SyS_ipc
                    syscall_exit
                    semop
                    PGSemaphoreLock
                    LWLockAcquireCommon
                  - LWLockAcquire
                     - 3.27% BufferAlloc
                          ReadBuffer_common
                        - ReadBufferExtended
                           - 3.27% ReadBuffer
                              - 2.73% ReleaseAndReadBuffer
                                 - 1.70% _bt_relandgetbuf
                                      _bt_search
                                      _bt_first
                                      btgettuple

This shows BufferAlloc -> LWLockAcquire as the top contributor, and BufferAlloc uses the BufMappingLocks. I have checked the other expanded call stacks as well; StrategyGetBuffer is not among the top contributors.
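To make it clearer why the BufMappingLocks show up here: every lookup in BufferAlloc() goes through one of the NUM_BUFFER_PARTITIONS mapping locks, chosen by the hash of the buffer tag. Roughly like the following (a simplified sketch of that path, not the exact code in bufmgr.c; pinning, victim-buffer handling and error paths are omitted):

    /*
     * Simplified sketch of the buffer-mapping path in BufferAlloc()
     * (src/backend/storage/buffer/bufmgr.c).
     */
    BufferTag   newTag;         /* identity of the requested block */
    uint32      newHash;
    int         buf_id;

    INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

    /* the tag's hash code picks one of NUM_BUFFER_PARTITIONS BufMappingLocks */
    newHash = BufTableHashCode(&newTag);

    /* a shared lock is enough for the lookup ... */
    LWLockAcquire(BufMappingPartitionLock(newHash), LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(BufMappingPartitionLock(newHash));

    /*
     * ... but on a miss the mapping has to change: the new tag's partition
     * (and possibly the old tag's partition) must be taken in LW_EXCLUSIVE
     * mode to insert/delete hash table entries.  When the working set is
     * larger than shared_buffers this happens constantly, which matches the
     * blk counts on locks 38-53 shown below.
     */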
Data with LWLOCK_STATS
----------------------------------------------
BufMappingLocks

PID 7245 lwlock main 38: shacq 41117 exacq 34561 blk 36274 spindelay 101
PID 7310 lwlock main 39: shacq 40257 exacq 34219 blk 25886 spindelay 72
PID 7308 lwlock main 40: shacq 41024 exacq 34794 blk 20780 spindelay 54
PID 7314 lwlock main 40: shacq 41195 exacq 34848 blk 20638 spindelay 60
PID 7288 lwlock main 41: shacq 84398 exacq 34750 blk 29591 spindelay 128
PID 7208 lwlock main 42: shacq 63107 exacq 34737 blk 20133 spindelay 81
PID 7245 lwlock main 43: shacq 278001 exacq 34601 blk 53473 spindelay 503
PID 7307 lwlock main 44: shacq 85155 exacq 34440 blk 19062 spindelay 71
PID 7301 lwlock main 45: shacq 61999 exacq 34757 blk 13184 spindelay 46
PID 7235 lwlock main 46: shacq 41199 exacq 34622 blk 9031 spindelay 30
PID 7324 lwlock main 46: shacq 40906 exacq 34692 blk 8799 spindelay 14
PID 7292 lwlock main 47: shacq 41180 exacq 34604 blk 8241 spindelay 25
PID 7303 lwlock main 48: shacq 40727 exacq 34651 blk 7567 spindelay 30
PID 7230 lwlock main 49: shacq 60416 exacq 34544 blk 9007 spindelay 28
PID 7300 lwlock main 50: shacq 44591 exacq 34763 blk 6687 spindelay 25
PID 7317 lwlock main 50: shacq 44349 exacq 34583 blk 6861 spindelay 22
PID 7305 lwlock main 51: shacq 62626 exacq 34671 blk 7864 spindelay 29
PID 7301 lwlock main 52: shacq 60646 exacq 34512 blk 7093 spindelay 36
PID 7324 lwlock main 53: shacq 39756 exacq 34359 blk 5138 spindelay 22

This data shows that with the patch there is no contention on BufFreeListLock, but there is heavy contention around the BufMappingLocks. I have checked that HEAD also has contention around the BufMappingLocks.

My analysis so far suggests that reducing contention around BufFreelistLock alone is not sufficient to improve scalability; we need to reduce contention around the BufMappingLocks as well.

Details of patch
------------------------
1. Changed bgwriter to move buffers with a usage_count of zero onto the freelist, based on a threshold (high_watermark), and to decrement the usage count when it is greater than zero.
2. StrategyGetBuffer() wakes bgwriter when the number of buffers on the freelist drops below low_watermark. Currently these are hard-coded values; we can make them configurable later if required.
3. Getting a buffer from the freelist is done under a spinlock, while the clock sweep still runs under BufFreelistLock.

A rough sketch of this scheme is included below.
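To make the scheme concrete, here are the two sides of it in outline. This is only an illustration of the idea, not the actual patch: names such as HIGH_WATERMARK, LOW_WATERMARK, freelist_lck and numFreeListBuffers are invented for this sketch, and the handling of dirty or pinned victim buffers is omitted.

    /* Illustrative only -- see the attached patch for the real implementation. */
    #define HIGH_WATERMARK  2000    /* hard-coded for now; could become a GUC */
    #define LOW_WATERMARK    100

    /*
     * bgwriter side: run the clock sweep (still under BufFreelistLock in this
     * WIP version) and keep pushing usage_count == 0 buffers onto the freelist
     * until it is filled up to HIGH_WATERMARK.
     */
    static void
    BgMoveBuffersToFreelist(void)
    {
        while (StrategyControl->numFreeListBuffers < HIGH_WATERMARK)
        {
            volatile BufferDesc *buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

            StrategyControl->nextVictimBuffer =
                (StrategyControl->nextVictimBuffer + 1) % NBuffers;

            LockBufHdr(buf);
            if (buf->refcount == 0 && buf->usage_count == 0)
            {
                UnlockBufHdr(buf);

                /* freelist manipulation happens under a spinlock, not an LWLock */
                SpinLockAcquire(&StrategyControl->freelist_lck);
                buf->freeNext = StrategyControl->firstFreeBuffer;
                StrategyControl->firstFreeBuffer = buf->buf_id;
                StrategyControl->numFreeListBuffers++;
                SpinLockRelease(&StrategyControl->freelist_lck);
                continue;
            }
            if (buf->usage_count > 0)
                buf->usage_count--;     /* age the buffer for a later pass */
            UnlockBufHdr(buf);
        }
    }

    /*
     * backend side, in StrategyGetBuffer(): pop a buffer from the freelist
     * under the spinlock and wake bgwriter once the list runs low, so that
     * backends almost never have to run the clock sweep themselves.
     * (Revalidation of the popped buffer is omitted here.)
     */
    SpinLockAcquire(&StrategyControl->freelist_lck);
    if (StrategyControl->firstFreeBuffer >= 0)
    {
        buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
        StrategyControl->firstFreeBuffer = buf->freeNext;
        StrategyControl->numFreeListBuffers--;
    }
    SpinLockRelease(&StrategyControl->freelist_lck);

    if (StrategyControl->numFreeListBuffers < LOW_WATERMARK)
        SetLatch(StrategyControl->bgwriterLatch);   /* ask bgwriter to refill */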
This is still a WIP patch, and some of the changes are only a prototype to check the idea; for example, I have hacked the bgwriter code so that it continuously fills the freelist until the number of buffers on it reaches high_watermark, and I have commented out some of the previous code.

Thoughts?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment: scalable_buffer_eviction_v1.patch