On Fri, Oct 1, 2021 at 3:26 PM Yura Sokolov <y.soko...@postgrespro.ru> wrote:
> Good day.
>
> I found an opportunity in the Buffer Manager code, in the BufferAlloc
> function. When a valid buffer is evicted, BufferAlloc acquires two
> partition lwlocks: one for the partition the evicted block lives in,
> and one for the partition where the new block will be placed.
>
> This doesn't matter while the number of concurrent replacements is
> small. But when a lot of concurrent backends are replacing buffers, a
> complex net of lock dependencies quickly arises.
>
> It can easily be seen with select-only pgbench at scale 100 with
> 128MB of shared buffers: scale 100 produces about 1.5GB of tables,
> which certainly does not fit in shared buffers. Performance starts to
> degrade at ~100 connections. Even with 1GB of shared buffers it
> slowly degrades after 150 connections.
>
> But strictly speaking, there is no need to hold both locks
> simultaneously. The buffer is pinned, so other processes cannot
> select it for eviction. Once the tag is cleared and the buffer is
> removed from the old partition, other processes will not find it.
> Therefore it is safe to release the old partition lock before
> acquiring the new partition lock.
>
> If another process concurrently inserts the same new buffer, the old
> buffer is placed on the buffer manager's freelist instead.
>
> An additional optimisation: when the old buffer is reused, there is
> no need to put its BufferLookupEnt back on dynahash's freelist. That
> reduces lock contention a bit more. To accomplish this,
> FreeListData.nentries is changed to pg_atomic_uint32/pg_atomic_uint64
> and atomic increment/decrement is used.
>
> Remark: there was a bug in `hash_update_hash_key`: nentries were not
> kept in sync if the freelist partitions differ. This bug was never
> triggered, because the single existing use of `hash_update_hash_key`
> does not move entries between partitions.
> Here are some test results.
>
> - pgbench at scale 100 was run with --select-only (since we want to
>   test the buffer manager alone); it produces a 1.5GB table
> - two shared_buffers values were tested: 128MB and 1GB
> - the second-best result was taken among five runs
>
> Tests were made in three system configurations:
>
> - a notebook with an i7-1165G7 (limited to 2.8GHz to avoid
>   overheating)
> - a Xeon X5675 6-core, 2-socket NUMA system (12 cores/24 threads)
> - the same Xeon X5675, but restricted to a single socket
>   (with numactl -m 0 -N 0)
>
> Results for the i7-1165G7:
>
>   conns |     master |    patched |  master 1G | patched 1G
> --------+------------+------------+------------+------------
>       1 |      29667 |      29079 |      29425 |      29411
>       2 |      55577 |      55553 |      57974 |      57223
>       3 |      87393 |      87924 |      87246 |      89210
>       5 |     136222 |     136879 |     133775 |     133949
>       7 |     179865 |     176734 |     178297 |     175559
>      17 |     215953 |     214708 |     222908 |     223651
>      27 |     211162 |     213014 |     220506 |     219752
>      53 |     211620 |     218702 |     220906 |     225218
>      83 |     213488 |     221799 |     219075 |     228096
>     107 |     212018 |     222110 |     222502 |     227825
>     139 |     207068 |     220812 |     218191 |     226712
>     163 |     203716 |     220793 |     213498 |     226493
>     191 |     199248 |     217486 |     210994 |     221026
>     211 |     195887 |     217356 |     209601 |     219397
>     239 |     193133 |     215695 |     209023 |     218773
>     271 |     190686 |     213668 |     207181 |     219137
>     307 |     188066 |     214120 |     205392 |     218782
>     353 |     185449 |     213570 |     202120 |     217786
>     397 |     182173 |     212168 |     201285 |     216489
>
> Results for 1 socket X5675:
>
>   conns |     master |    patched |  master 1G | patched 1G
> --------+------------+------------+------------+------------
>       1 |      16864 |      16584 |      17419 |      17630
>       2 |      32764 |      32735 |      34593 |      34000
>       3 |      47258 |      46022 |      49570 |      47432
>       5 |      64487 |      64929 |      68369 |      68885
>       7 |      81932 |      82034 |      87543 |      87538
>      17 |     114502 |     114218 |     127347 |     127448
>      27 |     116030 |     115758 |     130003 |     128890
>      53 |     116814 |     117197 |     131142 |     131080
>      83 |     114438 |     116704 |     130198 |     130985
>     107 |     113255 |     116910 |     129932 |     131468
>     139 |     111577 |     116929 |     129012 |     131782
>     163 |     110477 |     116818 |     128628 |     131697
>     191 |     109237 |     116672 |     127833 |     131586
>     211 |     108248 |     116396 |     127474 |     131650
>     239 |     107443 |     116237 |     126731 |     131760
>     271 |     106434 |     115813 |     126009 |     131526
>     307 |     105077 |     115542 |     125279 |     131421
>     353 |     104516 |     115277 |     124491 |     131276
>     397 |     103016 |     114842 |     123624 |     131019
>
> Results for 2 socket X5675:
>
>   conns |     master |    patched |  master 1G | patched 1G
> --------+------------+------------+------------+------------
>       1 |      16323 |      16280 |      16959 |      17598
>       2 |      30510 |      31431 |      33763 |      31690
>       3 |      45051 |      45834 |      48896 |      47991
>       5 |      71800 |      73208 |      78077 |      74714
>       7 |      89792 |      89980 |      95986 |      96662
>      17 |     178319 |     177979 |     195566 |     196143
>      27 |     210475 |     205209 |     226966 |     235249
>      53 |     222857 |     220256 |     252673 |     251041
>      83 |     219652 |     219938 |     250309 |     250464
>     107 |     218468 |     219849 |     251312 |     251425
>     139 |     210486 |     217003 |     250029 |     250695
>     163 |     204068 |     218424 |     248234 |     252940
>     191 |     200014 |     218224 |     246622 |     253331
>     211 |     197608 |     218033 |     245331 |     253055
>     239 |     195036 |     218398 |     243306 |     253394
>     271 |     192780 |     217747 |     241406 |     253148
>     307 |     189490 |     217607 |     239246 |     253373
>     353 |     186104 |     216697 |     236952 |     253034
>     397 |     183507 |     216324 |     234764 |     252872
>
> As can be seen, the patched version degrades much more slowly than
> master (or, with 1GB of shared buffers on the older processor, does
> not degrade at all).
>
> PS.
>
> There is room for further improvement:
> - the buffer manager's freelist could be partitioned
> - dynahash's freelists could be sized/aligned to the CPU cache line
> - in fact, there is no need for dynahash at all; it would be better
>   to make a custom hash table using BufferDescs as entries, since
>   BufferDesc has spare space for a next link and the hash value
>
> regards,
> Yura Sokolov
> y.soko...@postgrespro.ru
> funny.fal...@gmail.com

Hi,

The improvement is impressive.
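To check my understanding of the core change, here is a minimal toy
sketch of the new eviction order. It is only a model under my
assumptions: plain pthread mutexes stand in for the partition LWLocks,
and the helpers (buf_table_delete, buf_table_insert,
put_on_buffer_freelist) are made-up stand-ins for the buf_table.c
entry points, so this shows the shape of the idea rather than the
patch's code:

    #include <pthread.h>

    #define NUM_PARTITIONS 128

    /* one lock per mapping partition; assume they are initialised elsewhere */
    static pthread_mutex_t partition_lock[NUM_PARTITIONS];

    /* made-up stand-ins for the hash-table operations */
    static void buf_table_delete(int partition) { (void) partition; }
    static int  buf_table_insert(int partition) { (void) partition; return -1; }
    static void put_on_buffer_freelist(void) { }

    static void
    evict_and_reuse(int old_partition, int new_partition)
    {
        /* The victim buffer is pinned, so nobody else can evict it. */

        pthread_mutex_lock(&partition_lock[old_partition]);
        buf_table_delete(old_partition);    /* tag cleared: unreachable now */
        pthread_mutex_unlock(&partition_lock[old_partition]);

        /*
         * The old lock is dropped BEFORE the new one is taken, so a
         * backend never holds two partition locks at once.
         */
        pthread_mutex_lock(&partition_lock[new_partition]);
        if (buf_table_insert(new_partition) >= 0)
        {
            /*
             * Someone inserted the same tag concurrently; the victim
             * goes back to the buffer manager's freelist.
             */
            put_on_buffer_freelist();
        }
        pthread_mutex_unlock(&partition_lock[new_partition]);
    }

The key property is that a backend holds at most one partition lock at
any moment, so the cross-partition dependency net described above
cannot form.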
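Likewise, if I read the dynahash part right, the freelist counting
boils down to something like the following sketch. Only the counter is
shown (the real FreeListData also carries its mutex and free-entry
list, and the MAX_NENTRIES assertions discussed below are omitted); it
uses pg_atomic_uint32 and the fetch-add/fetch-sub primitives from
port/atomics.h, with the function names following the patch:

    #include "postgres.h"
    #include "port/atomics.h"

    /* sketch of the counter inside dynahash's per-partition freelist */
    typedef struct FreeListData
    {
        pg_atomic_uint32 nentries;  /* previously kept under the freelist lock */
    } FreeListData;

    static inline void
    free_list_increment_nentries(FreeListData *fl)
    {
        /* lock-free bump: no need to take the freelist lock just to count */
        pg_atomic_fetch_add_u32(&fl->nentries, 1);
    }

    static inline void
    free_list_decrement_nentries(FreeListData *fl)
    {
        pg_atomic_fetch_sub_u32(&fl->nentries, 1);
    }

With the counter atomic, hash_update_hash_key() can also keep nentries
consistent when the old and new keys fall into different freelist
partitions, by decrementing one partition's counter and incrementing
the other's, which addresses the pre-existing bug noted in the remark
above.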
For BufTableFreeDeleted(), since it has only one caller, maybe that
caller could invoke hash_return_to_freelist() directly.

For free_list_decrement_nentries():

+    Assert(hctl->freeList[freelist_idx].nentries.value < MAX_NENTRIES);

Is the assertion necessary? There is a similar assertion in
free_list_increment_nentries() which would already maintain
hctl->freeList[freelist_idx].nentries.value <= MAX_NENTRIES.

Cheers