Hi, Yura. At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.soko...@postgrespro.ru> wrot e in > Ok, I got access to stronger server, did the benchmark, found weird > things, and so here is new version :-)
Thanks for the new version and benchmarking. > First I found if table size is strictly limited to NBuffers and FIXED, > then under high concurrency get_hash_entry may not find free entry > despite it must be there. It seems while process scans free lists, other > concurrent processes "moves etry around", ie one concurrent process > fetched it from one free list, other process put new entry in other > freelist, and unfortunate process missed it since it tests freelists > only once. StrategyGetBuffer believes that entries don't move across freelists and it was true before this patch. > Second, I confirm there is problem with freelist spreading. > If I keep entry's freelist_idx, then one freelist is crowded. > If I use new entry's freelist_idx, then one freelist is emptified > constantly. Perhaps it is what I saw before. I'm not sure about the details of how that happens, though. > Third, I found increased concurrency could harm. When popular block > is evicted for some reason, then thundering herd effect occures: > many backends wants to read same block, they evict many other > buffers, but only one is inserted. Other goes to freelist. Evicted > buffers by itself reduce cache hit ratio and provocates more > work. Old version resists this effect by not removing old buffer > before new entry is successfully inserted. Nice finding. > To fix this issues I made following: > > # Concurrency > > First, I limit concurrency by introducing other lwlocks tranche - > BufferEvict. It is 8 times larger than BufferMapping tranche (1024 vs > 128). > If backend doesn't find buffer in buffer table and wants to introduce > it, it first calls > LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE) > If lock were acquired, then it goes to eviction and replace process. > Otherwise, it waits lock to be released and repeats search. > > This greately improve performance for > 400 clients in pgbench. So the performance difference between the existing code and v11 is the latter has a collision cross section eight times smaller than the former? + * Prevent "thundering herd" problem and limit concurrency. this is something like pressing accelerator and break pedals at the same time. If it improves performance, just increasing the number of buffer partition seems to work? It's also not great that follower backends runs a busy loop on the lock until the top-runner backend inserts the new buffer to the buftable then releases the newParititionLock. > I tried other variant as well: > - first insert entry with dummy buffer index into buffer table. > - if such entry were already here, then wait it to be filled. > - otherwise find victim buffer and replace dummy index with new one. > Wait were done with shared lock on EvictPartitionLock as well. > This variant performed quite same. This one looks better to me. Since a partition can be shared by two or more new-buffers, condition variable seems to work better here... > Logically I like that variant more, but there is one gotcha: > FlushBuffer could fail with elog(ERROR). Therefore then there is > a need to reliable remove entry with dummy index. Perhaps UnlockBuffers can do that. > And after all, I still need to hold EvictPartitionLock to notice > waiters. > I've tried to use ConditionalVariable, but its performance were much > worse. How many CVs did you use? > # Dynahash capacity and freelists. > > I returned back buffer table initialization: > - removed FIXES_SIZE restriction introduced in previous version Mmm. I don't see v10 in this list and v9 doesn't contain FIXES_SIZE.. > - returned `NBuffers + NUM_BUFFER_PARTITIONS`. > I really think, there should be more spare items, since almost always > entry_alloc is called at least once (on 128MB shared_buffers). But > let keep it as is for now. Maybe s/entry_alloc/element_alloc/ ? :p I see it with shared_buffers=128kB (not MB) and pgbench -i on master. The required number of elements are already allocaed to freelists at hash creation. So the reason for the call is imbalanced use among freelists. Even in that case other freelists holds elements. So we don't need to expand the element size. > `get_hash_entry` were changed to probe NUM_FREELISTS/4 (==8) freelists > before falling back to `entry_alloc`, and probing is changed from > linear to quadratic. This greately reduces number of calls to > `entry_alloc`, so more shared memory left intact. And I didn't notice > large performance hit from. Probably there is some, but I think it is > adequate trade-off. I don't think that causes significant performance hit, but I don't understand how it improves freelist hit ratio other than by accident. Could you have some reasoning for it? By the way the change of get_hash_entry looks something wrong. If I understand it correctly, it visits num_freelists/4 freelists at once, then tries element_alloc. If element_alloc() fails (that must happen), it only tries freeList[freelist_idx] and gives up, even though there must be an element in other 3/4 freelists. > `free_reused_entry` now returns entry to random position. It flattens > free entry's spread. Although it is not enough without other changes > (thundering herd mitigation and probing more lists in get_hash_entry). If "thudering herd" means "many backends rush trying to read-in the same page at once", isn't it avoided by the change in BufferAlloc? I feel the random returning method might work. I want to get rid of the randomness here but I don't come up with a better way. Anyway the code path is used only by buftable so it doesn't harm generally. > # Benchmarks # Thanks for benchmarking!! > Benchmarked on two socket Xeon(R) Gold 5220 CPU @2.20GHz > 18 cores per socket + hyper-threading - upto 72 virtual core total. > turbo-boost disabled > Linux 5.10.103-1 Debian. > > pgbench scale 100 simple_select + simple select with 3 keys (sql file > attached). > > shared buffers 128MB & 1GB > huge_pages=on > > 1 socket > conns | master | patch-v11 | master 1G | patch-v11 1G > --------+------------+------------+------------+------------ > 1 | 27882 | 27738 | 32735 | 32439 > 2 | 54082 | 54336 | 64387 | 63846 > 3 | 80724 | 81079 | 96387 | 94439 > 5 | 134404 | 133429 | 160085 | 157399 > 7 | 185977 | 184502 | 219916 | 217142 v11+128MB degrades above here.. > 17 | 335345 | 338214 | 393112 | 388796 > 27 | 393686 | 394948 | 447945 | 444915 > 53 | 572234 | 577092 | 678884 | 676493 > 83 | 558875 | 561689 | 669212 | 655697 > 107 | 553054 | 551896 | 654550 | 646010 > 139 | 541263 | 538354 | 641937 | 633840 > 163 | 532932 | 531829 | 635127 | 627600 > 191 | 524647 | 524442 | 626228 | 617347 > 211 | 521624 | 522197 | 629740 | 613143 v11+1GB degrades above here.. > 239 | 509448 | 554894 | 652353 | 652972 > 271 | 468190 | 557467 | 647403 | 661348 > 307 | 454139 | 558694 | 642229 | 657649 > 353 | 446853 | 554301 | 635991 | 654571 > 397 | 441909 | 549822 | 625194 | 647973 > > 1 socket 3 keys > > conns | master | patch-v11 | master 1G | patch-v11 1G > --------+------------+------------+------------+------------ > 1 | 16677 | 16477 | 22219 | 22030 > 2 | 32056 | 31874 | 43298 | 43153 > 3 | 48091 | 47766 | 64877 | 64600 > 5 | 78999 | 78609 | 105433 | 106101 > 7 | 108122 | 107529 | 148713 | 145343 v11+128MB degrades above here.. > 17 | 205656 | 209010 | 272676 | 271449 > 27 | 252015 | 254000 | 323983 | 323499 v11+1GB degrades above here.. > 53 | 317928 | 334493 | 446740 | 449641 > 83 | 299234 | 327738 | 437035 | 443113 > 107 | 290089 | 322025 | 430535 | 431530 > 139 | 277294 | 314384 | 422076 | 423606 > 163 | 269029 | 310114 | 416229 | 417412 > 191 | 257315 | 306530 | 408487 | 416170 > 211 | 249743 | 304278 | 404766 | 416393 > 239 | 243333 | 310974 | 397139 | 428167 > 271 | 236356 | 309215 | 389972 | 427498 > 307 | 229094 | 307519 | 382444 | 425891 > 353 | 224385 | 305366 | 375020 | 423284 > 397 | 218549 | 302577 | 364373 | 420846 > > 2 sockets > > conns | master | patch-v11 | master 1G | patch-v11 1G > --------+------------+------------+------------+------------ > 1 | 27287 | 27631 | 32943 | 32493 > 2 | 52397 | 54011 | 64572 | 63596 > 3 | 76157 | 80473 | 93363 | 93528 > 5 | 127075 | 134310 | 153176 | 149984 > 7 | 177100 | 176939 | 216356 | 211599 > 17 | 379047 | 383179 | 464249 | 470351 > 27 | 545219 | 546706 | 664779 | 662488 > 53 | 728142 | 728123 | 857454 | 869407 > 83 | 918276 | 957722 | 1215252 | 1203443 v11+1GB degrades above here.. > 107 | 884112 | 971797 | 1206930 | 1234606 > 139 | 822564 | 970920 | 1167518 | 1233230 > 163 | 788287 | 968248 | 1130021 | 1229250 > 191 | 772406 | 959344 | 1097842 | 1218541 > 211 | 756085 | 955563 | 1077747 | 1209489 > 239 | 732926 | 948855 | 1050096 | 1200878 > 271 | 692999 | 941722 | 1017489 | 1194012 > 307 | 668241 | 920478 | 994420 | 1179507 > 353 | 642478 | 908645 | 968648 | 1174265 > 397 | 617673 | 893568 | 950736 | 1173411 > > 2 sockets 3 keys > > conns | master | patch-v11 | master 1G | patch-v11 1G > --------+------------+------------+------------+------------ > 1 | 16722 | 16393 | 20340 | 21813 > 2 | 32057 | 32009 | 39993 | 42959 > 3 | 46202 | 47678 | 59216 | 64374 > 5 | 78882 | 72002 | 98054 | 103731 > 7 | 103398 | 99538 | 135098 | 135828 v11+128MB degrades above here.. > 17 | 205863 | 217781 | 293958 | 299690 > 27 | 283526 | 290539 | 414968 | 411219 > 53 | 336717 | 356130 | 460596 | 474563 > 83 | 307310 | 342125 | 419941 | 469989 > 107 | 294059 | 333494 | 405706 | 469593 > 139 | 278453 | 328031 | 390984 | 470553 > 163 | 270833 | 326457 | 384747 | 470977 > 191 | 259591 | 322590 | 376582 | 470335 > 211 | 263584 | 321263 | 375969 | 469443 > 239 | 257135 | 316959 | 370108 | 470904 > 271 | 251107 | 315393 | 365794 | 469517 > 307 | 246605 | 311585 | 360742 | 467566 > 353 | 236899 | 308581 | 353464 | 466936 > 397 | 249036 | 305042 | 344673 | 466842 > > I skipped v10 since I used it internally for variant > "insert entry with dummy index then search victim". Up to about 15%(?) of gain is great. I'm not sure it is okay that it seems to slow by about 1%.. Ah, I see. regards. -- Kyotaro Horiguchi NTT Open Source Software Center