Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.soko...@postgrespro.ru> wrote in
> > Hello, all.
> >
> > I thought about patch simplification, and tested a version
> > without any BufTable and dynahash API changes at all.
> >
> > It performs surprisingly well. It is just a bit worse
> > than v1 since there is more contention around dynahash's
> > freelist, but most of the improvement remains.
> >
> > I'll finish benchmarking and will attach graphs with the
> > next message. The patch is attached here.
>
> Thanks for the new patch. The patch as a whole looks fine to me. But
> some comments need to be revised.
Thank you for the review and remarks.

> > (existing comments)
> > * To change the association of a valid buffer, we'll need to have
> > * exclusive lock on both the old and new mapping partitions.
> ...
> > * Somebody could have pinned or re-dirtied the buffer while we were
> > * doing the I/O and making the new hashtable entry.  If so, we can't
> > * recycle this buffer; we must undo everything we've done and start
> > * over with a new victim buffer.
>
> We no longer take a lock on the new partition and have no new hash
> entry (if others have not yet done) at this point.

Fixed.

> + * Clear out the buffer's tag and flags.  We must do this to ensure that
> + * linear scans of the buffer array don't think the buffer is valid.  We
>
> The reason we can clear out the tag is it's safe to use the victim
> buffer at this point.  This comment needs to mention that reason.

Tried to describe that.

> + *
> + * Since we are single pinner, there should no be PIN_COUNT_WAITER or
> + * IO_IN_PROGRESS (flags that were not cleared in previous code).
> + */
> + Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
>
> It seems to be a test for potential bugs in other functions.  As the
> comment says, we are sure that no other processes are pinning the
> buffer, and the existing code doesn't seem to care about that
> condition.  Is it really needed?

OK, I agree this check is excessive. These two flags were not cleared in
the previous code, and I didn't understand why. Probably it is just a
historical accident.

> + /*
> + * Try to make a hashtable entry for the buffer under its new tag. This
> + * could fail because while we were writing someone else allocated another
>
> The most significant point of this patch is the reason that the victim
> buffer is protected from stealing until it is set up for the new tag.  I
> think we need an explanation about the protection here.

I don't clearly understand what you mean :( I would appreciate your
suggestion for this comment.

> + * buffer for the same block we want to read in. Note that we have not yet
> + * removed the hashtable entry for the old tag.
>
> Since we have removed the hash table entry for the old tag at this
> point, the comment got wrong.

Thanks, changed.

> + * the first place. First, give up the buffer we were planning to use
> + * and put it to free lists.
> ..
> + StrategyFreeBuffer(buf);
>
> This is one downside of this patch.  But it seems to me that the odds
> are low that many buffers are freed in a short time by this logic.  By
> the way, it would be better if the sentence starting with "First" had a
> separate comment section.

Split the comment.

> (existing comment)
> | * Okay, it's finally safe to rename the buffer.
>
> We don't "rename" the buffer here.  And the safety is already
> established at the end of the oldPartitionLock section.  So it would
> be just something like "Now allocate the victim buffer for the new
> tag"?

Changed to "Now it is safe to use victim buffer for new tag."

There is also a tiny code change at block reuse finalization: instead of
a LockBufHdr+UnlockBufHdr pair I use a single atomic_fetch_or protected
with WaitBufHdrUnlocked. I've tried to explain its safety; please check
it.
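To make that change easier to review without applying the patch, here is
a minimal sketch of both finalization variants. It is my illustration
only: it borrows bufmgr.c internals (LockBufHdr, UnlockBufHdr,
WaitBufHdrUnlocked, pg_atomic_fetch_or_u32) and is not compilable on its
own.

    /*
     * Variant A: straightforward finalization -- take the buffer header
     * spinlock just to OR the new flag bits into the state word.
     */
    buf_state = LockBufHdr(buf);
    buf_state |= new_bits;
    UnlockBufHdr(buf, buf_state);

    /*
     * Variant B (what v3 does): a single atomic OR.  No other backend can
     * legitimately modify this buffer (we hold the only pin), but scanners
     * such as StrategyGetBuffer may lock the header to inspect the tag,
     * and UnlockBufHdr does a raw write of the whole state word -- a locker
     * that read the state before our OR would silently erase our bits on
     * unlock.  So when the OR sees BM_LOCKED, wait and re-apply it.
     */
    buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
    while (unlikely(buf_state & BM_LOCKED))
    {
        WaitBufHdrUnlocked(buf);
        buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
    }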
Benchmarks:
- base point is 6ce16088bfed97f9
- notebook with i7-1165G7 and server with Xeon 8354H (1 & 2 sockets)
- pgbench select-only, scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB
- hugepages enabled
- second-best result from five runs; numbers are throughput (pgbench tps)

Notebook:

  conns |   master   |  patch_v3  | master 1G  | patch_v3 1G
--------+------------+------------+------------+------------
      1 |      29508 |      29481 |      31774 |      32305
      2 |      57139 |      56694 |      63393 |      62968
      3 |      89759 |      90861 |     101873 |     102399
      5 |     133491 |     134573 |     145451 |     145009
      7 |     156739 |     155832 |     164633 |     164562
     17 |     216863 |     216379 |     251923 |     251017
     27 |     209532 |     209802 |     244202 |     243709
     53 |     212615 |     213552 |     248107 |     250317
     83 |     214446 |     218230 |     252414 |     252337
    107 |     211276 |     217109 |     252762 |     250328
    139 |     208070 |     214265 |     248350 |     249684
    163 |     206764 |     214594 |     247369 |     250323
    191 |     205478 |     213511 |     244597 |     246877
    211 |     200976 |     212976 |     244035 |     245032
    239 |     196588 |     211519 |     243897 |     245055
    271 |     195813 |     209631 |     237457 |     242771
    307 |     192724 |     208074 |     237658 |     241759
    353 |     187847 |     207189 |     234548 |     239008
    397 |     186942 |     205317 |     230465 |     238782

I don't know why the numbers have changed since my first letter ))
But there is still no slowdown, and a measurable gain at 128MB
shared_buffers.

Xeon, 1 socket:

  conns |   master   |  patch_v3  | master 1G  | patch_v3 1G
--------+------------+------------+------------+------------
      1 |      41975 |      41799 |      52898 |      52715
      2 |      77693 |      77531 |      97571 |      98547
      3 |     114713 |     114533 |     142709 |     143579
      5 |     188898 |     187241 |     239322 |     236682
      7 |     261516 |     260249 |     329119 |     328562
     17 |     521821 |     518981 |     672390 |     662987
     27 |     555487 |     557019 |     674630 |     675703
     53 |     868213 |     897097 |    1190734 |    1202575
     83 |     868232 |     881705 |    1164997 |    1157764
    107 |     850477 |     855169 |    1140597 |    1128027
    139 |     816311 |     826756 |    1101471 |    1096197
    163 |     794788 |     805946 |    1078445 |    1071535
    191 |     765934 |     783209 |    1059497 |    1039936
    211 |     738656 |     786171 |    1083356 |    1049025
    239 |     713124 |     837040 |    1104629 |    1125969
    271 |     692138 |     847741 |    1094432 |    1131968
    307 |     682919 |     847939 |    1086306 |    1124649
    353 |     679449 |     844596 |    1071482 |    1125980
    397 |     676217 |     833009 |    1058937 |    1113496

There is a small but reproducible slowdown at some connection counts
(17, 107-191). Probably it is due to one more atomic write, or perhaps
to other scheduling effects (processes block less on the buffer manager
but compete more on other resources). I could not determine the cause
reliably, because the difference is too small and `perf record` hurts
performance even more at this point. This is the reason I changed the
finalization to an atomic_or instead of a Lock+Unlock pair. The change
helped a bit, but didn't remove the slowdown completely.

Xeon, 2 sockets:

  conns |     m0     |  patch_v3  |   m0 1G    | patch_v3 1G
--------+------------+------------+------------+------------
      1 |      44317 |      43747 |      53920 |      53759
      2 |      81193 |      79976 |      99138 |      99213
      3 |     120755 |     114481 |     148102 |     146494
      5 |     190007 |     187384 |     232078 |     229627
      7 |     258602 |     256657 |     325545 |     322417
     17 |     551814 |     549041 |     692312 |     688204
     27 |     787353 |     787916 |    1023509 |    1020995
     53 |     973880 |     996019 |    1228274 |    1246128
     83 |    1108442 |    1258589 |    1596292 |    1662586
    107 |    1072188 |    1317229 |    1542401 |    1684603
    139 |    1000446 |    1272759 |    1490757 |    1672507
    163 |     967378 |    1224513 |    1461468 |    1660012
    191 |     926010 |    1178067 |    1435317 |    1645886
    211 |     909919 |    1148862 |    1417437 |    1629487
    239 |     895944 |    1108579 |    1393530 |    1616824
    271 |     880545 |    1078280 |    1374878 |    1608412
    307 |     865560 |    1056988 |    1355164 |    1601066
    353 |     857591 |    1033980 |    1330069 |    1586769
    397 |     840374 |    1016690 |    1312257 |    1573376

(In the last table, "m0" is the base point 6ce16088bfed97f9.)

regards,
Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
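For easier review, here is a heavily condensed sketch of the relocation
sequence BufferAlloc() performs with the patch below applied. This is my
paraphrase, not code from the patch: oldTagValid stands for the
BM_TAG_VALID test, and the pin/refcount bookkeeping, the retry loop for
pinned or dirty victims, and the actual I/O are all omitted.

    if (oldTagValid)
        LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);  /* old partition only */

    /* Hide the pinned victim from linear scans, then unhash its old tag. */
    buf_state = LockBufHdr(buf);
    CLEAR_BUFFERTAG(buf->tag);
    buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
    UnlockBufHdr(buf, buf_state);
    if (oldTagValid)
        BufTableDelete(&oldTag, oldHash);

    /* Our pin keeps the victim ours, so the two locks never overlap. */
    if (oldTagValid)
        LWLockRelease(oldPartitionLock);
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);

    buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
    if (buf_id >= 0)
    {
        /* Collision: someone mapped the block first; give the victim back. */
        UnpinBuffer(buf, true);
        StrategyFreeBuffer(buf);
        /* ...pin the found buffer and return it, as on a normal hit... */
    }
    else
    {
        buf->tag = newTag;      /* plus BM_TAG_VALID etc., see the patch */
        LWLockRelease(newPartitionLock);
    }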
From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that
hurts at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is
pinned, so other processes cannot select it for eviction. If the tag is
cleared and the buffer is removed from the old partition, other
processes will not find it. Therefore it is safe to release the old
partition lock before acquiring the new partition lock.

Tags: lwlock_numa
---
 src/backend/storage/buffer/bufmgr.c | 206 ++++++++++++++--------------
 1 file changed, 105 insertions(+), 101 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..bb8b1cd2f4b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1114,6 +1114,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *buf;
 	bool		valid;
 	uint32		buf_state;
+	uint32		new_bits;
 
 	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1288,93 +1289,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		oldHash = BufTableHashCode(&oldTag);
 		oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-		/*
-		 * Must lock the lower-numbered partition first to avoid
-		 * deadlocks.
-		 */
-		if (oldPartitionLock < newPartitionLock)
-		{
-			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-		}
-		else if (oldPartitionLock > newPartitionLock)
-		{
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-		}
-		else
-		{
-			/* only one partition, only one lock */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-		}
+		LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 	}
 	else
 	{
-		/* if it wasn't valid, we need only the new partition */
-		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 		/* remember we have no old-partition lock or tag */
 		oldPartitionLock = NULL;
 		/* keep the compiler quiet about uninitialized variables */
 		oldHash = 0;
 	}
 
-	/*
-	 * Try to make a hashtable entry for the buffer under its new tag.
-	 * This could fail because while we were writing someone else
-	 * allocated another buffer for the same block we want to read in.
-	 * Note that we have not yet removed the hashtable entry for the old
-	 * tag.
-	 */
-	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-	if (buf_id >= 0)
-	{
-		/*
-		 * Got a collision. Someone has already done what we were about to
-		 * do. We'll just handle this as if it were found in the buffer
-		 * pool in the first place.  First, give up the buffer we were
-		 * planning to use.
-		 */
-		UnpinBuffer(buf, true);
-
-		/* Can give up that buffer's mapping partition lock now */
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-
-		/* remaining code should match code at top of routine */
-
-		buf = GetBufferDescriptor(buf_id);
-
-		valid = PinBuffer(buf, strategy);
-
-		/* Can release the mapping lock as soon as we've pinned it */
-		LWLockRelease(newPartitionLock);
-
-		*foundPtr = true;
-
-		if (!valid)
-		{
-			/*
-			 * We can only get here if (a) someone else is still reading
-			 * in the page, or (b) a previous read attempt failed.  We
-			 * have to wait for any active read attempt to finish, and
-			 * then set up our own read attempt if the page is still not
-			 * BM_VALID.  StartBufferIO does it all.
-			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer
-				 * must have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
-		}
-
-		return buf;
-	}
-
 	/*
 	 * Need to lock the buffer header too in order to change its tag.
 	 */
@@ -1382,52 +1306,132 @@
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry. If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O. If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid.  We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We have the buffer pinned, and we are the single pinner at the moment,
+	 * so there are no other pinners.  We hold the buffer header lock and the
+	 * exclusive partition lock if the tag is valid.  Given these statements
+	 * it is safe to clear the tag since no other process can inspect it at
+	 * the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete the old tag from the hash table if it was valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag.  This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in.  In that case we will
+	 * have to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision.  Someone has already done what we were about to
+		 * do.  We'll just handle this as if it were found in the buffer pool
+		 * in the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * the free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up
+			 * our own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use a direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer.  But StrategyGetBuffer,
+	 * Flush*Buffers and Drop*Buffers scan all buffers and lock them to
+	 * compare tags, and UnlockBufHdr does a raw write to the state.  So we
+	 * have to spin if we find the buffer locked.
+	 *
+	 * Note that we write the tag unlocked.  It is also safe since there is
+	 * always a check for BM_VALID when the tag is compared.
 	 */
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
 
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))
 	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
+		WaitBufHdrUnlocked(buf);
+		buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
 	}
 
 	LWLockRelease(newPartitionLock);
-- 
2.35.1
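A note for reviewers on the "we write the tag unlocked" comment in the
last hunk. As far as I understand, the pattern it relies on is that
every scanner takes the buffer header lock and checks the valid bits
before trusting the tag, roughly like this (my illustration with a
placeholder targetTag, not code from the tree):

    buf_state = LockBufHdr(bufHdr);
    if ((buf_state & BM_TAG_VALID) &&
        BUFFERTAGS_EQUAL(bufHdr->tag, targetTag))
    {
        /* the buffer really holds targetTag; it is safe to act on it */
    }
    UnlockBufHdr(bufHdr, buf_state);

So a tag written without the header lock is never acted upon before the
BM_TAG_VALID/BM_VALID bits become visible.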
<<attachment: res.zip>>