Hello, Andres В Пт, 25/02/2022 в 00:04 -0800, Andres Freund пишет: > Hi, > > On 2022-02-21 11:06:49 +0300, Yura Sokolov wrote: > > From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001 > > From: Yura Sokolov <y.soko...@postgrespro.ru> > > Date: Mon, 21 Feb 2022 08:49:03 +0300 > > Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks. > > > > Acquiring two partition locks leads to complex dependency chain that hurts > > at high concurrency level. > > > > There is no need to hold both lock simultaneously. Buffer is pinned so > > other processes could not select it for eviction. If tag is cleared and > > buffer removed from old partition other processes will not find it. > > Therefore it is safe to release old partition lock before acquiring > > new partition lock. > > Yes, the current design is pretty nonsensical. It leads to really absurd stuff > like holding the relation extension lock while we write out old buffer > contents etc. > > > > > + * We have pinned buffer and we are single pinner at the moment so there > > + * is no other pinners. > > Seems redundant. > > > > We hold buffer header lock and exclusive partition > > + * lock if tag is valid. Given these statements it is safe to clear tag > > + * since no other process can inspect it to the moment. > > + */ > > Could we share code with InvalidateBuffer here? It's not quite the same code, > but nearly the same. > > > > + * The usage_count starts out at 1 so that the buffer can survive one > > + * clock-sweep pass. > > + * > > + * We use direct atomic OR instead of Lock+Unlock since no other backend > > + * could be interested in the buffer. But StrategyGetBuffer, > > + * Flush*Buffers, Drop*Buffers are scanning all buffers and locks them > > to > > + * compare tag, and UnlockBufHdr does raw write to state. So we have to > > + * spin if we found buffer locked. > > So basically the first half of of the paragraph is wrong, because no, we > can't?
Logically, there are no backends that could be interesting in the buffer. Physically they do LockBufHdr/UnlockBufHdr just to check they are not interesting. > > + * Note that we write tag unlocked. It is also safe since there is > > always > > + * check for BM_VALID when tag is compared. > > > > */ > > buf->tag = newTag; > > - buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | > > - BM_CHECKPOINT_NEEDED | BM_IO_ERROR | > > BM_PERMANENT | > > - BUF_USAGECOUNT_MASK); > > if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == > > INIT_FORKNUM) > > - buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE; > > + new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE; > > else > > - buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE; > > - > > - UnlockBufHdr(buf, buf_state); > > + new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE; > > > > - if (oldPartitionLock != NULL) > > + buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits); > > + while (unlikely(buf_state & BM_LOCKED)) > > I don't think it's safe to atomic in arbitrary bits. If somebody else has > locked the buffer header in this moment, it'll lead to completely bogus > results, because unlocking overwrites concurrently written contents (which > there shouldn't be any, but here there are)... That is why there is safety loop in the case buf->state were locked just after first optimistic atomic_fetch_or. 99.999% times this loop will not have a job. But in case other backend did lock buf->state, loop waits until it releases lock and retry atomic_fetch_or. > And or'ing contents in also doesn't make sense because we it doesn't work to > actually unset any contents? Sorry, I didn't understand sentence :(( > Why don't you just use LockBufHdr/UnlockBufHdr? This pair makes two atomic writes to memory. Two writes are heavier than one write in this version (if optimistic case succeed). But I thought to use Lock+UnlockBuhHdr instead of safety loop: buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits); if (unlikely(buf_state & BM_LOCKED)) { buf_state = LockBufHdr(&buf->state); UnlockBufHdr(&buf->state, buf_state | new_bits); } I agree this way code is cleaner. Will do in next version. ----- regards, Yura Sokolov