On Thu, Nov 19, 2020 at 12:37 PM tsunakawa.ta...@fujitsu.com
<tsunakawa.ta...@fujitsu.com> wrote:
>
> From: Andres Freund <and...@anarazel.de>
> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum,
> > +                              int nforks, BlockNumber *nForkBlocks,
> > +                              BlockNumber *firstDelBlock)
> > ...
> >     /* Check that it is in the buffer pool. If not, do nothing. */
> >     LWLockAcquire(bufPartitionLock, LW_SHARED);
> >     buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> >     bufHdr = GetBufferDescriptor(buf_id);
> >
> >     buf_state = LockBufHdr(bufHdr);
> >
> >     if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> >         bufHdr->tag.forkNum == forkNum[i] &&
> >         bufHdr->tag.blockNum >= firstDelBlock[i])
> >         InvalidateBuffer(bufHdr);   /* releases spinlock */
> >     else
> >         UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition
> > lock, and have done a lookup in the mapping table. Why are we then
> > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > holding the buffer header lock, which is essentially a spinlock, so
> > should only ever be held for very short portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > but there the situation is different: We haven't done a buffer mapping
> > lookup, and we don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after
> the hash table has been looked up.  As an aside, InvalidateBuffer()
> requires the caller to hold the buffer header spinlock and doesn't hold
> the buffer partition lock.
>
This answers the second part of the question, but what about the first
part? (We hold a buffer partition lock, and have done a lookup in the
mapping table. Why are we then rechecking the
relfilenode/fork/blocknum?) I think we don't need such a check; rather,
we can have an Assert corresponding to that if-condition in the patch
(a rough sketch of what I mean is at the end of this mail). I
understand it is safe to compare relfilenode/fork/blocknum, but it
might confuse readers of the code.

I have started making minor edits to the patch, especially planning to
write up a theory of why this optimization is safe, and here is what I
have come up with (a sketch of the corresponding caller logic is also
appended below):

"To remove all the pages of the specified relation forks from the
buffer pool, we need to scan the entire buffer pool, but we can
optimize it by finding the buffers from the BufMapping table provided
we know the exact size of each fork of the relation. The exact size is
required to ensure that we don't leave any buffer for the relation
being dropped, as otherwise the background writer or checkpointer can
lead to a PANIC error while flushing buffers corresponding to files
that don't exist.

To know the exact size, we rely on the size we cache for each fork
during recovery, which limits the optimization to recovery and to
standbys, but we can easily extend it once we have a shared cache for
relation sizes. In recovery, we cache the value returned by the first
lseek(SEEK_END), and future writes keep the cached value up-to-date.
See smgrextend. It is possible that the value of the first lseek is
smaller than the actual number of existing blocks in the file due to
buggy Linux kernels that might not have accounted for the recent write,
but that should be fine because there must not be any buffers after
that file size.

XXX We would make the extra lseek call for the unoptimized paths, but
that is okay because we do it just for the first fork and we anyway
have to scan the entire buffer pool, the cost of which is so high that
the extra lseek call won't make any visible difference. However, we
could use the InRecovery flag to avoid the additional cost, but that
doesn't seem worth it."

Thoughts?

--
With Regards,
Amit Kapila.
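To make the Assert suggestion concrete, here is a rough sketch (an
illustration of the idea, not the patch itself) of what the tail of
that loop in FindAndDropRelFileNodeBuffers() could look like. It
assumes the buffer's tag cannot change between the mapping-table
lookup and LockBufHdr(), which would have to be guaranteed by
higher-level locking on the relation being dropped:

    bufHdr = GetBufferDescriptor(buf_id);

    buf_state = LockBufHdr(bufHdr);

    /*
     * The lookup above was performed under the buffer partition lock,
     * so the tag must still identify a block of the relation fork we
     * are dropping; verify that in assert-enabled builds rather than
     * branching on it.
     */
    Assert(RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
           bufHdr->tag.forkNum == forkNum[i] &&
           bufHdr->tag.blockNum >= firstDelBlock[i]);

    InvalidateBuffer(bufHdr);   /* releases spinlock */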
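And here is a minimal sketch of the caller-side gating logic the
proposed comment describes. The bool "cached" out-parameter of
smgrnblocks() and the variable name smgr_reln are assumptions for
illustration (the stock smgrnblocks() takes only the relation and fork
number); the point is just that the optimized path is taken only when
the exact size of every fork is known from the recovery-time cache:

    BlockNumber nForkBlocks[MAX_FORKNUM + 1];
    int         i;

    for (i = 0; i < nforks; i++)
    {
        bool    cached;     /* hypothetical out-parameter */

        /*
         * During recovery, smgrnblocks() can return the size cached
         * from the first lseek(SEEK_END); smgrextend() keeps that
         * value up-to-date for subsequent writes.
         */
        nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &cached);

        if (!cached)
            break;          /* exact size unknown; can't optimize */
    }

    if (i == nforks)
        FindAndDropRelFileNodeBuffers(rnode, forkNum, nforks,
                                      nForkBlocks, firstDelBlock);
    else
    {
        /*
         * Fall back to the existing behavior: scan the entire buffer
         * pool and invalidate matching buffers one by one.
         */
    }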