At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <amit.kapil...@gmail.com> wrote in > On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi > <horikyota....@gmail.com> wrote: > > > > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <amit.kapil...@gmail.com> > > wrote in > > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi > > > <horikyota....@gmail.com> wrote: > > By the way I'm not sure that actually happens, but if one smgrextend > > call exnteded the relation by two or more blocks, the cache is > > invalidated and succeeding smgrnblocks returns lseek()'s result. > > > > Can you think of any such case? I think in recovery we use > XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page > which seems to be extending page-by-page but there could be some case > where that is not true. One idea is to run regressions and add an > Assert to see if we are extending more than a block during recovery.
I agree with you. Actually XLogReadBufferExtended is the only point to read a page while recovery and seems calling ReadBufferWithoutRelcache page by page up to the target page. The only case I found where the cache is invalidated is ALTER TABLE SET TABLESPACE while wal_level=minimal and not during recovery. smgrextend is called without smgrnblocks called at the time. Considering that the behavior of lseek can be a problem only just after extending a file, an assertion in smgrextend seems to be enough. Although, I'm not confident on the diagnosis. --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, if (reln->smgr_cached_nblocks[forknum] == blocknum) reln->smgr_cached_nblocks[forknum] = blocknum + 1; else + { + /* + * DropRelFileNodeBuffers relies on the behavior that nblocks cache + * won't be invalidated by file extension while recoverying. + */ + Assert(!InRecovery); reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber; + } } > > Don't > > we need to guarantee the cache to be valid while recovery? > > > > One possibility could be that we somehow detect that the value we are > using is cached one and if so then only do this optimization. I basically like this direction. But I'm not sure the additional parameter for smgrnblocks is acceptable. But on the contrary, it might be a better design that DropRelFileNodeBuffers gives up the optimization when smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber. @@ -544,9 +551,12 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, /* * smgrnblocks() -- Calculate the number of blocks in the * supplied relation. + * + * Returns InvalidBlockNumber if must_accurate is true and smgr_cached_nblocks + * is not available. */ BlockNumber -smgrnblocks(SMgrRelation reln, ForkNumber forknum) +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool must_accurate) { BlockNumber result; @@ -561,6 +571,17 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum) reln->smgr_cached_nblocks[forknum] = result; + /* + * We cannot believe the result from smgr_nblocks is always accurate + * because lseek of buggy Linux kernels doesn't account for a recent + * write. However, we can rely on the result from lseek while recovering + * because the first call to this function is not happen just after a file + * extension. Return values on subsequent calls return cached nblocks, + * which should be accurate during recovery. + */ + if (!InRecovery && must_accurate) + return InvalidBlockNumber; + return result; } regards. -- Kyotaro Horiguchi NTT Open Source Software Center