On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi
<horikyota....@gmail.com> wrote:
> At Thu, 22 Oct 2020 14:16:37 +0900 (JST), Kyotaro Horiguchi
> <horikyota....@gmail.com> wrote in
> > smgrtruncate and smgrextend modify that cache from their parameter,
> > not from lseek(). At the very first the value in the cache comes from
> > lseek(), but if nothing other than postgres has changed the file size,
> > I believe we can rely on the cache even with such buggy kernels, if
> > they still exist.
>
> Mmm. Not exact. The requirement here is that we must be certain that
> we don't have a buffer for blocks after the file size known to
> the process. While recovering, if the first lseek() returned a smaller
> size than actual, we cannot have a buffer for the blocks after that
> size. After we have truncated or extended the file, we are certain that
> we don't have a buffer for unknown blocks.

Thanks, I understand now.  Something feels fragile about it, perhaps
because it's not really acting as a "cache" anymore despite its name,
but I see the logic now.  It becomes the authoritative source of
information, even if the kernel decides to make our file smaller
asynchronously.

> > If there's no longer such a buggy kernel, we can rely on lseek() only
> > when InRecovery. If we had synchronized file size cache we could rely
> > on the cache even while !InRecovery. (I'm not sure about how vacuum
> > affects, though.)

Perhaps the buggy kernel of 2006 is actually Linux working as designed
according to its philosophy on ejecting dirty buffers on writeback
failure (and apparently adjusting the size at the same time).  At least
in 2020 it'll tell us about the problem that caused that when we next
perform an operation that reads the error counter, but in the case of a
relation we're dropping -- the use case in this thread -- that won't
happen!  (I mean, something else will probably tell you your system is
toast pretty soon, but this particular condition may be undetected).

I think a synchronised file size cache wouldn't be enough to use this
trick outside the recovery process, because the initial value would
come from a call to lseek(), but unlike recovery, that wouldn't happen
*before* we start putting pages in the buffer pool.  Also, if we one
day have a size-limited relcache, even recovery could get into trouble,
if it evicts the RelationData that holds the authoritative nblocks
value.
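
To make the "authoritative cache" idea concrete, here is a toy model in
Python.  It is only a sketch and not PostgreSQL code: CachedRelSize,
its methods and demo() are made-up names, and the ftruncate() call in
the middle of demo() merely simulates something shrinking the file
behind the process's back.  The point it illustrates is that only the
first answer comes from lseek(); after that, extend and truncate set
the value from their own parameters.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size


class CachedRelSize:
    """Toy model (hypothetical) of a per-process cached block count."""

    def __init__(self, fd):
        self.fd = fd
        self.nblocks = None  # unknown until someone asks

    def get_nblocks(self):
        # Only the very first answer comes from lseek(); every later
        # answer is served from the cached value.
        if self.nblocks is None:
            self.nblocks = os.lseek(self.fd, 0, os.SEEK_END) // BLCKSZ
        return self.nblocks

    def extend(self, blkno):
        # Write one zeroed block at blkno, then set the cached size from
        # the parameter rather than asking the kernel again.
        os.lseek(self.fd, blkno * BLCKSZ, os.SEEK_SET)
        os.write(self.fd, b"\0" * BLCKSZ)
        self.nblocks = blkno + 1

    def truncate(self, nblocks):
        # Likewise, the new size comes from the parameter.
        os.ftruncate(self.fd, nblocks * BLCKSZ)
        self.nblocks = nblocks


def demo():
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.ftruncate(fd, 4 * BLCKSZ)       # file starts with 4 blocks
        rel = CachedRelSize(fd)
        first = rel.get_nblocks()          # comes from lseek()
        rel.extend(4)                      # cached size set from parameter
        os.ftruncate(fd, 2 * BLCKSZ)       # simulate the file shrinking
        cached = rel.get_nblocks()         # cached value is unchanged
        on_disk = os.lseek(fd, 0, os.SEEK_END) // BLCKSZ
        return first, cached, on_disk
```

Here demo() returns the first lseek()-derived size, the cached size
after the simulated shrink, and what lseek() reports at that point.
The cached value and the kernel's answer disagree, and the process
keeps trusting the former, which is the property the recovery argument
above depends on.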