Re: Corruption during WAL replay

Heikki Linnakangas Mon, 17 Aug 2020 04:06:44 -0700

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate,

+1

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

  Repro details (when full_page_writes = off)

          1) Page on disk has empty LP 1, Insert into page LP 1
          2) checkpoint START (Recovery REDO eventually starts here)
          3) Delete all rows on the page (page is empty now)
          4) Autovacuum kicks in and truncates the pages
                  DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
          5) Checkpoint completes
          6) Crash
          7) smgrtruncate - Not reached (this is where we do the physical truncate)

  Now the crash-recovery starts

          Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

  During recovery, the truncate is not even reached: a WAL replay of the truncation will happen in the future, but the recovery fails (repeatedly) even before reaching that point.
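
The sequence above can be mimicked with a toy model. This is only a sketch: the "disk" and "buffers" are plain dicts, the WAL is a list of tuples, and the names redo(), RedoPanic and simulate_crash_between_drop_and_truncate() are made up for illustration. None of it is PostgreSQL code.

```python
# Toy model only: "disk" and "buffers" map block number -> set of used line
# pointers. All names here are hypothetical, not PostgreSQL APIs.

class RedoPanic(Exception):
    """Stands in for the PANIC raised during WAL replay."""


def redo(wal, disk):
    """Replay WAL records against on-disk pages (no full-page images)."""
    for op, blk, lp in wal:
        if op == "delete":
            if lp not in disk.get(blk, set()):
                raise RedoPanic(f"delete of LP {lp} on block {blk}: item not found")
            disk[blk].discard(lp)
        elif op == "truncate":
            # Remove every block at or beyond the truncation point.
            for b in [b for b in disk if b >= blk]:
                del disk[b]


def simulate_crash_between_drop_and_truncate():
    disk = {0: set()}              # 1) page on disk has empty LP 1
    buffers = {0: {1}}             #    insert into LP 1: dirty, only in memory
    wal = []                       # 2) checkpoint START: redo begins here
    buffers[0].discard(1)          # 3) delete all rows on the page
    wal.append(("delete", 0, 1))
    wal.append(("truncate", 0, None))  # 4) truncation is WAL-logged...
    buffers.pop(0)                 #    ...and the dirty buffer is dropped unwritten
    # 5) checkpoint completes, 6) crash, 7) physical truncate never reached
    return disk, wal
```

Replaying the returned WAL against the returned disk raises RedoPanic on the delete record, before the truncate record is ever reached. If the disk instead starts as {0: {1}} (the dirty page had been written), the same WAL replays cleanly.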

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and truncating the file would open a different issue:


  1) Page on disk has empty LP 1, Insert into page LP 1
  2) checkpoint START (Recovery REDO eventually starts here)
  3) Delete all rows on the page (page is empty now)
  4) Autovacuum kicks in and starts truncating
  5) smgrtruncate() truncates the file

  6) checkpoint writes out buffers for pages that were just truncated away, expanding the file again.
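
As a rough illustration of steps 5 and 6, here is a toy model in which the file is a dict of blocks and the checkpoint simply writes every buffer still in the pool. The helper names file_length() and simulate_truncate_before_drop() are invented for this sketch and do not correspond to anything in the PostgreSQL source.

```python
# Toy model only: "disk" maps block number -> page contents, and the file
# length is one past the highest block present. Hypothetical names throughout.

def file_length(disk):
    """Length of the toy file, in blocks."""
    return max(disk) + 1 if disk else 0


def simulate_truncate_before_drop():
    disk = {0: set()}       # one block on disk
    buffers = {0: set()}    # its dirty buffer (page is empty after the delete)

    # 5) The file is truncated first; the buffers are not dropped yet.
    disk.clear()
    length_after_truncate = file_length(disk)

    # 6) A concurrent checkpoint flushes the dirty buffers still in the pool,
    #    which writes block 0 back and extends the file again.
    for blk, page in buffers.items():
        disk[blk] = set(page)

    return length_after_truncate, file_length(disk)
```

In this model simulate_truncate_before_drop() returns (0, 1): the file is empty right after the truncation and one block long again once the checkpoint has written the stale buffer.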

Your patch had a mechanism to mark the buffers as io-in-progress before truncating the file to fix that, but I'm wary of that approach. Firstly, it requires scanning the buffers that are dropped twice, which can take a long time. I remember that people have already complained that DropRelFileNodeBuffers() is slow, when it has to scan all the buffers once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems bad. For starters, because you're not holding buffer's I/O lock, I believe the checkpointer would busy-wait on the buffers until the truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from completing, until all in-progress truncations have completed. We have a mechanism to wait out in-progress commits at the beginning of a checkpoint, right after the redo point has been established. See comments around the GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could have a similar mechanism to wait out the truncations before *completing* a checkpoint.
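
To make the idea concrete, here is a minimal sketch of "snapshot what is in progress, then wait for exactly those to finish", modelled with Python threads. TruncationTracker, begin(), end() and wait_out_current() are hypothetical names for illustration; the real mechanism would live in shared memory and is not shown here.

```python
import threading


class TruncationTracker:
    """Toy stand-in for a shared-memory list of in-progress truncations."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = set()
        self._next_id = 0

    def begin(self):
        """Called by the truncating backend before it drops buffers."""
        with self._cond:
            self._next_id += 1
            self._active.add(self._next_id)
            return self._next_id

    def end(self, token):
        """Called once the file has been physically truncated."""
        with self._cond:
            self._active.discard(token)
            self._cond.notify_all()

    def wait_out_current(self, timeout=None):
        """Called by the checkpointer just before completing the checkpoint.

        Only truncations already in progress at this moment are waited for;
        ones that start later do not delay the checkpoint. Returns False if
        the timeout expired first.
        """
        with self._cond:
            snapshot = set(self._active)
            return self._cond.wait_for(
                lambda: not (snapshot & self._active), timeout)


def demo():
    tracker = TruncationTracker()
    events = []
    token = tracker.begin()          # truncation starts (buffers dropped)

    def finish_truncate():
        events.append("truncate file")
        tracker.end(token)

    worker = threading.Thread(target=finish_truncate)
    # The checkpoint cannot complete while the truncation is pending.
    blocked = not tracker.wait_out_current(timeout=0.05)
    worker.start()
    finished = tracker.wait_out_current(timeout=5)
    worker.join()
    events.append("checkpoint complete")
    return blocked, finished, events
```

In demo(), the first wait times out because the truncation is still pending, and the checkpoint is only recorded as complete after the file truncation has happened.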


- Heikki
