On 14/04/2020 22:04, Teja Mupparti wrote:
Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate,

+1

but I just want to emphasize the need for reversing the order of the
dropping the buffers and the truncation.

  Repro details (when full page write = off)

          1) Page on disk has empty LP 1, Insert into page LP 1
          2) checkpoint START (Recovery REDO eventually starts here)
          3) Delete all rows on the page (page is empty now)
          4) Autovacuum kicks in and truncates the pages
                 DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
          5) Checkpoint completes
          6) Crash
         7) smgrtruncate - Not reached (this is where we do the physical truncate)

  Now the crash-recovery starts

         Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

Doing recovery, truncate is even not reached, a WAL replay of the truncation will happen in the future but the recovery fails (repeatedly) even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and truncating the file would open a different issue:

  1) Page on disk has empty LP 1, Insert into page LP 1
  2) checkpoint START (Recovery REDO eventually starts here)
  3) Delete all rows on the page (page is empty now)
  4) Autovacuum kicks in and starts truncating
  5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before truncating the file to fix that, but I'm wary of that approach. Firstly, it requires scanning the buffers that are dropped twice, which can take a long time. I remember that people have already complained that DropRelFileNodeBuffers() is slow, when it has to scan all the buffers once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems bad. For starters, because you're not holding buffer's I/O lock, I believe the checkpointer would busy-wait on the buffers until the truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from completing, until all in-progress truncations have completed. We have a mechanism to wait out in-progress commits at the beginning of a checkpoint, right after the redo point has been established. See comments around the GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could have a similar mechanism to wait out the truncations before *completing* a checkpoint.

- Heikki


Reply via email to