It's been tedious to get it exactly right but I think I got it.  FYI, I was 
delayed because today we had yet another customer hit this:  'redo max offset' 
error.  The system crashed as a number of autovacuums and a checkpoint happened 
and then the REDO failure.

Two tiny code changes:
bufmgr.c:bufferSync()   pg_usleep(10000000);  // At begin of function

smgr.c:smgrtruncate():  Add the following just after CacheInvalidateSmgr()
if (forknum == MAIN_FORKNUM && nblocks == 0) {
pg_usleep(40000000);
{ char *cp=NULL; *cp=13; }
}

Now for the heavily commented SQL repro.  It will require that you execute a 
checkpoint in another session when instructed by the repro.sql script.  You 
have 4 seconds to do that. The repro script explains exactly what must happen.

-----------------------------------------------------------
create table t (c char(1111));
alter table t alter column c set storage plain;
-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO Delete would ignore a non-existant page: XLogReadBufferForRedoExtended: 
return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty 
page.
insert into t values ('1'), ('2');
delete from t;
checkpoint; -- Checkpoint the empty page to disk.
-- This insert should be before the next checkpoint 'start'. I don't want to 
replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay 
and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');
-- Checkpoint needs to start in another session. However, I need to stall the 
checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum 
below.
select 'Please start checkpoint in another session';
select pg_sleep(4);
-- Below is the problematic delete.
-- It succeeds now(online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios 
depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
-- the deletes. This leads to BLK_NEEDS_REDO for the deletes. It works
-- because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
-- In that case BLK_DONE will happen and there'll be no problem. LSN check.
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
-- smgrtruncate->DropRelFileNodeBuffers() which invalidates the dirty page.
-- If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
-- happens and we skip the redo of the delete. So the skipped dirty write is OK.
-- The problme happens if we crash after the 2nd checkpoint completes
-- but before the physical truncate 'mdtruncate()'.
delete from t;
-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate. If the physical
-- truncate happens the BLK_NOTFOUND will be returned and the Delete REDO 
skipped.
vacuum t;
--------------------------------------------------------


>     On November 10, 2019 at 11:51 PM Michael Paquier < mich...@paquier.xyz 
> mailto:mich...@paquier.xyz > wrote:
> 
> 
>     On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
> 
>         > >         I repro'ed on PG11 and PG10 STABLE but several months old.
> >         I looked at 6d05086 but it doesn't address the core issue.
> > 
> >         DropRelFileNodeBuffers prevents the checkpoint from writing all
> >         needed dirty pages for any REDO's that exist BEFORE the truncate.
> >         If we crash after a checkpoint but before the physical truncate then
> >         the REDO will need to replay the operation against the dirty page
> >         that the Drop invalidated.
> > 
> >     >     I am beginning to look at this thread more seriously, and I'd 
> > like to
>     first try to reproduce that by myself. Could you share the steps you
>     used to do that? This includes any manual sleep calls you may have
>     added, the timing of the crash, manual checkpoints, debugger
>     breakpoints, etc. It may be possible to extract some more generic
>     test from that.
>     --
>     Michael
> 

Reply via email to