It's been tedious to get this exactly right, but I think I've got it. FYI, I was delayed because today yet another customer hit this 'redo max offset' error. Their system crashed while a number of autovacuums and a checkpoint were running, and then REDO failed.
Two tiny code changes:

bufmgr.c:BufferSync()
    pg_usleep(10000000);    // At the beginning of the function

smgr.c:smgrtruncate(): add the following just after CacheInvalidateSmgr()
    if (forknum == MAIN_FORKNUM && nblocks == 0)
    {
        pg_usleep(40000000);
        { char *cp = NULL; *cp = 13; }    // Deliberate crash before the physical truncate
    }

Now for the heavily commented SQL repro. It requires that you execute a checkpoint in another session when the repro.sql script tells you to. You have 4 seconds to do that. The repro script explains exactly what must happen.

-----------------------------------------------------------
create table t (c char(1111));
alter table t alter column c set storage plain;

-- Make sure there actually is an allocated page 0 and it is empty.
-- REDO Delete would ignore a non-existent page: XLogReadBufferForRedoExtended: return BLK_NOTFOUND;
-- Hopefully two row deletes don't trigger autovacuum and truncate the empty page.
insert into t values ('1'), ('2');
delete from t;
checkpoint; -- Checkpoint the empty page to disk.

-- This insert should be before the next checkpoint 'start'. I don't want to replay it.
-- And, yes, there needs to be another checkpoint completed to skip its replay and start
-- with the replay of the delete below.
insert into t values ('1'), ('2');

-- Checkpoint needs to start in another session. However, I need to stall the checkpoint
-- to prevent it from writing the dirty page to disk until I get to the vacuum below.
select 'Please start checkpoint in another session';
select pg_sleep(4);

-- Below is the problematic delete.
-- It succeeds now (online) because the dirty page has two rows on it.
-- However, with respect to crash recovery there are 3 possible scenarios depending on timing.
-- 1) The ongoing checkpoint might write the page with the two rows on it before
--    the deletes. This leads to BLK_NEEDS_REDO for the deletes. It works
--    because the page read from disk has the rows on it.
-- 2) The ongoing checkpoint might write the page just after the deletes.
--    In that case BLK_DONE will happen and there'll be no problem (LSN check).
-- 3) The checkpoint can fail to write the dirty page because a vacuum can call
--    smgrtruncate->DropRelFileNodeBuffers() which invalidates the dirty page.
--    If smgrtruncate safely completes the physical truncation then BLK_NOTFOUND
--    happens and we skip the redo of the delete. So the skipped dirty write is OK.
--    The problem happens if we crash after the 2nd checkpoint completes
--    but before the physical truncate 'mdtruncate()'.
delete from t;

-- The vacuum must complete DropRelFileNodeBuffers.
-- The vacuum must sleep for a few seconds to allow the checkpoint to complete
-- such that recovery starts with the Delete REDO.
-- We must crash before mdtruncate() does the physical truncate. If the physical
-- truncate happens then BLK_NOTFOUND will be returned and the Delete REDO skipped.
vacuum t;
--------------------------------------------------------

> On November 10, 2019 at 11:51 PM Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Fri, Nov 08, 2019 at 06:44:08PM -0800, Daniel Wood wrote:
> > I repro'ed on PG11 and PG10 STABLE but several months old.
> > I looked at 6d05086 but it doesn't address the core issue.
> >
> > DropRelFileNodeBuffers prevents the checkpoint from writing all
> > needed dirty pages for any REDO's that exist BEFORE the truncate.
> > If we crash after a checkpoint but before the physical truncate then
> > the REDO will need to replay the operation against the dirty page
> > that the Drop invalidated.
>
> I am beginning to look at this thread more seriously, and I'd like to
> first try to reproduce that by myself. Could you share the steps you
> used to do that? This includes any manual sleep calls you may have
> added, the timing of the crash, manual checkpoints, debugger
> breakpoints, etc. It may be possible to extract some more generic
> test from that.
> --
> Michael