Hi, During commits, and some other places, there's a short phase at which we block checkpoints from starting:
/* * Mark ourselves as within our "commit critical section". This * forces any concurrent checkpoint to wait until we've updated * pg_xact. Without this, it is possible for the checkpoint to set * REDO after the XLOG record but fail to flush the pg_xact update to * disk, leading to loss of the transaction commit if the system * crashes a little later. One problem in the shared memory stats patch was that, to get rid of the O(N) cost of pgstat_vacuum_stat(), commits/aborts should inform which stats they drop. Because we wouldn't do the dropping of stats as part of RecordTransactionCommit()'s critical section, that would have the danger of the stats dropping not being executed if we crash after WAL logging the commit record, but before dropping the stats. It's worthwhile to note that currently dropping of relfilenodes (e.g. a committing DROP TABLE or an aborting CREATE TABLE) has the same issue. An obvious way to address that would be to set delayChkpt not just for part of RecordTransactionCommit()/Abort(), but also during the relfilenode/stats dropping. But obviously that'd make it much more likely that we'd actually prevent checkpoints from starting for a significant amount of time. Which lead me to wonder why we need to *block* when starting a checkpoint, waiting for a moment in which there are no concurrent commits? I think we could replace the boolean per-backend delayChkpt with per-backend LSNs that indicate an LSN that for the backend won't cause recovery issues. For commits this LSN could e.g. be the current WAL insert location, just before the XLogInsert() (but I think we could optimize that a bit, but that's details). CreateCheckPoint() would then not loop over HaveVirtualXIDsDelayingChkpt() before completing a checkpoint, but instead compute the oldest LSN that any backend needs to be included in the checkpoint. Moving the redo pointer to before where any backend is in a commit critical section seems to provide sufficient (and I think sometimes stronger) protection against the hazards that delayChkpt aims to prevent? And it could do so without blocking. I think the blocking by delayChkpt is already an issue in some busy workloads, although it's hard to tell how much outside of artificial workloads against modified versions of PG, given that we don't expose such waits anywhere. Particularly that we now set delayChkpt in MarkBufferDirtyHint() seems to make that a lot more likely. Does this seem like a viable idea, or did I entirely miss the boat? Greetings, Andres Freund