non-blocking delayChkpt

Andres Freund Tue, 20 Apr 2021 18:56:41 -0700

Hi,

During commits, and some other places, there's a short phase at which we
block checkpoints from starting:


                /*
                 * Mark ourselves as within our "commit critical section".  This
                 * forces any concurrent checkpoint to wait until we've updated
                 * pg_xact.  Without this, it is possible for the checkpoint to set
                 * REDO after the XLOG record but fail to flush the pg_xact update to
                 * disk, leading to loss of the transaction commit if the system
                 * crashes a little later.

One problem in the shared memory stats patch was that, to get rid of the
O(N) cost of pgstat_vacuum_stat(), commits/aborts should indicate which
stats they drop.

Because we wouldn't do the dropping of stats as part of
RecordTransactionCommit()'s critical section, that would have the danger
of the stats dropping not being executed if we crash after WAL logging
the commit record, but before dropping the stats.

It's worthwhile to note that currently dropping of relfilenodes (e.g. a
committing DROP TABLE or an aborting CREATE TABLE) has the same issue.


An obvious way to address that would be to set delayChkpt not just for
part of RecordTransactionCommit()/Abort(), but also during the
relfilenode/stats dropping. But that'd make it much more likely that
we'd actually prevent checkpoints from starting for a significant
amount of time.

Which led me to wonder: why do we need to *block* when starting a
checkpoint, waiting for a moment in which there are no concurrent
commits?

I think we could replace the boolean per-backend delayChkpt with a
per-backend LSN indicating a redo location that, for that backend, won't
cause recovery issues. For commits this LSN could e.g. be the current WAL
insert location, just before the XLogInsert() (I think we could optimize
that a bit, but that's a detail).  CreateCheckPoint() would then not loop
over HaveVirtualXIDsDelayingChkpt() before completing a checkpoint, but
instead compute the oldest LSN that any backend needs to have included
in the checkpoint.

Moving the redo pointer to before where any backend is in a commit
critical section seems to provide sufficient (and I think sometimes
stronger) protection against the hazards that delayChkpt aims to
prevent? And it could do so without blocking.
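
A minimal sketch of the redo computation described above, not actual
PostgreSQL code. BackendSlot, delayChkptLSN and ComputeCheckpointRedo()
are hypothetical names invented for illustration, and the sketch ignores
locking and memory ordering, which a real implementation would need when
reading other backends' state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* 64-bit WAL position */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical per-backend state replacing the boolean delayChkpt.
 * A backend would set delayChkptLSN to the WAL insert location just
 * before inserting its commit record, and reset it to
 * InvalidXLogRecPtr once it has left its critical section.
 */
typedef struct BackendSlot
{
    XLogRecPtr  delayChkptLSN;
} BackendSlot;

/*
 * Hypothetical helper: rather than waiting for backends to leave their
 * critical sections, move the redo pointer back to the oldest LSN any
 * backend has advertised.  Never moves the redo pointer forward.
 */
static XLogRecPtr
ComputeCheckpointRedo(const BackendSlot *slots, size_t nslots,
                      XLogRecPtr candidate_redo)
{
    XLogRecPtr  redo = candidate_redo;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        XLogRecPtr  lsn = slots[i].delayChkptLSN;

        /* Skip backends that are not in a critical section. */
        if (lsn != InvalidXLogRecPtr && lsn < redo)
            redo = lsn;
    }

    return redo;
}
```

With two backends advertising LSNs 500 and 900 and a candidate redo
location of 1000, this returns 500; with no backend in a critical
section it returns the candidate unchanged.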


I think the blocking by delayChkpt is already an issue in some busy
workloads, although it's hard to tell how much outside of artificial
workloads against modified versions of PG, given that we don't expose
such waits anywhere.  In particular, the fact that we now set delayChkpt
in MarkBufferDirtyHint() seems to make that a lot more likely.


Does this seem like a viable idea, or did I entirely miss the boat?

Greetings,

Andres Freund
