On 27/05/2024 21:20, Michael Paquier wrote:
> On Fri, May 17, 2024 at 04:25:15PM -0400, Robert Haas wrote:
>> On Fri, May 17, 2024 at 4:20 PM Jeff Davis <pg...@j-davis.com> wrote:
>>> Regarding this particular change: the checkpointing hook seems more
>>> like a table AM feature, so I agree with you that we should have a good
>>> idea how a real table AM might use this, rather than only
>>> pg_stat_statements.
>>
>> I would even be OK with a pg_stat_statements example that is fully
>> working and fully explained. I just don't want to have no example at
>> all. The original proposal has been changed twice because of
>> complaints that the hook wasn't quite useful enough, but I think that
>> only proves that v3 is closer to being useful than v1. If v1 is 40% of
>> the way to useful and v3 is 120% of the way to useful, wonderful! But
>> if v1 is 20% of the way to being useful and v3 is 60% of the way to
>> being useful, it's not time to commit anything yet. I don't know which
>> is the case, and I think if someone wants this to be committed, they
>> need to explain clearly why it's the first and not the second.
>
> Please note that I've been studying ways to have pg_stat_statements
> being plugged in directly with the shared pgstat APIs to get it backed
> by a dshash to give more flexibility and scaling, giving a way for
> extensions to register their own stats kind. In this case, the flush
> of the stats would be controlled with a callback in the stats
> registered by the extensions, conflicting with what's proposed here.
> pg_stat_statements is all about stats, at the end. I don't want this
> argument to act as a barrier if a checkpoint hook is an accepted
> consensus here, but a checkpoint hook used for this code path is not
> the most intuitive solution I can think of in the long-term.
On the topic of concrete uses for this API: We have a bunch of built-in
resource managers that could be refactored to use this API.
CheckPointGuts currently looks like this:
/*
 * Flush all data in shared memory to disk, and fsync
 *
 * This is the common code shared between regular checkpoints and
 * recovery restartpoints.
 */
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
    CheckPointRelationMap();
    CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);
    CheckPointSnapBuild();
    CheckPointLogicalRewriteHeap();
    CheckPointReplicationOrigin();

    /* Write out all dirty data in SLRUs and the main buffer pool */
    TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
    CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
    CheckPointCLOG();
    CheckPointCommitTs();
    CheckPointSUBTRANS();
    CheckPointMultiXact();
    CheckPointPredicate();
    RmgrCheckpoint(flags, RMGR_CHECKPOINT_BEFORE_BUFFERS);
    CheckPointBuffers(flags);
    RmgrCheckpoint(flags, RMGR_CHECKPOINT_AFTER_BUFFERS);

    /* Perform all queued up fsyncs */
    TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
    CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
    ProcessSyncRequests();
    CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
    TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();

    /* We deliberately delay 2PC checkpointing as long as possible */
    CheckPointTwoPhase(checkPointRedo);
}
Of these calls, CheckPointCLOG would be natural as the rmgr_callback of
the clog rmgr. Similarly for CheckPointMultiXact and maybe a few others.
However, let's look at the pg_stat_statements patch (pgss_001.v1.patch):
It's now writing a new WAL record for every call to pgss_store(),
turning even simple queries into WAL-logged operations. That's a
non-starter. It will also not work on a standby. This needs to be
redesigned so that the data is updated in memory, and written to disk
and/or WAL-logged only periodically. Perhaps at checkpoints, but you
could do it more frequently too.
I'm not convinced that the stats should be WAL-logged. Do you want them
to be replicated and included in backups? Maybe, but maybe not. It's
certainly a change to how it currently works.
If we don't WAL-log the stats, we don't really need a custom RMGR for
it. We just need a checkpoint hook to flush the stats to disk, but we
don't need a registered RMGR ID for it.
So, I have a feeling that adding this to the rmgr interface is not quite
right. The rmgr callbacks are for things that run when WAL is
*replayed*, while checkpoints are related to how WAL is generated. Let's
design this as an independent hook, separate from rmgrs.
Another data point: In Neon, we actually had to add a little code to
checkpoints, to WAL-log some extra data. That was a quick hack and might
not be the right design in the first place, but these hooks would not
have been useful for our purposes. We wanted to write a new WAL record
at shutdown, and in CheckPointGuts() it's already too late for that; it
needs to be done earlier, before starting the shutdown checkpoint.
Similar to LogStandbySnapshot(), except that LogStandbySnapshot() is not
called at shutdown as we wanted. For a table AM, the point of a
checkpoint hook is probably to fsync() data that is managed outside of
the normal buffer manager, and CheckPointGuts() is the right place for
that, but other extensions might want to hook into checkpoints for other
reasons.
--
Heikki Linnakangas
Neon (https://neon.tech)