Hi Daniel,

I took a look at the patch again, focusing mostly on the docs and
comments. I think the code is OK - I haven't noticed anything serious.
testing
-------

I'm running the TAP tests - so far it looks fine. I've done 2000
iterations of the "single" test, and I'm now running ~2000 iterations
of the "standby" test. No issues/failures so far.

The question is whether we can/should make it even more "random", by
doing restarts in more places etc. I might give it a try if I happen to
have some free time, but no promises - I'm not sure how feasible it is.
Making it "truly random" means it's hard to deduce what the actual
state of checksums should be, etc.

review
------

Attached is a patch adding FIXME/XXX comments to a bunch of places,
which I think makes it clearer which places I'm talking about. I'll
briefly go over the items, and maybe explain them a bit more.

1) func-admin.sgml

- This is missing documentation for the "fast" parameter of both
  functions (enable/disable).

- The paragraph starts with "Initiates data checksums for ..." but that
  doesn't sound right to me. I think you can initiate
  enabling/disabling, not "data checksums".

- I think the docs should at least briefly describe the impact on the
  cluster, and also on a standby, due to having to write everything
  into WAL, waiting for checkpoints etc. And maybe discuss how to
  mitigate that in some way. More about the standby stuff later.

2) glossary.sgml

- This describes "checksum worker" as a process that enables or
  disables checksums in a specific database, but we don't need any
  per-database processes when disabling, no?

3) monitoring.sgml

- I'm not sure what "regardless" implies here. Does that mean we simply
  don't hide/reset the counters?

- I added a brief explanation about using the "launcher" row for
  overall progress, and the per-database workers for "current
  progress".

- Do we want to refer to "datachecksumsworker"? Isn't that a bit too
  low-level a detail?

- The table of phases is missing the "waiting" and "done" phases. IMHO
  if the progress view can return it, it should be in the docs.
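FWIW to illustrate the launcher/worker split from (3), the usage I have
in mind is roughly this. Just a sketch against this patch version - the
function signature and the pid/phase columns are the ones the patch
touches, but I haven't double-checked the rest of the view definition:

```sql
-- start enabling checksums, throttled like cost-based vacuum delay;
-- the (currently undocumented) "fast" parameter is the third argument
SELECT pg_enable_data_checksums(cost_delay => 10, cost_limit => 200,
                                fast => false);

-- the launcher row tracks the overall progress (databases processed /
-- remaining), the worker rows track the databases currently processed
SELECT pid, phase
  FROM pg_stat_progress_data_checksums;
```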
4) wal.sgml

- I added a sentence explaining that both enabling and disabling happen
  in phases, with checkpoints in between. I think that's helpful for
  users and DBAs.

- The section only described "enabling checksums", but it probably
  should explain the process for disabling too. Added a para.

- Again, I think we should explain the checkpoints, restartpoints,
  impact on standby ... somewhere. Not sure if this is the right place.

5) xlog.c

- Some minor corrections (typos, ...).

- Isn't the claim that PG_DATA_CHECKSUM_ON_VERSION is the only place
  where we check InitialDataChecksumTransition stale? AFAIK we now
  check this in AbsorbDataChecksumsBarrier for all transitions, no?

6) datachecksumsworker.c

- I understand what the comments at the beginning of the file say, but
  I suspect that's partially due to already knowing the code. There are
  a couple of places that might be more explicit, I think. For example:

- One of the items in the synchronization/correctness section states
  that "Backends SHALL NOT violate local data_checksums state", but
  what does "violating local data_checksums state" mean? What even is
  "local state" in this context? Should we explain/define that, or
  would that be unnecessarily detailed?

- The section only talks about "enabling checksums" but then discusses
  all four possible states. Maybe it should talk about disabling too,
  as that requires the same synchronization/correctness.

- Maybe it'd be good to make it more explicit at which point the
  process waits on a barrier, which backend initiates that (and which
  backends are required to absorb the signal). It kinda is there, but
  only indirectly.

- Another idea I had is that maybe it'd help to have some visualization
  of the process (with data_checksum states, barriers, ...), e.g. in
  the form of an ASCII image.

open questions
--------------

For me the main remaining question is what impact people should expect
in production systems, and maybe ways to mitigate that. In single-node
systems this is entirely fine, I think.
There will be checkpoints, but those will be "spread out", and it's
just the checksum worker waiting for them to complete. It'll write
everything into WAL, but that's fairly well understood / should be
expected. We should probably mention that in the sgml docs, so that
people are not surprised when their WAL archive gets huge.

I'm much more concerned about streaming replicas, because there it
forces a restart point - and it *blocks redo* until it completes. Which
means there'll be replication lag, and for synchronous standbys this
would even block progress on the primary. We should document this very
clearly.

But I'm also wondering if we might mitigate this in some way / reduce
the impact. I see some similarities to shutdown checkpoints, which can
take a lot of time if there happens to be a lot of dirty data,
increasing the disruption during planned restarts (when no one can
connect). A common mitigation is to run CHECKPOINT shortly before the
restart, to flush most of the dirty data while still allowing new
connections. Maybe we could do something like that for checksum
changes?

I don't know exactly how we could do that, but let's say we can predict
roughly when to expect the next state change. We'd then ensure the
standby starts flushing stuff before that, so that creating the
restartpoint is cheap. Or maybe we'd (gradually?) lower max_wal_size on
the standby, to reduce the amount of WAL as we're getting closer to the
end?

regards

-- 
Tomas Vondra
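PS: To make the CHECKPOINT analogy a bit more concrete, the manual
mitigation I have in mind would be something like this on the standby,
shortly before the expected state change. Just a sketch - the 2GB value
is made up, and whether this actually reduces the redo stall enough is
exactly the open question:

```sql
-- trigger a restartpoint now, flushing most of the dirty data before
-- the checksum state change forces the blocking restartpoint
CHECKPOINT;

-- and/or make restartpoints cheaper/more frequent as the worker nears
-- the end (max_wal_size is sighup, so a reload is enough)
ALTER SYSTEM SET max_wal_size = '2GB';
SELECT pg_reload_conf();
```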
From 40d26a0406a213207101a837999e8b57df25fc1d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Wed, 5 Nov 2025 13:26:41 +0100
Subject: [PATCH v20251105 2/3] review

---
 doc/src/sgml/func/func-admin.sgml            | 20 ++++++++++++-----
 doc/src/sgml/glossary.sgml                   |  2 +-
 doc/src/sgml/monitoring.sgml                 | 23 ++++++++++++++++++--
 doc/src/sgml/wal.sgml                        | 18 +++++++++++++++
 src/backend/access/transam/xlog.c            | 17 +++++++++------
 src/backend/postmaster/datachecksumsworker.c | 14 ++++++++++++
 src/backend/storage/ipc/ipci.c               |  1 -
 src/include/postmaster/datachecksumsworker.h |  2 +-
 src/tools/pgindent/typedefs.list             |  2 ++
 9 files changed, 82 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/func/func-admin.sgml b/doc/src/sgml/func/func-admin.sgml
index f3a8782ede0..4d6f6a7e486 100644
--- a/doc/src/sgml/func/func-admin.sgml
+++ b/doc/src/sgml/func/func-admin.sgml
@@ -3008,11 +3008,11 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
         <indexterm>
          <primary>pg_enable_data_checksums</primary>
         </indexterm>
-        <function>pg_enable_data_checksums</function> ( <optional><parameter>cost_delay</parameter> <type>int</type>, <parameter>cost_limit</parameter> <type>int</type></optional> )
+        <function>pg_enable_data_checksums</function> ( <optional><parameter>cost_delay</parameter> <type>int</type>, <parameter>cost_limit</parameter> <type>int</type></optional>, <parameter>fast</parameter> <type>bool</type></optional> )
         <returnvalue>void</returnvalue>
        </para>
        <para>
-        Initiates data checksums for the cluster. This will switch the data
+        Initiates the process of enabling data checksums for the cluster. This will switch the data
         checksums mode to <literal>inprogress-on</literal> as well as start a
         background worker that will process all pages in the database and
         enable checksums on them. When all data pages have had checksums
@@ -3023,7 +3023,9 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
         If <parameter>cost_delay</parameter> and <parameter>cost_limit</parameter> are
         specified, the speed of the process is throttled using the same principles as
         <link linkend="runtime-config-resource-vacuum-cost">Cost-based Vacuum Delay</link>.
-       </para></entry>
+       </para>
+       <para>FIXME missing documentation of the "fast" parameter</para>
+      </entry>
      </row>

      <row>
@@ -3031,7 +3033,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
        <indexterm>
         <primary>pg_disable_data_checksums</primary>
        </indexterm>
-       <function>pg_disable_data_checksums</function> ()
+       <function>pg_disable_data_checksums</function> ( <parameter>fast</parameter> <type>bool</type></optional> )
        <returnvalue>void</returnvalue>
       </para>
       <para>
@@ -3042,12 +3044,20 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
        changed to <literal>off</literal>. At this point the data pages will
        still have checksums recorded but they are not updated when pages
        are modified.
-      </para></entry>
+      </para>
+      <para>FIXME missing documentation of the "fast" parameter</para>
+      </entry>
      </row>
     </tbody>
    </tgroup>
   </table>

+  <para>
+   FIXME I think this should briefly explain how this interacts with checkpoints (through
+   the fast parameters). It should probably also discuss how this affects streaming standby
+   due to forcing a restart point, etc. And maybe comment on possible mitigations?
+  </para>
+
  </sect2>
 </sect1>

diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 9bac0c96348..3ba0e8c6c5c 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -580,7 +580,7 @@
     <glossdef>
      <para>
       An <glossterm linkend="glossary-auxiliary-proc">auxiliary process</glossterm>
-      which enables or disables data checksums in a specific database.
+      which enables data checksums in a specific database.
      </para>
     </glossdef>
    </glossentry>

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b56e220f3d8..7efa1af746a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3554,7 +3554,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        database (or on a shared object). Detected failures are reported
        regardless of the <xref linkend="guc-data-checksums"/> setting.
-      </para></entry>
+      </para>
+      <para>XXX I'm not sure what "regardless" means in this context. I guess
+      it means we don't reset the counters when disabling checksums?</para>
+      </entry>
      </row>

      <row>
@@ -6959,6 +6962,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
   <structname>pg_stat_progress_data_checksums</structname> view will contain
   a row for the launcher process, and one row for each worker process which is
   currently calculating checksums for the data pages in one database.
+  The launcher provides an overview of the overall progress (how many databases
+  have been processed, how many remain), while the workers track progress for
+  currently processed databases.
  </para>

  <table id="pg-stat-progress-data-checksums-view" xreflabel="pg_stat_progress_data_checksums">
@@ -6984,7 +6990,8 @@ FROM pg_stat_get_backend_idset() AS backendid;
       <structfield>pid</structfield> <type>integer</type>
      </para>
      <para>
-      Process ID of a datachecksumworker process.
+      Process ID of a datachecksumsworker process.
+      FIXME Is datachecksumsworker defined somewhere? Does it refer to the launcher too?
      </para>
     </entry>
    </row>
@@ -7127,6 +7134,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
      The command is currently disabling data checksums on the cluster.
     </entry>
    </row>
+   <row>
+    <entry><literal>waiting</literal></entry>
+    <entry>
+     FIXME
+    </entry>
+   </row>
    <row>
     <entry><literal>waiting on temporary tables</literal></entry>
     <entry>
@@ -7141,6 +7154,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
     state before finishing.
    </entry>
   </row>
+  <row>
+   <entry><literal>done</literal></entry>
+   <entry>
+    FIXME
+   </entry>
+  </row>
  </tbody>
 </tgroup>
 </table>

diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index 0ada90ca0b1..8ef16d7769f 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -284,6 +284,11 @@
    <link linkend="functions-admin-checksum">functions</link>.
   </para>

+  <para>
+   Both enabling and disabling checksums happen in two phases, separated by
+   a checkpoint to ensure durability.
+  </para>
+
   <para>
    Enabling checksums will put the cluster checksum mode in
    <literal>inprogress-on</literal> mode. During this time, checksums will be
@@ -314,6 +319,14 @@
    no support for resuming work from where it was interrupted.
   </para>

+  <para>
+   Disabling checksums will put the cluster checksum mode in
+   <literal>inprogress-off</literal> mode. During this time, checksums will be
+   written but not verified. After all processes acknowledge the change,
+   the mode will automatically switch to <literal>off</literal>. In this case
+   rewriting all blocks is not needed, but checkpoints are still required.
+  </para>
+
   <note>
    <para>
     Enabling checksums can cause significant I/O to the system, as most of the
@@ -324,6 +337,11 @@
    </para>
   </note>

+  <para>
+   XXX Maybe this is the place that should mention checkpoints/restartpoints,
+   how it impacts systems/replication and how to mitigate that?
+  </para>
+
  </sect2>
 </sect1>

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f7633f47551..807b82eed4f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -607,7 +607,7 @@ typedef struct ChecksumBarrierCondition
 	int			barrier_ne[MAX_BARRIER_CONDITIONS];
 	/* The number of elements in the barrier_ne set */
 	int			barrier_ne_sz;
-}			ChecksumBarrierCondition;
+} ChecksumBarrierCondition;

 static const ChecksumBarrierCondition checksum_barriers[] =
 {
@@ -618,7 +618,6 @@ static const ChecksumBarrierCondition checksum_barriers[] =
 	{-1}
 };

-
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -694,10 +693,10 @@ static TimeLineID LocalMinRecoveryPointTLI;
 static bool updateMinRecoveryPoint = true;

 /*
- * Local state fror Controlfile data_checksum_version. After initialization
+ * Local state for Controlfile data_checksum_version. After initialization
  * this is only updated when absorbing a procsignal barrier during interrupt
  * processing. The reason for keeping a copy in backend-private memory is to
- * avoid locking for interrogating checksum state. Possible values are the
+ * avoid locking for interrogating the checksum state. Possible values are the
  * checksum versions defined in storage/bufpage.h as well as zero when data
  * checksums are disabled.
  */
@@ -712,6 +711,10 @@ static uint32 LocalDataChecksumVersion = 0;
  * processing the barrier. This may happen if the process is spawned between
  * the update of XLogCtl->data_checksum_version and the barrier being emitted.
  * This can only happen on the very first barrier so mark that with this flag.
+ *
+ * XXX Is PG_DATA_CHECKSUM_ON_VERSION still the only transition with an assert?
+ * I think it was replaced by checking checksum_barriers for every transition,
+ * with elog(ERROR), no?
  */
 static bool InitialDataChecksumTransition = true;

@@ -4335,7 +4338,7 @@ InitControlFile(uint64 sysidentifier, uint32 data_checksum_version)

 	/*
 	 * Set the data_checksum_version value into XLogCtl, which is where all
-	 * processes get the current value from. (Maybe it should go just there?)
+	 * processes get the current value from.
 	 */
 	XLogCtl->data_checksum_version = data_checksum_version;
 }
@@ -4921,7 +4924,7 @@ SetDataChecksumsOff(bool immediate_checkpoint)
 	uint64		barrier;
 	int			flags;

-	Assert(ControlFile);
+	Assert(ControlFile != NULL);

 	SpinLockAcquire(&XLogCtl->info_lck);

@@ -5030,7 +5033,7 @@ SetDataChecksumsOff(bool immediate_checkpoint)
 bool
 AbsorbDataChecksumsBarrier(int target_state)
 {
-	const		ChecksumBarrierCondition *condition = checksum_barriers;
+	const ChecksumBarrierCondition *condition = checksum_barriers;
 	int			current = LocalDataChecksumVersion;
 	bool		found = false;

diff --git a/src/backend/postmaster/datachecksumsworker.c b/src/backend/postmaster/datachecksumsworker.c
index 3deb57a96de..67045b9014d 100644
--- a/src/backend/postmaster/datachecksumsworker.c
+++ b/src/backend/postmaster/datachecksumsworker.c
@@ -52,6 +52,9 @@
  * - Data checksums SHALL NOT be considered enabled cluster-wide until all
  *   currently connected backends have the local state "enabled"
  *
+ * FIXME I'm not 100% sure I understand what the two above points say. What does
+ * "violate local data_checksums state" mean?
+ *
  * There are two levels of synchronization required for enabling data checksums
  * in an online cluster: (i) changing state in the active backends ("on",
  * "off", "inprogress-on" and "inprogress-off"), and (ii) ensuring no
@@ -60,6 +63,10 @@
  * latter with ensuring that any concurrent activity cannot break the data
  * checksum contract during processing.
  *
+ * FIXME This para talks about "enabling" but then mentions all four states,
+ * including "inprogress-off" and "off". Maybe it should talk about "changing
+ * data_checksums" instead?
+ *
  * Synchronizing the state change is done with procsignal barriers, where the
  * WAL logging backend updating the global state in the controlfile will wait
  * for all other backends to absorb the barrier. Barrier absorption will happen
@@ -88,6 +95,10 @@
  * enables data checksums cluster-wide, there are four sets of backends where
  * Bd shall be an empty set:
  *
+ * FIXME Maybe mention which process initiates the procsignalbarrier?
+ * FIXME Don't we actually wait for the barrier before we start rewriting data?
+ * I think Bd has to be empty prior to that, otherwise it might break checksums.
+ *
  * Bg: Backend updating the global state and emitting the procsignalbarrier
  * Bd: Backends in "off" state
  * Be: Backends in "on" state
@@ -124,6 +135,8 @@
  *   stop writing data checksums as no backend is enforcing data checksum
  *   validation any longer.
  *
+ * XXX Maybe it'd make sense to "visualize" the progress between the states
+ * and barriers in some way. Say, by drawing an ASCII diagram.
  *
  * Potential optimizations
  * -----------------------
@@ -148,6 +161,7 @@
  *   to enable checksums on a cluster which is in inprogress-on state and
  *   may have checksummed pages (make pg_checksums be able to resume an
  *   online operation).
+ * - Restartability (not necessarily with page granularity).
  *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group

diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 44213d140ae..9014e90f1c7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -31,7 +31,6 @@
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/datachecksumsworker.h"
-#include "postmaster/postmaster.h"
 #include "postmaster/walsummarizer.h"
 #include "replication/logicallauncher.h"
 #include "replication/origin.h"

diff --git a/src/include/postmaster/datachecksumsworker.h b/src/include/postmaster/datachecksumsworker.h
index 2cd066fd0fe..0daef709ec8 100644
--- a/src/include/postmaster/datachecksumsworker.h
+++ b/src/include/postmaster/datachecksumsworker.h
@@ -24,7 +24,7 @@ typedef enum DataChecksumsWorkerOperation
 	ENABLE_DATACHECKSUMS,
 	DISABLE_DATACHECKSUMS,
 	/* TODO: VERIFY_DATACHECKSUMS, */
-}			DataChecksumsWorkerOperation;
+} DataChecksumsWorkerOperation;

 /*
  * Possible states for a database entry which has been processed. Exported

diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9d6b4e57cf3..3049e6018b3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -417,6 +417,7 @@ CheckPointStmt
 CheckpointStatsData
 CheckpointerRequest
 CheckpointerShmemStruct
+ChecksumBarrierCondition
 ChecksumType
 Chromosome
 CkptSortItem
@@ -587,6 +588,7 @@ CustomScan
 CustomScanMethods
 CustomScanState
 CycleCtr
+DataChecksumsWorkerOperation
 DBState
 DCHCacheEntry
 DEADLOCK_INFO
-- 
2.51.0
