Offline activation of checksums via standby switchover (was: Online checksums patch - once again)

Michael Banck Wed, 10 Feb 2021 00:07:00 -0800

Hi,

On Wednesday, 10.02.2021 at 15:06 +0900, Michael Paquier wrote:
> On Tue, Feb 09, 2021 at 10:54:50AM +0200, Heikki Linnakangas wrote:
> > (I may have said this before, but) My overall high-level impression of this
> > patch is that it's really complex for a feature that you use maybe once in
> > the lifetime of a cluster. I'm happy to review but I'm not planning to
> > commit this myself. I don't object if some other committer picks this up
> > (Magnus?).
> 
> I was just looking at the latest patch set as a matter of curiosity,
> and I have a shared feeling.


I think this still would be a useful feature; not least for the online
deactivation - having to shut down the instance is sometimes just not an
option in production, even for just a few seconds.

However, there is also the shoot-the-whole-database-into-WAL issue (at
least, that is what happens, AIUI), which has not been discussed that
much either. The patch allows throttling, but I think the impact on
actual production workloads is not very clear yet.

> I think that this is a lot of complication in-core for what would be a
> one-time operation, particularly knowing that there are other ways to
> do it already with the offline checksum tool, even if that is more
> costly: 
> - Involve logical replication after initializing the new instance with
> --data-checksums, or in an upgrade scenario with pg_upgrade.

Logical replication is still somewhat impractical for such a (possibly)
routine task, and I don't understand your pg_upgrade scenario; can you
expand on that a bit?

> - Involve physical replication: stop the standby cleanly, enable
> checksums on it and do a switchover.

I would like to focus on this, so I changed the subject in order not to
derail the online activation patch thread.

If this is something we support, then we should document it.

I have to admit that this possibility escaped me when we first committed
offline (de)activation; it was brought to my attention via
https://twitter.com/samokhvalov/status/1281312586219188224 and the
following discussion.

So if we think this (to recap: shut down the standby, run pg_checksums
on it, start it up again, wait until it is back in sync, then
switchover) is a safe way to activate checksums on a streaming
replication setup, then we should document it, I think. However, I have
only seen sorta hand-waving on this so far and no deeper analysis of
what could possibly go wrong (but doesn't).
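
To make the recap concrete, here is a minimal sketch of the sequence as a
dry-run plan in Python. It only builds and prints the commands; it does not
run anything. The data directory paths and the function name are
hypothetical, the commands would run on different hosts in practice, and
whether this sequence is actually safe is exactly the open question above.

```python
import shlex
from typing import List


def standby_checksum_plan(standby_datadir: str,
                          primary_datadir: str) -> List[List[str]]:
    """Return the recap steps, in order, as argv lists (nothing is executed).

    Both data directory paths are placeholders supplied by the caller.
    """
    return [
        # 1. Cleanly shut down the standby.
        ["pg_ctl", "stop", "-D", standby_datadir, "-m", "fast"],
        # 2. Enable checksums offline on the stopped standby.
        ["pg_checksums", "--enable", "--progress", "-D", standby_datadir],
        # 3. Start the standby again so it resumes streaming.
        ["pg_ctl", "start", "-D", standby_datadir],
        # 4. On the primary: check how far the standby's replay lags (bytes).
        #    Repeat until the lag is (close to) zero.
        ["psql", "-XAtc",
         "SELECT application_name, "
         "pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) "
         "FROM pg_stat_replication"],
        # 5. Switchover: stop the old primary, then promote the standby.
        ["pg_ctl", "stop", "-D", primary_datadir, "-m", "fast"],
        ["pg_ctl", "promote", "-D", standby_datadir],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for argv in standby_checksum_plan("/srv/pg/standby", "/srv/pg/primary"):
        print(shlex.join(argv))
```

Note that after the switchover the old primary still has checksums
disabled, so the same steps would presumably have to be repeated on it
before it rejoins as a standby.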

Has anybody done further work/tests on this and/or has something written
up to contribute to the documentation? Or do we think this is not
appropriate to document? I think once we agree this is safe, it is not
more complicated than the rsync-the-standby-after-pg_upgrade recipe we
did document.

> Another thing we could do is to improve pg_checksums with a parallel
> mode.  The main design question would be how to distribute the I/O,
> and that would mean balancing at least across tablespaces.

Right. I thought about this a while ago, but didn't have time to work on
it so far.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz
