On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all?  It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.
I was reviewing this area today, and that matches my impression: why do we
need an LSN-based check at all?  As said upthread, that check is of course
weak with random data, as we would miss most of the real checksum failures,
with the odds of detection improving only as the cluster's current LSN
moves forward.  It also seems to me that removing this check entirely would
bring an extra advantage: it would become possible to verify pages that are
more recent than the start LSN of the backup, and on a large cluster that
could be a lot of pages.  So by keeping this check we also delay the
detection of real problems.

As things stand, I'd like to think that it would be much more useful to
remove this check and to add one or two extra retries (the current code
retries only once).  I don't much like the possibility of false positives
for such critical checks, but as we need to live with what has been
released, that looks like a good move for the stable branches.
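To put the proposal in concrete terms, here is a minimal C sketch of such a
retry loop.  The helpers read_page(), compute_checksum() and
stored_checksum() are hypothetical stand-ins for this illustration, not the
actual basebackup.c routines:

/*
 * Sketch of retry-based page verification, assuming hypothetical
 * read_page(), compute_checksum() and stored_checksum() helpers.
 * An illustration of the idea discussed above, not the real code.
 */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE        8192
#define CHECKSUM_RETRIES 3      /* initial read plus two re-reads */

/* Hypothetical helpers assumed to exist for this sketch. */
extern bool read_page(int fd, uint32_t blkno, char *buf);
extern uint16_t compute_checksum(const char *page, uint32_t blkno);
extern uint16_t stored_checksum(const char *page);

static bool
verify_page_with_retries(int fd, uint32_t blkno)
{
    char page[PAGE_SIZE];

    for (int attempt = 0; attempt < CHECKSUM_RETRIES; attempt++)
    {
        if (!read_page(fd, blkno, page))
            return false;       /* I/O error, reported separately */

        /*
         * If the checksum matches, the page is good.  If not, the page
         * may simply have been caught mid-write by a concurrent backend,
         * so re-read and re-verify before reporting corruption.
         */
        if (compute_checksum(page, blkno) == stored_checksum(page))
            return true;
    }

    /* Still failing after all retries: a real checksum failure. */
    return false;
}

With CHECKSUM_RETRIES set to 3, this gives two re-reads after the initial
attempt, along the lines suggested above.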
--
Michael