Re: Online verification of checksums

Ibrar Ahmed Fri, 09 Jul 2021 13:01:17 -0700

On Tue, Mar 9, 2021 at 10:43 PM David Steele <da...@pgmasters.net> wrote:


> On 11/30/20 6:38 PM, David Steele wrote:
> > On 11/30/20 9:27 AM, Stephen Frost wrote:
> >> * Michael Paquier (mich...@paquier.xyz) wrote:
> >>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
> >>>> * Magnus Hagander (mag...@hagander.net) wrote:
> >>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier
> >>>>> <mich...@paquier.xyz> wrote:
> >>>>>> But here the checksum is broken, so while the offset is something we
> >>>>>> can rely on how do you make sure that the LSN is fine?  A broken
> >>>>>> checksum could perfectly mean that the LSN is actually *not* fine if
> >>>>>> the page header got corrupted.
> >>>>
> >>>> Of course that could be the case, but it gets to be a smaller and
> >>>> smaller chance by checking that the LSN read falls within reasonable
> >>>> bounds.
> >>>
> >>> FWIW, I find that scary.
> >>
> >> There's ultimately different levels of 'scary' and the risk here that
> >> something is actually wrong following these checks strikes me as being
> >> on the same order as random bits being flipped in the page and still
> >> getting a valid checksum (which is entirely possible with our current
> >> checksum...), or maybe even less.
> >
> > I would say a lot less. First you'd need to corrupt one of the eight
> > bytes that make up the LSN (pretty likely since corruption will probably
> > affect the entire block) and then it would need to be updated to a value
> > that falls within the current backup range, a 1 in 16 million chance if
> > a terabyte of WAL is generated during the backup. Plus, the corruption
> > needs to happen during the backup since we are going to check for that,
> > and the corrupted LSN needs to be ascending, and the LSN originally read
> > needs to be within the backup range (another 1 in 16 million chance)
> > since pages written before the start backup checkpoint should not be
> torn.
> >
> > So as far as I can see there are more likely to be false negatives from
> > the checksum itself.
> >
> > It would also be easy to add a few rounds of checks, i.e. test if the
> > LSN ascends but stays in the backup LSN range N times.
> >
> > Honestly, I'm much more worried about corruption zeroing the entire
> > page. I don't know how likely that is, but I know none of our proposed
> > solutions would catch it.
> >
> > Andres, since you brought this issue up originally perhaps you'd like to
> > weigh in?
>
> I had another look at this patch and though I think my suggestions above
> would improve the patch, I have no objections to going forward as is (if
> that is the consensus) since this seems an improvement over what we have
> now.
>
> It comes down to whether you prefer false negatives or false positives.
> With the LSN checking Stephen and I advocate it is theoretically
> possible to have a false negative but the chances of the LSN ascending N
> times but staying within the backup LSN range due to corruption seems to
> be approaching zero.
>
> I think Michael's method is unlikely to throw false positives, but it
> seems at least possible that a block would be hot enough to be appear
> torn N times in a row. Torn pages themselves are really easy to reproduce.
>
> If we do go forward with this method I would likely propose the
> LSN-based approach as a future improvement.
>
> Regards,
> --
> -David
> da...@pgmasters.net
>
>
>
I am changing the status to "Waiting on Author" based on the latest
comments of @David Steele <da...@pgmasters.net>
and secondly the patch does not apply cleanly.

http://cfbot.cputube.org/patch_33_2719.log



-- 
Ibrar Ahmed

Re: Online verification of checksums

Reply via email to