Re: Online verification of checksums

David Steele Tue, 09 Mar 2021 09:44:08 -0800

On 11/30/20 6:38 PM, David Steele wrote:

On 11/30/20 9:27 AM, Stephen Frost wrote:
* Michael Paquier ([email protected]) wrote:
On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote:
* Magnus Hagander ([email protected]) wrote:
On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier<[email protected]> wrote:
But here the checksum is broken, so while the offset is something we
can rely on how do you make sure that the LSN is fine?  A broken
checksum could perfectly mean that the LSN is actually *not* fine if
the page header got corrupted.
Of course that could be the case, but it gets to be a smaller and
smaller chance by checking that the LSN read falls within reasonable
bounds.
FWIW, I find that scary.
There's ultimately different levels of 'scary' and the risk here that
something is actually wrong following these checks strikes me as being
on the same order as random bits being flipped in the page and still
getting a valid checksum (which is entirely possible with our current
checksum...), or maybe even less.
I would say a lot less. First you'd need to corrupt one of the eight bytes that make up the LSN (pretty likely since corruption will probably affect the entire block) and then it would need to be updated to a value that falls within the current backup range, a 1 in 16 million chance if a terabyte of WAL is generated during the backup. Plus, the corruption needs to happen during the backup since we are going to check for that, and the corrupted LSN needs to be ascending, and the LSN originally read needs to be within the backup range (another 1 in 16 million chance) since pages written before the start backup checkpoint should not be torn.
So as far as I can see there are more likely to be false negatives from the checksum itself.
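
For what it's worth, the 1 in 16 million figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a corrupted LSN behaves like a uniformly random 64-bit value, that a terabyte means 2^40 bytes, and that the two range checks are independent. The helper name is made up for illustration.

```python
from fractions import Fraction

LSN_BITS = 64        # an LSN is an 8-byte value
TERABYTE = 2 ** 40   # assumption: a binary terabyte of WAL during the backup


def chance_random_lsn_in_backup_range(wal_bytes, lsn_bits=LSN_BITS):
    """Chance that a uniformly random LSN lands inside the WAL range
    generated during the backup (hypothetical helper, illustration only)."""
    return Fraction(wal_bytes, 2 ** lsn_bits)


one_hit = chance_random_lsn_in_backup_range(TERABYTE)
print(one_hit)            # 1/16777216, i.e. about 1 in 16 million

# Assumption: the originally read LSN and the corrupted LSN land in the
# backup range independently, so the two chances multiply.
print(one_hit * one_hit)  # 1/281474976710656
```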
It would also be easy to add a few rounds of checks, i.e. test if the LSN ascends but stays in the backup LSN range N times.
Honestly, I'm much more worried about corruption zeroing the entire page. I don't know how likely that is, but I know none of our proposed solutions would catch it.
Andres, since you brought this issue up originally perhaps you'd like to weigh in?

I had another look at this patch and though I think my suggestions above would improve the patch, I have no objections to going forward as is (if that is the consensus) since this seems an improvement over what we have now.

It comes down to whether you prefer false negatives or false positives. With the LSN checking Stephen and I advocate it is theoretically possible to have a false negative but the chances of the LSN ascending N times but staying within the backup LSN range due to corruption seem to be approaching zero.
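
To make the LSN check concrete, here is a rough Python sketch of the decision rule, not code from any patch. It assumes LSNs are plain integers, that the caller has already re-read the page N times after a checksum failure and collected the LSN seen on each read, and that "ascends" means each LSN is no lower than the one before it. The function and parameter names are hypothetical.

```python
def lsn_suggests_concurrent_write(observed_lsns, backup_start_lsn, current_lsn):
    """Return True if every LSN seen across the re-reads stays inside the
    backup LSN range and never goes backwards (illustration only).

    True means the checksum failure looks like a page being rewritten
    during the backup; False means it should be reported as corruption.
    """
    previous = None
    for lsn in observed_lsns:
        # The LSN must fall within the range covered by the backup so far.
        if not (backup_start_lsn <= lsn <= current_lsn):
            return False
        # Assumption: "ascending" is read as non-decreasing.
        if previous is not None and lsn < previous:
            return False
        previous = lsn
    # With no observations there is nothing to support a concurrent write.
    return bool(observed_lsns)


# Three re-reads, all inside the backup range and ascending.
print(lsn_suggests_concurrent_write([110, 120, 130], 100, 200))  # True
# Second read falls before the start of the backup.
print(lsn_suggests_concurrent_write([110, 90], 100, 200))        # False
```

A random value would have to pass the range test on every round, which is why adding rounds pushes the chance of a false negative toward zero.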

I think Michael's method is unlikely to throw false positives, but it seems at least possible that a block would be hot enough to appear torn N times in a row. Torn pages themselves are really easy to reproduce.
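
The false positive case can be pictured with a small Python sketch. This is not Michael's patch. It assumes only what is described here: the block is re-read up to N times and a failure is reported if the checksum fails on every attempt. The callables `read_block` and `checksum_ok` are hypothetical stand-ins.

```python
def block_fails_after_retries(read_block, checksum_ok, max_attempts):
    """Return True if the checksum fails on every one of max_attempts reads
    (illustration only; read_block and checksum_ok are supplied by the caller)."""
    for _ in range(max_attempts):
        if checksum_ok(read_block()):
            return False  # one clean read is enough to accept the block
    return True           # failed every time, so it would be reported


# A hot block that happens to be torn on each of three reads is reported,
# even though nothing is wrong with it on disk.
torn_reads = iter(["torn", "torn", "torn"])
print(block_fails_after_retries(lambda: next(torn_reads),
                                lambda page: page == "ok", 3))  # True

# A block that is torn once and then read cleanly is accepted.
mixed_reads = iter(["torn", "ok"])
print(block_fails_after_retries(lambda: next(mixed_reads),
                                lambda page: page == "ok", 3))  # False
```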

If we do go forward with this method I would likely propose the LSN-based approach as a future improvement.


Regards,
--
-David
[email protected]
