On Tue, Mar 9, 2021 at 10:43 PM David Steele <da...@pgmasters.net> wrote:
> On 11/30/20 6:38 PM, David Steele wrote: > > On 11/30/20 9:27 AM, Stephen Frost wrote: > >> * Michael Paquier (mich...@paquier.xyz) wrote: > >>> On Fri, Nov 27, 2020 at 11:15:27AM -0500, Stephen Frost wrote: > >>>> * Magnus Hagander (mag...@hagander.net) wrote: > >>>>> On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier > >>>>> <mich...@paquier.xyz> wrote: > >>>>>> But here the checksum is broken, so while the offset is something we > >>>>>> can rely on how do you make sure that the LSN is fine? A broken > >>>>>> checksum could perfectly mean that the LSN is actually *not* fine if > >>>>>> the page header got corrupted. > >>>> > >>>> Of course that could be the case, but it gets to be a smaller and > >>>> smaller chance by checking that the LSN read falls within reasonable > >>>> bounds. > >>> > >>> FWIW, I find that scary. > >> > >> There's ultimately different levels of 'scary' and the risk here that > >> something is actually wrong following these checks strikes me as being > >> on the same order as random bits being flipped in the page and still > >> getting a valid checksum (which is entirely possible with our current > >> checksum...), or maybe even less. > > > > I would say a lot less. First you'd need to corrupt one of the eight > > bytes that make up the LSN (pretty likely since corruption will probably > > affect the entire block) and then it would need to be updated to a value > > that falls within the current backup range, a 1 in 16 million chance if > > a terabyte of WAL is generated during the backup. Plus, the corruption > > needs to happen during the backup since we are going to check for that, > > and the corrupted LSN needs to be ascending, and the LSN originally read > > needs to be within the backup range (another 1 in 16 million chance) > > since pages written before the start backup checkpoint should not be > torn. > > > > So as far as I can see there are more likely to be false negatives from > > the checksum itself. > > > > It would also be easy to add a few rounds of checks, i.e. test if the > > LSN ascends but stays in the backup LSN range N times. > > > > Honestly, I'm much more worried about corruption zeroing the entire > > page. I don't know how likely that is, but I know none of our proposed > > solutions would catch it. > > > > Andres, since you brought this issue up originally perhaps you'd like to > > weigh in? > > I had another look at this patch and though I think my suggestions above > would improve the patch, I have no objections to going forward as is (if > that is the consensus) since this seems an improvement over what we have > now. > > It comes down to whether you prefer false negatives or false positives. > With the LSN checking Stephen and I advocate it is theoretically > possible to have a false negative but the chances of the LSN ascending N > times but staying within the backup LSN range due to corruption seems to > be approaching zero. > > I think Michael's method is unlikely to throw false positives, but it > seems at least possible that a block would be hot enough to be appear > torn N times in a row. Torn pages themselves are really easy to reproduce. > > If we do go forward with this method I would likely propose the > LSN-based approach as a future improvement. > > Regards, > -- > -David > da...@pgmasters.net > > > I am changing the status to "Waiting on Author" based on the latest comments of @David Steele <da...@pgmasters.net> and secondly the patch does not apply cleanly. http://cfbot.cputube.org/patch_33_2719.log -- Ibrar Ahmed