Greetings, * David Steele (da...@pgmasters.net) wrote: > On 11/20/20 2:28 AM, Michael Paquier wrote: > >On Mon, Nov 16, 2020 at 11:41:51AM +0100, Magnus Hagander wrote: > >>I was referring to the latest patch on the thread. But as I said, I have > >>not read up on all the different issues raised in the thread, so take it > >>with a big grain os salt. > >> > >>And I would also echo the previous comment that this code was adapted from > >>what the pgbackrest folks do. As such, it would be good to get a comment > >>from for example David on that -- I don't see any of them having commented > >>after that was mentioned? > > > >Agreed. I am adding Stephen as well in CC. From the code of > >backrest, the same logic happens in src/command/backup/pageChecksum.c > >(see pageChecksumProcess), where two checks on pd_upper and pd_lsn > >happen before verifying the checksum. So, if the page header finishes > >with random junk because of some kind of corruption, even corrupted > >pages would be incorrectly considered as correct if the random data > >passes the pd_upper and pg_lsn checks :/ > > Indeed, this is not good, as Andres pointed out some time ago. My apologies > for not getting to this sooner.
Yeah, it's been on our backlog to improve this. > Our current plan for pgBackRest: > > 1) Remove the LSN check as you have done in your patch and when rechecking > see if the page has become valid *or* the LSN is ascending. > 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it > is valid. Yup, that's my recollection also as to our plans for how to improve things here. > These do completely rule out any type of corruption, but they certainly > narrows the possibility by a lot. *don't :) > In the future we would also like to scan the WAL to verify that the page is > definitely being written to. Yeah, that'd certainly be nice to do too. > As for your patch, it mostly looks good but my objection is that a page may > be reported as invalid after 5 retries when in fact it may just be very hot. Yeah.. while unlikely that it'd actually get written out that much, it does seem at least possible. > Maybe checking for an ascending LSN is a good idea there as well? At least > in that case we could issue a different warning, instead of "checksum > verification failed" perhaps "checksum verification skipped due to > concurrent modifications". +1. Thanks, Stephen
signature.asc
Description: PGP signature