On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote: > David Steele (da...@pgmasters.net) wrote: >> Our current plan for pgBackRest: >> >> 1) Remove the LSN check as you have done in your patch and when rechecking >> see if the page has become valid *or* the LSN is ascending. >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it >> is valid. > > Yup, that's my recollection also as to our plans for how to improve > things here. > >> These do completely rule out any type of corruption, but they certainly >> narrows the possibility by a lot. > > *don't :)
Have you considered the possibility of only using pd_checksums for the validation? This is the only source of truth in the page header we can rely on to validate the full contents of the page, so if the logic relies on anything but the checksum then you expose the logic to risks of reporting pages as corrupted while they were just torn, or just miss corrupted pages, which is what we should avoid for such things. Both are bad. >> As for your patch, it mostly looks good but my objection is that a page may >> be reported as invalid after 5 retries when in fact it may just be very hot. > > Yeah.. while unlikely that it'd actually get written out that much, it > does seem at least possible. > >> Maybe checking for an ascending LSN is a good idea there as well? At least >> in that case we could issue a different warning, instead of "checksum >> verification failed" perhaps "checksum verification skipped due to >> concurrent modifications". > > +1. I don't quite understand how you can make sure that the page is not corrupted here? It could be possible that the last 4kB of a 8kB page got corrupted, where the header had valid data but failing the checksum verification. So if you are not careful you could have at hand a corrupted page discarded because of it failed the retry multiple times in a row. The only method I can think as being really reliable is based on two facts: - Do a check only on pd_checksums, as that validates the full contents of the page. - When doing a retry, make sure that there is no concurrent I/O activity in the shared buffers. This requires an API we don't have yet. -- Michael
signature.asc
Description: PGP signature