On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> Actually, after thinking about that a bit more: why is there an LSN-based
> special condition at all?  It seems like it'd be far more useful to
> checksum everything, and on failure try to re-read and re-verify the page
> once or twice, so as to handle the corner case where we examine a page
> that's in process of being overwritten.
I was reviewing this area today, and that matches my impression: why do we
need an LSN-based check at all?  As said upthread, that check is of course
weak with random data, as we would miss most of the real checksum failures,
with the odds of detection improving only as the cluster's current LSN
moves forward.  It also seems to me that removing this check entirely would
bring an extra advantage: it would become possible to verify pages that are
more recent than the start LSN of the backup, and on a large cluster that
could be a lot of pages.  So by keeping this check we also delay the
detection of real problems.

As things stand, I'd like to think that it would be much more useful to
remove this check and to add one or two extra retries (the current code
retries only once).  I don't much like the possibility of false positives
for such critical checks, but as we need to live with what has been
released, that looks like a good move for the stable branches.
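To put the proposal in concrete terms, here is a minimal C sketch of such a
retry loop.  The helpers read_page(), compute_checksum() and
stored_checksum() are hypothetical stand-ins for this illustration, not the
actual basebackup.c routines:

/*
 * Sketch of retry-based page verification, assuming hypothetical
 * read_page(), compute_checksum() and stored_checksum() helpers.
 * An illustration of the idea discussed above, not the real code.
 */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE        8192
#define CHECKSUM_RETRIES 3      /* initial read plus two re-reads */

/* Hypothetical helpers assumed to exist for this sketch. */
extern bool read_page(int fd, uint32_t blkno, char *buf);
extern uint16_t compute_checksum(const char *page, uint32_t blkno);
extern uint16_t stored_checksum(const char *page);

static bool
verify_page_with_retries(int fd, uint32_t blkno)
{
    char page[PAGE_SIZE];

    for (int attempt = 0; attempt < CHECKSUM_RETRIES; attempt++)
    {
        if (!read_page(fd, blkno, page))
            return false;       /* I/O error, reported separately */

        /*
         * If the checksum matches, the page is good.  If not, the page
         * may simply have been caught mid-write by a concurrent backend,
         * so re-read and re-verify before reporting corruption.
         */
        if (compute_checksum(page, blkno) == stored_checksum(page))
            return true;
    }

    /* Still failing after all retries: a real checksum failure. */
    return false;
}

With CHECKSUM_RETRIES set to 3, this gives two re-reads after the initial
attempt, along the lines suggested above.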
--
Michael