On Thu, Oct 22, 2020 at 02:27:34PM +0800, Julien Rouhaud wrote:
> I'm a bit worried about this approach, as if I understand correctly
> this can lead to false positive reports.  I've certainly seen systems
> with IO stalled for more than 500ms, so while this is not frequent
> this could still happen.
The possibility of false positives is not a new problem with this
feature as currently shaped.  On HEAD, this code actually just does one
retry, without waiting at all for the operation potentially happening
in parallel to finish, so that's actually worse.  And that's assuming
that the pd_lsn of the page did not get touched by the corruption, as
we would then simply miss a broken page.  So, with a non-locking
approach, we limit ourselves to tweaking the number of retries and some
sleeps :(  I am not sure that increasing the 100ms sleep is a good
thing on not-so-slow disks, but we could increase the number of
retries.  The patch makes that easier to change, at least.

FWIW, I don't like that this code, with a real risk of false positives,
got merged to begin with, and I think that other people share the same
opinion, but it is not like we can just remove it from an
already-released branch either.  I am also not sure that we have done
such a thing for stable branches in the past.  If we were to do it, we
could just make the operation a no-op and keep some traces of the
grammar for compatibility.

> About the patch:
>
> +             * doing this check, causing a false positive.  If that
> +             * happens, a page is retried once, with an error reported if
> +             * the second attempt also fails.
>
> [...]
>
> +        /* The verification of a page has failed, retry once */
> +        if (block_attempts < PAGE_RETRY_THRESHOLD)
> +        {

Oops, the comments still say that a page is retried only once, while
the code now allows several attempts.  Thanks, I got that changed on my
branch.
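For archive readers, here is a minimal standalone C sketch of the kind
of non-locking retry loop discussed above.  Everything in it is an
illustrative assumption, not the backend code: the names (read_block,
toy_checksum, verify_block_with_retries, PAGE_RETRY_THRESHOLD,
RETRY_SLEEP_MS) are made up, and a toy XOR checksum stands in for
pg_checksum_page() so the loop can actually be compiled and run.

/*
 * Illustrative sketch only: a "page" is an 8kB block of a plain file,
 * and its "checksum" is the XOR of all bytes except the first, stored
 * in the first byte.  A mismatch may be transient (a concurrent writer
 * was mid-write when the block was read), so the block is re-read and
 * re-verified a few times, with a short sleep in between, before being
 * reported as corrupted.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ               8192
#define PAGE_RETRY_THRESHOLD 5     /* assumed: max verification attempts */
#define RETRY_SLEEP_MS       100   /* assumed: sleep between attempts */

/* Toy checksum: XOR of all bytes except the stored checksum byte. */
static unsigned char
toy_checksum(const unsigned char *buf)
{
	unsigned char c = 0;

	for (size_t i = 1; i < BLCKSZ; i++)
		c ^= buf[i];
	return c;
}

/* Re-read one block from the file; false on seek error or short read. */
static bool
read_block(FILE *fp, unsigned blkno, unsigned char *buf)
{
	if (fseek(fp, (long) blkno * BLCKSZ, SEEK_SET) != 0)
		return false;
	return fread(buf, 1, BLCKSZ, fp) == BLCKSZ;
}

/*
 * Verify one block, retrying on checksum failure.  Without locking the
 * buffer this can never be fully reliable: an I/O stall longer than all
 * the retries combined still yields a false positive, which is exactly
 * the limitation discussed in this thread.
 */
static bool
verify_block_with_retries(FILE *fp, unsigned blkno)
{
	unsigned char buf[BLCKSZ];

	for (int attempt = 0; attempt < PAGE_RETRY_THRESHOLD; attempt++)
	{
		if (!read_block(fp, blkno, buf))
			return false;               /* I/O error */
		if (toy_checksum(buf) == buf[0])
			return true;                /* checksum matches */
		usleep(RETRY_SLEEP_MS * 1000);  /* wait, then retry */
	}
	return false;                       /* report as corrupted */
}

int
main(int argc, char **argv)
{
	FILE       *fp;
	unsigned	blkno;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s file blkno\n", argv[0]);
		return 1;
	}
	fp = fopen(argv[1], "rb");
	if (fp == NULL)
	{
		perror("fopen");
		return 1;
	}
	blkno = (unsigned) strtoul(argv[2], NULL, 10);
	if (verify_block_with_retries(fp, blkno))
		printf("block %u: verified\n", blkno);
	else
		printf("block %u: FAILED after retries\n", blkno);
	fclose(fp);
	return 0;
}

Even in this sketch, a write stalled for longer than
PAGE_RETRY_THRESHOLD * RETRY_SLEEP_MS would still be reported as
corruption, so tuning those two knobs only makes false positives less
likely, never impossible.

--
Michael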