On Thu, Oct 22, 2020 at 02:27:34PM +0800, Julien Rouhaud wrote:
> I'm a bit worried about this approach, as if I understand correctly
> this can lead to false positive reports.  I've certainly seen systems
> with IO stalled for more than 500ms, so while this is not frequent
> this could still happen.
The possibility of false positives is not a new problem with this
feature as currently shaped.  On HEAD, this code actually just does one
retry, without waiting at all for the operation potentially happening
in parallel to finish, so that's actually worse.  And that's assuming
that the pd_lsn of the page did not get touched by the corruption, as
we would then simply miss a broken page.  So, with a non-locking
approach, we limit ourselves to tweaking the number of retries and some
sleeps :(  I am not sure that increasing the 100ms sleep is a good
thing on not-so-slow disks, but we could increase the number of
retries.  The patch makes that easier to change, at least.

FWIW, I don't like that this code, with a real risk of false positives,
got merged to begin with, and I think that other people share the same
opinion, but it is not like we can just remove it from an
already-released branch either.  I am also not sure that we have done
such a thing for stable branches in the past.  If we were to do it, we
could just make the operation a no-op and keep some traces of the
grammar for compatibility.

> About the patch:
>
> +             * doing this check, causing a false positive.  If that
> +             * happens, a page is retried once, with an error reported if
> +             * the second attempt also fails.
>
> [...]
>
> +        /* The verification of a page has failed, retry once */
> +        if (block_attempts < PAGE_RETRY_THRESHOLD)
> +        {

Oops, the comments still say that a page is retried only once, while
the code now allows several attempts.  Thanks, I got that changed on my
branch.
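For archive readers, here is a minimal standalone C sketch of the kind
of non-locking retry loop discussed above.  Everything in it is an
illustrative assumption, not the backend code: the names (read_block,
toy_checksum, verify_block_with_retries, PAGE_RETRY_THRESHOLD,
RETRY_SLEEP_MS) are made up, and a toy XOR checksum stands in for
pg_checksum_page() so the loop can actually be compiled and run.

/*
 * Illustrative sketch only: a "page" is an 8kB block of a plain file,
 * and its "checksum" is the XOR of all bytes except the first, stored
 * in the first byte.  A mismatch may be transient (a concurrent writer
 * was mid-write when the block was read), so the block is re-read and
 * re-verified a few times, with a short sleep in between, before being
 * reported as corrupted.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ               8192
#define PAGE_RETRY_THRESHOLD 5     /* assumed: max verification attempts */
#define RETRY_SLEEP_MS       100   /* assumed: sleep between attempts */

/* Toy checksum: XOR of all bytes except the stored checksum byte. */
static unsigned char
toy_checksum(const unsigned char *buf)
{
	unsigned char c = 0;

	for (size_t i = 1; i < BLCKSZ; i++)
		c ^= buf[i];
	return c;
}

/* Re-read one block from the file; false on seek error or short read. */
static bool
read_block(FILE *fp, unsigned blkno, unsigned char *buf)
{
	if (fseek(fp, (long) blkno * BLCKSZ, SEEK_SET) != 0)
		return false;
	return fread(buf, 1, BLCKSZ, fp) == BLCKSZ;
}

/*
 * Verify one block, retrying on checksum failure.  Without locking the
 * buffer this can never be fully reliable: an I/O stall longer than all
 * the retries combined still yields a false positive, which is exactly
 * the limitation discussed in this thread.
 */
static bool
verify_block_with_retries(FILE *fp, unsigned blkno)
{
	unsigned char buf[BLCKSZ];

	for (int attempt = 0; attempt < PAGE_RETRY_THRESHOLD; attempt++)
	{
		if (!read_block(fp, blkno, buf))
			return false;               /* I/O error */
		if (toy_checksum(buf) == buf[0])
			return true;                /* checksum matches */
		usleep(RETRY_SLEEP_MS * 1000);  /* wait, then retry */
	}
	return false;                       /* report as corrupted */
}

int
main(int argc, char **argv)
{
	FILE       *fp;
	unsigned	blkno;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s file blkno\n", argv[0]);
		return 1;
	}
	fp = fopen(argv[1], "rb");
	if (fp == NULL)
	{
		perror("fopen");
		return 1;
	}
	blkno = (unsigned) strtoul(argv[2], NULL, 10);
	if (verify_block_with_retries(fp, blkno))
		printf("block %u: verified\n", blkno);
	else
		printf("block %u: FAILED after retries\n", blkno);
	fclose(fp);
	return 0;
}

Even in this sketch, a write stalled for longer than
PAGE_RETRY_THRESHOLD * RETRY_SLEEP_MS would still be reported as
corruption, so tuning those two knobs only makes false positives less
likely, never impossible.

--
Michael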