On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck <michael.ba...@credativ.de>
> wrote:
> > On Friday, 2019-03-01 at 18:03 -0500, Robert Haas wrote:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > > <michael.ba...@credativ.de> wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out of 5 times, the page is read fine on retry and we
> > march on. Otherwise, we now just issue a warning and skip the file (or
> > so was the idea, see below), do you think that is not acceptable?
>
> Yeah. Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster. They're going to see failures every day
> or three and go ballistic.
+1

> I suspect that better retry logic might help here. I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems. But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.

To me the right way seems to be to IO lock the page via PG after such a
failure and then retry, which should be relatively easy to do for the
basebackup case, but obviously harder for the pg_verify_checksums case.

Greetings,

Andres Freund
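
[Editor's note: for illustration only, a rough frontend-side sketch of the
retry idea Robert describes (re-read the block several times, sleeping
between attempts, and only report a checksum failure if it persists).
verify_block_with_retry and the retry constants are made-up names for this
sketch, not anything in pg_verify_checksums; it assumes the usual frontend
includes and a block-aligned offset supplied by the caller.]

    #include <unistd.h>

    #include "postgres_fe.h"
    #include "storage/bufpage.h"
    #include "storage/checksum.h"
    #include "storage/checksum_impl.h"

    #define NUM_ATTEMPTS    10      /* illustrative: retry up to 10 times */
    #define RETRY_DELAY_SEC 1       /* illustrative: 1 second between attempts */

    /*
     * Re-read and re-verify one block, retrying on checksum mismatch in case
     * the first read caught a torn page from a concurrent write.
     */
    static bool
    verify_block_with_retry(int fd, BlockNumber blkno, off_t offset)
    {
        char        buf[BLCKSZ];

        for (int attempt = 0; attempt < NUM_ATTEMPTS; attempt++)
        {
            PageHeader  header = (PageHeader) buf;
            ssize_t     r = pread(fd, buf, BLCKSZ, offset);

            if (r != BLCKSZ)
                return false;       /* short read: let the caller report it */

            /* New (all-zero) pages carry no checksum to verify. */
            if (PageIsNew(buf))
                return true;

            if (pg_checksum_page(buf, blkno) == header->pd_checksum)
                return true;        /* consistent read, checksum matches */

            /* Possibly a torn page from a concurrent write; wait and retry. */
            sleep(RETRY_DELAY_SEC);
        }

        return false;               /* persistent mismatch: report it */
    }

[Editor's note: such a loop only shrinks the false-positive window; it cannot
rule out an unlucky re-read. Andres's point above is that holding the page's
IO lock on the server side before re-checking would make the retry
authoritative rather than probabilistic, which is feasible for basebackup but
not for an external tool like pg_verify_checksums.]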