Hi,

On 09/17/2018 06:42 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 09/17/2018 04:46 PM, Stephen Frost wrote:
>>> * Michael Banck (michael.ba...@credativ.de) wrote:
>>>> On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
>>>>> Obviously, pg_verify_checksums can't do that easily because it's
>>>>> supposed to run from outside the database instance.
>>>>
>>>> It reads pg_control anyway, so couldn't we just take
>>>> ControlFile->checkPoint?
>>>>
>>>> Other than that, basebackup.c seems to only look at pages which haven't
>>>> been changed since the backup starting checkpoint (see above if
>>>> statement). That's reasonable for backups, but is it just as reasonable
>>>> for online verification?
>>>
>>> Right, basebackup doesn't need to look at other pages.
>>>
>>>>> The pg_verify_checksum apparently ignores this skip logic, because on
>>>>> the retry it simply re-reads the page again, verifies the checksum and
>>>>> reports an error. Which is broken, because the newly read page might be
>>>>> torn again due to a concurrent write.
>>>>
>>>> Well, ok.
>>>
>>> The newly read page will have an updated LSN though then on the re-read,
>>> in which case basebackup can know that what happened was a rewrite of
>>> the page and it no longer has to care about the page and can skip it.
>>>
>>> I haven't looked, but if basebackup isn't checking the LSN again for the
>>> newly read page then that'd be broken, but I believe it does (at least,
>>> that's the algorithm we came up with for pgBackRest, and I know David
>>> shared that when the basebackup code was being written).
>>
>> Yes, basebackup does check the LSN on re-read, and skips the page if it
>> changed on re-read (because it eliminates the consistency guarantees
>> provided by the checkpoint).
>
> Ok, good, though I'm not sure what you mean by 'eliminates the
> consistency guarantees provided by the checkpoint'.  The point is that
> the page will be in the WAL and the WAL will be replayed during the
> restore of the backup.
>
The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN. Without the checkpoint that's not
guaranteed, and simply re-reading the page and rechecking it vs. the
first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
   still fails

>>>> So how about we do check every page, but if one fails on retry, and the
>>>> LSN is newer than the checkpoint, we then skip it? Is that logic sound?
>>>
>>> I thought that's what basebackup did- if it doesn't do that today, then
>>> it really should.
>>
>> The crucial distinction here is that the trick is not in comparing LSNs
>> from the two page reads, but comparing it to the checkpoint LSN. If it's
>> greater, the page may be torn or broken, and there's no way to know
>> which case it is - so basebackup simply skips it.
>
> Sure, because we don't care about it any longer- that page isn't
> interesting because the WAL will replay over it.  IIRC it actually goes
> something like: check the checksum, if it failed then check if the LSN
> is greater than the checkpoint (of the backup start..), if not, then
> re-read, if the LSN is now newer than the checkpoint then skip, if the
> LSN is the same then throw an error.
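The five-step torn-read scenario above can be modeled in a few lines of C.
This is purely a toy sketch - ToyPage, toy_checksum and torn_read are
invented for illustration and bear no resemblance to PostgreSQL's actual
page layout or checksum algorithm. It shows sector 0 (which carries the LSN
and checksum) coming back new while the remaining sectors are still old, so
the checksum fails on both reads even though the LSN never changes between
them:

```c
#include <assert.h>
#include <stdint.h>

/* Toy "page" of four sectors; sector 0 holds the LSN and checksum,
 * sectors 1..3 hold one word of data each (data[0] is unused). */
#define NSECTORS 4

typedef struct
{
    uint64_t    lsn;            /* page LSN, lives in sector 0 */
    uint32_t    checksum;       /* also lives in sector 0 */
    uint32_t    data[NSECTORS]; /* one word per sector */
} ToyPage;

/* Made-up checksum covering the LSN and the data sectors. */
static uint32_t
toy_checksum(const ToyPage *p)
{
    uint32_t    c = (uint32_t) p->lsn;

    for (int i = 1; i < NSECTORS; i++)
        c ^= p->data[i] * 31;
    return c;
}

/* Simulate a reader catching the writer mid-write: sector 0 (LSN and
 * checksum) is already new on disk, the remaining sectors are still
 * old.  A second read while the write is still in flight returns the
 * exact same mix, so the LSN comparison between the two reads passes
 * even though the page is torn. */
static ToyPage
torn_read(const ToyPage *oldp, const ToyPage *newp)
{
    ToyPage     r = *oldp;

    r.lsn = newp->lsn;
    r.checksum = newp->checksum;
    /* sectors 1..3 still contain the old data */
    return r;
}
```

The only reliable signal in this situation is that the torn page's LSN is
newer than the checkpoint LSN, which is why comparing against the
checkpoint (rather than against the previous read) is what lets the caller
decide to skip.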
>

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

>>>> In any case, if we decide we really should skip the page if it is newer
>>>> than the checkpoint, I think it makes sense to track those skipped pages
>>>> and print their number out at the end, if there are any.
>>>
>>> Not sure what the point of this is.  If we wanted to really do something
>>> to cross-check here, we'd track the pages that were skipped and then
>>> look through the WAL to make sure that they're there.  That's something
>>> we've talked about doing with pgBackRest, but don't currently.
>>
>> I agree simply printing the page numbers seems rather useless. What we
>> could do is remember which pages we skipped and then try checking them
>> after another checkpoint. Or something like that.
>
> I'm still not sure I'm seeing the point of that.  They're still going to
> almost certainly be in the kernel cache.  The reason for checking
> against the WAL would be to detect errors in PG where we aren't putting
> a page into the WAL when it really should be, or something similar,
> which seems like it at least could be useful.
>
> Maybe to put it another way- there's very little point in checking the
> checksum of a page which we know must be re-written during recovery to
> get to a consistent point.  I don't think it hurts in the general case,
> but I wouldn't write a lot of code which then needs to be tested to
> handle it.  I also don't think that we really need to make
> pg_verify_checksum spend lots of extra cycles trying to verify that
> *every* page had its checksum validated when we know that lots of pages
> are going to be in memory marked dirty and our checking of them will be
> ultimately pointless as they'll either be written out by the
> checkpointer or some other process, or we'll replay them from the WAL if
> we crash.
>

Yeah, I agree.
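For illustration, the decision rule discussed above - skip pages whose LSN
is past the checkpoint, retry once on a checksum failure, and only report
an error when a stable (pre-checkpoint) page still fails on re-read -
could be sketched roughly like this. The helper name, the enum and the
signature are all invented for this sketch; the real logic lives in
basebackup.c:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef enum
{
    PAGE_OK,        /* checksum verified */
    PAGE_SKIP,      /* LSN past the checkpoint: WAL replay covers it */
    PAGE_RETRY,     /* possibly torn: re-read and check again */
    PAGE_ERROR      /* stable LSN, checksum still wrong: report it */
} PageVerdict;

/* Hypothetical helper mirroring the rule above: the checksum is only
 * trusted for pages whose LSN precedes the checkpoint LSN, since only
 * those are guaranteed to be fully written out on disk. */
static PageVerdict
check_page(XLogRecPtr page_lsn, XLogRecPtr checkpoint_lsn,
           bool checksum_ok, bool is_reread)
{
    if (page_lsn > checkpoint_lsn)
        return PAGE_SKIP;       /* rewritten since the checkpoint */
    if (checksum_ok)
        return PAGE_OK;
    return is_reread ? PAGE_ERROR : PAGE_RETRY;
}
```

Note that the LSN test comes first, matching the correction above: the
checksum is not even computed for pages newer than the checkpoint, and a
re-read whose LSN has moved past the checkpoint turns into a skip rather
than an error.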
>>>>> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
>>>>> to protect against anything, really.
>>>>
>>>> Well, I've noticed that without it I get sporadic checksum failures on
>>>> reread, so I've added it to make them go away. It was certainly a
>>>> phenomenological decision that I am happy to trade for a better one.
>>>
>>> That then sounds like we really aren't re-checking the LSN, and we
>>> really should be, to avoid getting these sporadic checksum failures on
>>> reread..
>>
>> Again, it's not enough to check the LSN against the preceding read. We
>> need a checkpoint LSN or something like that.
>
> I actually tend to disagree with you that, for this purpose, it's
> actually necessary to check against the checkpoint LSN- if the LSN
> changed and everything is operating correctly then the new LSN must be
> more recent than the last checkpoint location or things are broken
> badly.
>

I don't follow. Are you suggesting we don't need the checkpoint LSN? I'm
pretty sure that's not the case. The thing is - the LSN may not change
between the two reads, but that's not a guarantee the page was not torn.
The example I posted earlier in this message illustrates that.

> Now, that said, I do think it's a good *idea* to check against the
> checkpoint LSN (presuming this is for online checking of checksums- for
> basebackup, we could just check against the backup-start LSN as anything
> after that point will be rewritten by WAL anyway).  The reason that I
> think it's a good idea to check against the checkpoint LSN is that we'd
> want to throw a big warning if the kernel is just feeding us random
> garbage on reads and only finding a difference between two reads isn't
> really doing any kind of validation, whereas checking against the
> checkpoint-LSN would at least give us some idea that the value being
> read isn't completely ridiculous.
>
> When it comes to if the pg_sleep() is necessary or not, I have to admit
> to being unsure about that..  I could see how it might be but it seems a
> bit surprising- I'd probably want to see exactly what the page was at
> the time of the failure and at the time of the second (no-sleep) re-read
> and then after a delay and convince myself that it was just an unlucky
> case of being scheduled in twice to read that page before the process
> writing it out got a chance to finish the write.
>

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services