On Thu, Nov 26, 2020 at 8:42 AM Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> > We are not just looking at one LSN value. Here are the steps we are
> > proposing (I'll skip checks for zero pages here):
> >
> > 1) Test the page checksum. If it passes the page is OK.
> > 2) If the checksum does not pass then record the page offset and LSN and
> > continue.
>
> But here the checksum is broken, so while the offset is something we
> can rely on how do you make sure that the LSN is fine? A broken
> checksum could perfectly mean that the LSN is actually *not* fine if
> the page header got corrupted.
>
> > 3) After the file is copied, reopen and reread the file, seeking to offsets
> > where possible invalid pages were recorded in the first pass.
> > a) If the page is now valid then it is OK.
> > b) If the page is not valid but the LSN has increased from the LSN
>
> Per se the previous point about the LSN value that we cannot rely on.
We cannot rely on the LSN itself. But it is a lot more likely that we can
rely on the LSN changing, and on the LSN changing in a "correct way". That
is, if the LSN *decreases* we know the page is corrupt. If the LSN *doesn't
change* we know it's corrupt. But if the LSN *increases* AND the new page
now has a correct checksum, it is very likely to be correct. You could
perhaps even put a cap on it, saying "the LSN increased, but by less than
<n>", where <n> is a sufficiently high number that it's entirely
unreasonable to advance that far between the two reads of the block. But it
has to have a very high margin in that case.

> > A malicious attacker could easily trick these checks, but as Stephen pointed
> > out elsewhere they would likely make the checksums valid which would escape
> > detection anyway.
> >
> > We believe that the chances of random storage corruption passing all these
> > checks is incredibly small, but eventually we'll also check against the WAL
> > to be completely sure.
>
> The lack of check for any concurrent I/O on the follow-up retries is
> disturbing. How do you guarantee that on the second retry what you
> have is a torn page and not something corrupted? Init forks for
> example are made of up to 2 blocks, so the window would get short for
> at least those. There are many instances with tables that have few
> pages as well.

Here I was more worried that the window might get *too long* if tables are
large :)

The risk is certainly that you get a torn page *again* on the second read.
It could be the same torn page (if it hasn't changed), but you can detect
that by the fact that it hasn't actually changed, and possibly do a short
delay before trying again if it gets that far. That could happen if the
process is too quick. It could also be that you are unlucky and hit a *new*
write, so that both reads happened to land exactly while the page was being
written. I'm not sure the chance of that happening is even big enough that
we have to care about it, though?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
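
For concreteness, here is a rough sketch of the recheck logic under
discussion. It is Python rather than the C that would actually live in the
backup code, and the names verify_page_checksum(), recheck_page(),
LSN_ADVANCE_CAP, and the verdict strings are illustrative stand-ins, not
anything from the patch; verify_page_checksum() plays the role of
PostgreSQL's pg_checksum_page(). The LSN parsing assumes the standard page
header layout on a little-endian platform.

    import struct

    BLCKSZ = 8192                       # standard PostgreSQL block size
    LSN_ADVANCE_CAP = 64 * 1024 * 1024  # hypothetical "<n>": 64MB of WAL advance

    def page_lsn(page: bytes) -> int:
        # pd_lsn is stored as two uint32s: xlogid (high half), xrecoff (low half)
        xlogid, xrecoff = struct.unpack_from("<II", page, 0)
        return (xlogid << 32) | xrecoff

    def verify_page_checksum(page: bytes, blkno: int) -> bool:
        # Placeholder for the real checksum test (pg_checksum_page() in PostgreSQL).
        raise NotImplementedError

    def recheck_page(path: str, offset: int, first_lsn: int) -> str:
        """Second-pass verdict for a page whose checksum failed on the first
        read, given the offset and LSN recorded during that first read."""
        with open(path, "rb") as f:
            f.seek(offset)
            page = f.read(BLCKSZ)

        if verify_page_checksum(page, offset // BLCKSZ):
            return "ok"                # torn read the first time; page is fine now

        new_lsn = page_lsn(page)
        if new_lsn == first_lsn:
            return "unchanged"         # same page again: delay and retry, else report corrupt
        if new_lsn < first_lsn:
            return "corrupt"           # LSN went backwards: not a plausible concurrent write
        if new_lsn - first_lsn <= LSN_ADVANCE_CAP:
            return "in-flight-write"   # LSN advanced plausibly: page being rewritten under us
        return "corrupt"               # LSN advanced implausibly far: header itself suspect

Michael's caveat still applies to this sketch: if the header was garbage on
the first read, first_lsn may be garbage too, which is why the eventual
cross-check against the WAL is the real backstop.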