On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> We are not just looking at one LSN value. Here are the steps we are
> proposing (I'll skip checks for zero pages here):
>
> 1) Test the page checksum. If it passes the page is OK.
> 2) If the checksum does not pass then record the page offset and LSN and
> continue.
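
For illustration, here is a minimal standalone sketch of that first pass.
The page layout, checksum, and helper names are toy stand-ins, not
PostgreSQL's actual structures or the patch under discussion:

#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192                 /* PostgreSQL's default block size */

/* Toy page header with only the fields the sketch needs; the real
 * PageHeaderData layout differs. */
typedef struct ToyPageHeader
{
    uint64_t    pd_lsn;             /* LSN of the last change to the page */
    uint16_t    pd_checksum;        /* stored checksum */
} ToyPageHeader;

typedef struct SuspectPage
{
    size_t      offset;             /* byte offset of the page in the file */
    uint64_t    lsn;                /* LSN read from the *unverified* header */
} SuspectPage;

/* Stand-in for pg_checksum_page(); not the real algorithm. */
static uint16_t
toy_checksum(const unsigned char *page)
{
    uint32_t    sum = 0;

    for (size_t i = 0; i < BLCKSZ; i++)
    {
        /* skip the stored checksum itself (bytes 8 and 9 here) */
        if (i == offsetof(ToyPageHeader, pd_checksum) ||
            i == offsetof(ToyPageHeader, pd_checksum) + 1)
            continue;
        sum = sum * 31 + page[i];
    }
    return (uint16_t) sum;
}

/* First pass, steps 1 and 2: verify each page's checksum and, on
 * failure, record the offset and header LSN for the later retry.
 * Returns the number of suspect pages recorded. */
static size_t
first_pass(const unsigned char *buf, size_t nblocks,
           SuspectPage *suspects, size_t max_suspects)
{
    size_t      nsuspect = 0;

    for (size_t blkno = 0; blkno < nblocks; blkno++)
    {
        const unsigned char *page = buf + blkno * BLCKSZ;
        const ToyPageHeader *hdr = (const ToyPageHeader *) page;

        if (toy_checksum(page) == hdr->pd_checksum)
            continue;               /* step 1: checksum passes, page OK */

        /* Step 2: record offset and LSN and continue.  Note that this
         * LSN comes from a header that just failed verification. */
        if (nsuspect < max_suspects)
        {
            suspects[nsuspect].offset = blkno * BLCKSZ;
            suspects[nsuspect].lsn = hdr->pd_lsn;
            nsuspect++;
        }
    }
    return nsuspect;
}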

But here the checksum is broken, so while the offset is something we can
rely on, how do you make sure that the LSN is fine? A broken checksum
could perfectly well mean that the LSN is actually *not* fine if the page
header got corrupted.

> 3) After the file is copied, reopen and reread the file, seeking to offsets
> where possible invalid pages were recorded in the first pass.
> a) If the page is now valid then it is OK.
> b) If the page is not valid but the LSN has increased from the LSN

As per the previous point: this relies on an LSN value that we cannot
trust.

> A malicious attacker could easily trick these checks, but as Stephen pointed
> out elsewhere they would likely make the checksums valid which would escape
> detection anyway.
>
> We believe that the chances of random storage corruption passing all these
> checks is incredibly small, but eventually we'll also check against the WAL
> to be completely sure.

The lack of a check for any concurrent I/O on the follow-up retries is
disturbing. How do you guarantee that on the second retry what you have
is a torn page and not something corrupted? Init forks, for example, are
made of up to 2 blocks, so the window would get short for at least those.
There are many instances with tables that have few pages as well.
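
To make the concern concrete, here is the matching sketch of the retry
pass, steps 3a and 3b, reusing the toy types from the sketch above; the
comments mark exactly the assumptions being questioned:

/* Second pass, steps 3a and 3b: reread a suspect page and decide
 * whether the first read was merely torn by a concurrent write.
 * Returns 1 if the page is considered OK, 0 if reported corrupted. */
static int
recheck_suspect(const unsigned char *page, const SuspectPage *suspect)
{
    const ToyPageHeader *hdr = (const ToyPageHeader *) page;

    /* Step 3a: the page now verifies, so the first read was torn. */
    if (toy_checksum(page) == hdr->pd_checksum)
        return 1;

    /* Step 3b: still invalid, but the LSN has increased, so a
     * concurrent write is assumed to be in progress.
     *
     * Questioned assumption 1: suspect->lsn was taken from a header
     * that failed its checksum, so the comparison baseline may be
     * garbage.
     *
     * Questioned assumption 2: nothing here interlocks with concurrent
     * I/O, so a higher LSN does not prove a torn page rather than
     * corruption; for relations of only a block or two (init forks
     * have at most 2 blocks) the window for a genuine concurrent
     * write is very short. */
    if (hdr->pd_lsn > suspect->lsn)
        return 1;

    return 0;
}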

--
Michael