On Thu, May 7, 2020 at 10:25 AM Stephen Frost <sfr...@snowman.net> wrote:
> Greetings, > > * Ashwin Agrawal (aagra...@pivotal.io) wrote: > > On Wed, May 6, 2020 at 3:02 PM Robert Haas <robertmh...@gmail.com> > wrote: > > > On Wed, May 6, 2020 at 5:48 PM Ashwin Agrawal <aagra...@pivotal.io> > wrote: > > > > If pg_basebackup is not able to read BLCKSZ content from file, then > it > > > > just emits a warning "could not verify checksum in file "____" block > > > > X: read buffer size X and page size 8192 differ" currently but misses > > > > to error with "checksum error occurred". Only if it can read 8192 and > > > > checksum mismatch happens will it error in the end. > > > > > > I don't think it's a good idea to conflate "hey, we can't checksum > > > this because the size is strange" with "hey, the checksum didn't > > > match". Suppose the a file has 1000 full blocks and a partial block. > > > All 1000 blocks have good checksums. With your change, ISTM that we'd > > > first emit a warning saying that the checksum couldn't be verified, > > > and then we'd emit a second warning saying that there was 1 checksum > > > verification failure, which would also be reported to the stats > > > system. I don't think that's what we want. > > > > I feel the intent of reporting "total checksum verification failure" is > to > > report corruption. Which way is the secondary piece of the puzzle. Not > > being able to read checksum itself to verify is also corruption and is > > checksum verification failure I think. WARNINGs will provide fine grained > > clarity on what type of checksum verification failure it is, so I am not > > sure we really need fine grained clarity in "total numbers" to > > differentiate these two types. > > Are we absolutely sure that there's no way for a partial block to end up > being seen by pg_basebackup, which is just doing routine filesystem > read() calls, during normal operation though..? Across all platforms? > Okay, that's a good point, I didn't think about it. This comment to skip verifying checksum, I suppose convinces, can't be sure and hence can't report partial blocks as corruption. /* * Only check pages which have not been modified since the * start of the base backup. Otherwise, they might have been * written only halfway and the checksum would not be valid. * However, replaying WAL would reinstate the correct page in * this case. We also skip completely new pages, since they * don't have a checksum yet. */ Might be nice to have a similar comment for the partial block case to document why we can't report it as corruption. Thanks.