Hi Michael, On 3/23/18 5:36 AM, Michael Banck wrote: > Am Donnerstag, den 22.03.2018, 12:22 -0400 schrieb David Steele: >> >> + if (phdr->pd_checksum != checksum) >> >> I've attached a patch that adds basic retry functionality. It's not >> terrible efficient since it rereads the entire buffer for any block >> error. A better way is to keep a bitmap for each block in the buffer, >> then on retry compare bitmaps. If the block is still bad, report it. >> If the block was corrected moved on. If a block was good before but is >> bad on retry it can be ignored. > > I have to admit I find it a bit convoluted and non-obvious on first > reading, but I'll try to check it out some more.
Yeah, I think I was influenced too much by how pgBackRest does things, which doesn't work as well here. Attached is a simpler version. > I think it would be very useful if we could come up with a testcase > showing this problem, but I guess this will be quite hard to hit > reproducibly, right? This was brought up by Robert in [1] when discussing validating checksums during backup. I don't know of any way to reproduce this issue but it seems perfectly possible, if highly unlikely. >> + ereport(WARNING, >> + (errmsg("checksum verification failed in file " >> >> I'm worried about how verbose this warning could be since there are >> 131,072 blocks per segment. It's unlikely to have that many block >> errors, but users do sometimes put files in PGDATA which look like they >> should be validated. Since these warnings all go to the server log it >> could get pretty bad. > > We only verify checksums of files in tablespaces, and I don't think > dropping random files in those is supported in any way, as opposed to > maybe the top-level PGDATA directory. So I would say that this is not a > real concern. Perhaps, but a very corrupt file is still going to spew lots of warnings into the server log. >> We should either stop warning after the first failure, or aggregate the >> failures for a file into a single message. > > I agree that major corruption could make the whole output blow up but I > would prefer to keep this feature simple for now, which implies possibly > printing out a lot of WARNING or maybe just stopping after the first > one (or first few, dunno). In my experience actual block errors are relatively rare, so there aren't likely to be more than a few in a file. More common are overwritten or transposed files, rogue files, etc. These produce a lot of output. Maybe stop after five? Regards, -- -David da...@pgmasters.net [1] https://www.postgresql.org/message-id/CA%2BTgmobHd%2B-yVJHofSWg%3Dg%2B%3DA3EiCN2wsAiEyj7dj1hhevNq9Q%40mail.gmail.com
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 9f735a2c07..fe7285f7a2 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1258,6 +1258,7 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf int i; pgoff_t len = 0; char page[BLCKSZ]; + bool block_retry = false; size_t pad; PageHeader phdr; int segmentno = 0; @@ -1341,6 +1342,50 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf phdr = (PageHeader) page; if (phdr->pd_checksum != checksum) { + /* + * Retry the block on the first failure. It's possible + * that we read the first 4K page of the block just + * before postgres updated the entire block so it ends + * up looking torn to us. We only need to retry once + * because the LSN should be updated to something we can + * ignore on the next pass. If the error happens again + * then it is a true validation failure. + */ + if (block_retry == false) + { + /* Reread the failed block */ + if (fseek(fp, -(cnt - BLCKSZ * i), SEEK_CUR) == -1) + { + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fseek in file \"%s\": %m", + readfilename))); + } + + if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ) + { + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not reread block %d of file \"%s\": %m", + blkno, readfilename))); + } + + if (fseek(fp, cnt - BLCKSZ * i - BLCKSZ, SEEK_CUR) == -1) + { + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fseek in file \"%s\": %m", + readfilename))); + } + + /* Set flag so we know a retry was attempted */ + block_retry = true; + + /* Reset loop to validate the block again */ + i--; + continue; + } + ereport(WARNING, (errmsg("checksum verification failed in file " "\"%s\", block %d: calculated %X but " @@ -1350,6 +1395,7 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf checksum_failure = true; } } + block_retry = false; blkno++; } }