On Tue, Apr 10, 2018 at 11:45 PM, David Steele <da...@pgmasters.net> wrote:
> On 4/10/18 5:24 PM, Tomas Vondra wrote: > >> >> I think there's a bug in sendFile(). We do check checksums on all pages >> that pass this LSN check: >> >> /* >> * Only check pages which have not been modified since the >> * start of the base backup. Otherwise, they might have been >> * written only halfway and the checksum would not be valid. >> * However, replaying WAL would reinstate the correct page in >> * this case. >> */ >> if (PageGetLSN(page) < startptr) >> { >> ... >> } >> >> Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0 >> too, and we'll try to verify the checksum - but we actually do not set >> checksums on empty pages. >> >> So I think it should be something like this: >> >> if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr)) >> { >> ... >> } >> >> It might be worth verifying that the page is actually all-zeroes (and >> not just with corrupted pd_upper value. Not sure it's worth it. >> >> I've found this by fairly trivial stress testing - running pgbench and >> pg_basebackup in a loop. It was failing pretty reliably (~75% of runs). >> With the proposed change I see no further failures. >> > > Good catch, Tomas! > > I should have seen this since I had the same issue when I implemented this > feature in pgBackRest. > > Anyway, I agree that your fix looks correct. > I've applied this fix, along with the same thing in pg_verify_checksums. -- Magnus Hagander Me: https://www.hagander.net/ <http://www.hagander.net/> Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/>