On Wed, Nov 04, 2020 at 05:41:39PM +0100, Michael Banck wrote: > Am Mittwoch, den 04.11.2020, 17:48 +0900 schrieb Michael Paquier: >> So, I have done much more testing of this patch using an instance with >> a small shared buffer pool and pgbench running in parallel for having >> a large eviction rate, and I cannot convince myself to do that. My >> laptop got easily constrained on I/O, and within a total of 2000 base >> backups or so, I have seen some 5 backup failures with a correct >> detection logic. > > I don't quite undestand what you mean here: how do the base backups > fail,
As of basebackup.c, on HEAD: if (total_checksum_failures) { if (total_checksum_failures > 1) ereport(WARNING, (errmsg_plural("%lld total checksum verification failure", "%lld total checksum verification failures", total_checksum_failures, total_checksum_failures))); ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED), errmsg("checksum verification failure during base backup"))); } This means that when at least one page verification fails, pg_basebackup would fail. > and what exactly is "correct detection logic"? I was referring to the patch I sent on this thread that fixes the detection of a corruption for the zero-only case and where pd_lsn and/or pg_upper are trashed by a corruption of the page header. Both cases allow a base backup to complete on HEAD, while sending pages that could be corrupted, which is wrong. Once you make the page verification rely only on pd_checksum, as the patch does because the checksum is the only source of truth in the page header, corrupted pages are correctly detected, causing pg_basebackup to complain as it should. However, it has also the risk to cause pg_basebackup to fail *and* to report as broken pages that are in the process of being written, depending on how slow a disk is able to finish a 8kB write. That's a different kind of wrongness, and users have two more reasons to be pissed. Note that if a page is found as torn we have a consistent page header, meaning that on HEAD the PageIsNew() and PageGetLSN() would pass, but the checksum verification would fail as the contents at the end of the page does not match the checksum. -- Michael
signature.asc
Description: PGP signature