On Mon, Nov 08, 2021 at 02:59:46PM +0900, Kyotaro Horiguchi wrote: > While checking the behavior for the case of missing-contrecord, I > noticed that emode_for_corrupt_record() doesn't work as expected since > readSource is reset to XLOG_FROM_ANY after a read failure. We could > remember the last failed source but pg_wal should have been visited if > page read error happened so I changed the function so that it treats > XLOG_FROM_ANY the same way with XLOG_FROM_PG_WAL.
FWIW, I am not much a fan of assuming that it is fine to use XLOG_FROM_ANY as a condition here. The comments on top of emode_for_corrupt_record() make it rather clear what the expectations are, and this is the default readSource. > (Otherwise we see "LOG: reached end-of-WAL at .." message after > "WARNING: missing contrecord at.." message.) + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); This looks rather scary to me, FWIW, and this could easily be reached if one forgets about EndOfWAL while hacking on xlogreader.c. Unlikely so, still. + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); Isn't there a risk here to break applications checking after error messages stored in the WAL reader after seeing a contrecord? + if (record->xl_tot_len == 0) + { + /* This is strictly not an invalid state, so phrase it as so. */ + report_invalid_record(state, + "record length is 0 at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + state->EndOfWAL = true; + return false; + } This assumes that a value of 0 for xl_tot_len is a synonym of the end of WAL, but cannot we have also a corrupted record in this case in the shape of xl_tot_len being 0? We validate the full record after reading the header, so it seems to me that we should not assume that things are just ending as proposed in this patch. -- Michael
signature.asc
Description: PGP signature