On Sun, Jul 2, 2023 at 1:40 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > I think there's some sort of bug in how dd38ff28ad deals with > contrecords. Consider something as simple as > > pgbench -i -s 100 > > and then doing pg_waldump on the WAL segments, I get this for every > single one: > > pg_waldump: error: error in WAL record at 0/1FFFF98: missing > contrecord at 0/1FFFFE0 > > This only happens since dd38ff28ad, and revert makes it disappear. > > It's possible we still have some issue with contrecords, but IIUC we > fixed those. So unless there's some unknown one (and considering this > seems to happen for *every* WAL segment that's hard to believe), this > seems more like an issue in the error detection.
Yeah. That message is due to this part of dd38ff28ad's change: Also add an explicit error message for missing contrecords. It was a bit strange that we didn't report an error already, and became a latent bug with prefetching, since the internal state that tracks aborted contrecords would not survive retrying, as revealed by 026_overwrite_contrecord.pl with this adjustment. Reporting an error prevents that. We can change 'missing contrecord' back to silent end-of-decoding (as it was in 14) with the attached. Here [1] is some analysis of this error that I wrote last year. The reason for my hesitation in pushing a fix was something like this: 1. For pg_waldump, it's "you told me to decode only this WAL segment, so decoding failed here", which is useless noise 2. For walsender, it's "you told me to shut down, so decoding failed here", which is useless noise 3. For crash recovery, "I ran out of data, so decoding failed here", which seems like a report-worthy condition, I think? When I added that new error I was thinking about that third case. We generally expect to detect the end of WAL replay after a crash with an error ("invalid record length ...: wanted 24, got 0" + several other possibilities), but in this rare case it would be silent. The effect on the first two cases is cosmetic, but certainly annoying. Perhaps I should go ahead and back-patch the attached change, and then we can discuss whether/how we should do a better job of distinguishing "user requested artificial end of decoding" from "unexpectedly ran out of data" for v17? [1] https://www.postgresql.org/message-id/ca+hukg+wkszpdoryeqm8_rk5uqpcqs2hgy92wimgfsk2wvk...@mail.gmail.com
From a6fbff139d27f53cf39521e0e78b734c59fd8211 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 2 Jul 2023 13:07:17 +1200 Subject: [PATCH] Silence "missing contrecord" error. Commit dd38ff28ad added a new error message "missing contrecord" when decoding fails at a contrecord. Unfortunately that caused noisy messages to be logged by pg_waldump at end of segment, and walsender when asked to shut down on a segment boundary. Remove the new error message, so that this condition signals end-of- WAL without an error. It's arguably a reportable condition that should not be silenced while performing crash recovery, but fixing that without introducing noise in the other cases will require more research. Back-patch to 15. Reported-by: Tomas Vondra <tomas.von...@enterprisedb.com> Discussion: https://postgr.es/m/6a1df56e-4656-b3ce-4b7a-a9cb41df8189%40enterprisedb.com --- src/backend/access/transam/xlogreader.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 2e7b1ba8e1..c9f9f6e98f 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -914,15 +914,11 @@ err: state->missingContrecPtr = targetPagePtr; /* - * If we got here without reporting an error, report one now so that - * XLogPrefetcherReadRecord() doesn't bring us back a second time and - * clobber the above state. Otherwise, the existing error takes - * precedence. + * If we got here without reporting an error, make sure an error is + * queued so that XLogPrefetcherReadRecord() doesn't bring us back a + * second time and clobber the above state. */ - if (!state->errormsg_buf[0]) - report_invalid_record(state, - "missing contrecord at %X/%X", - LSN_FORMAT_ARGS(RecPtr)); + state->errormsg_deferred = true; } if (decoded && decoded->oversized) -- 2.40.1