On Mon, Sep 12, 2022 at 10:42 PM Thomas Munro <thomas.mu...@gmail.com> wrote: > On Sat, Sep 10, 2022 at 10:00 AM Thomas Munro <thomas.mu...@gmail.com> wrote: > > On Sat, Sep 10, 2022 at 9:45 AM Tom Lane <t...@sss.pgh.pa.us> wrote: > > > Masahiko Sawada <sawada.m...@gmail.com> writes: > > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > > > is relevant. > > > > > > Noting that the errors have only appeared in the past couple of > > > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > > > (Fix recovery_prefetch with low maintenance_io_concurrency). > > > > Yeah, I also just spotted the coincidence of those failures while > > monitoring the build farm. I'll look into this later today. My > > initial suspicion is that there was pre-existing code here that was > > (incorrectly?) relying on the lack of error reporting in that case. > > But maybe I misunderstood and it was incorrect to report the error for > > some reason that was not robustly covered with tests. > > After I wrote that I saw Sawada-san's message and waited for more > information, and I see there was now a commit. I noticed that > peripatus was already logging the 'missing contrecord' error even when > it didn't fail the test, and still does. I'm still looking into that > (ie whether I need to take that new report_invalid_record() call out > and replace it with errormsg_deferred = true so that XLogReadRecord() > returns NULL with no error message in this case).
I will go ahead and remove this new error message added by adb46615. The message was correlated with the problem on peripatus fixed by 88f48831, but not the cause of it -- but it's also not terribly helpful and might be confusing. It might be reported: (1) in pg_waldump when you hit the end of the segment with a missing contrecord after the end, arguably rightfully, but then perhaps someone might complain that they expect an error from pg_waldump only on the final live segment at the end of the WAL, and (2) in a walsender that is asked to shut down while between reads of pages with a spanning contrecord, reported by logical_read_xlog_page() with a messageless error (presumably peripatus's case?), (3) in crash recovery with wal_recycle off (whereas normally you'd expect to see the page_addr message from a recycled file), maybe more legitimately than the above. The problem I needed to solve can be solved without the message just by setting that flag as mentioned above, so I'll do that to remove the new noise.