On Tue, Nov 15, 2022 at 3:38 PM Andres Freund <and...@anarazel.de> wrote: > On 2022-11-14 17:25:31 -0800, Andres Freund wrote: > Another theory: I dimly remember Thomas mentioning that there's some different > behaviour of xlogreader during shutdown as part of the v15 changes. I don't > quite remember what the scenario leading up to that was. Thomas?
Yeah. So as mentioned in: https://www.postgresql.org/message-id/CA%2BhUKG%2BWKsZpdoryeqM8_rk5uQPCqS2HGY92WiMGFsK2wVkcig%40mail.gmail.com I still have on my list to remove a new "missing contrecord" error that can show up in a couple of different scenarios that aren't exactly error conditions depending on how you think about it, but I haven't done that yet. I am not currently aware of anything bad happening because of those messages, but I could be wrong. > It's certainly interesting that we see stuff like: > > 2022-11-08 00:20:23.255 GMT [2012][walsender] > [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR: could not find > record while sending logically-decoded data: missing contrecord at 0/1D3B710 > 2022-11-08 00:20:23.255 GMT [2012][walsender] > [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT: > START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL > 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"') > ERROR: could not find record while sending logically-decoded data: missing > contrecord at 0/1D3B710 > 2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR: error > while shutting down streaming COPY: ERROR: could not find record while > sending logically-decoded data: missing contrecord at 0/1D3B710 Right, so that might fit the case described in my email above: logical_read_xlog_page() notices that it has been asked to shut down when it is between reads of pages with a spanning contrecord. Before, it would fail silently, so XLogReadRecord() returns NULL without setting *errmsg, but now it complains about a missing contrecord. In the case where it was showing up on that other thread, just a few machines often seemed to log that error when shutting down -- peripatus for example -- I don't know why, but I assume something to do with shutdown timing and page alignment. > It could entirely be caused by postmaster slowly killing processes after the > assertion failure and that that is corrupting shared memory state though. But > it might also be related. Hmm.