Hi, On 2022-11-14 17:25:31 -0800, Andres Freund wrote: > Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of > SnapBuildExportSnapshot()? Why aren't the error checks for > SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which > transactions are we even in when we import the snapshot (cf. > SnapBuildExportSnapshot() doing a StartTransactionCommand()).
Most of the checks for that are in CreateReplicationSlot() - but not al, e.g. XactReadOnly isn't set, nor do we enforce in an obvious place that we don't already hold a snapshot. I first thought this might directly explain the problem, due to the MyProc->xmin assignment in SnapBuildInitialSnapshot() overwriting a value that could influence the return value for GetOldestSafeDecodingTransactionId(). But that happens later, and we check that MyProc->xmin is invalid at the start. But it still seems supicious. This will e.g. influence whether StartupDecodingContext() sets PROC_IN_LOGICAL_DECODING. Which probably is fine, but... Another theory: I dimly remember Thomas mentioning that there's some different behaviour of xlogreader during shutdown as part of the v15 changes. I don't quite remember what the scenario leading up to that was. Thomas? It's certainly interesting that we see stuff like: 2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710 2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"') ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710 2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710 It could entirely be caused by postmaster slowly killing processes after the assertion failure and that that is corrupting shared memory state though. But it might also be related. Greetings, Andres Freund