Hi,

On 2022-11-14 17:25:31 -0800, Andres Freund wrote:
> Hm, also, shouldn't the patch adding CRS_USE_SNAPSHOT have copied more of
> SnapBuildExportSnapshot()? Why aren't the error checks for
> SnapBuildExportSnapshot() needed? Why don't we need to set XactReadOnly? Which
> transactions are we even in when we import the snapshot (cf.
> SnapBuildExportSnapshot() doing a StartTransactionCommand()).

Most of the checks for that are in CreateReplicationSlot() - but not all,
e.g. XactReadOnly isn't set, nor do we enforce in an obvious place that we
don't already hold a snapshot.

I first thought this might directly explain the problem, due to the
MyProc->xmin assignment in SnapBuildInitialSnapshot() overwriting a value that
could influence the return value for GetOldestSafeDecodingTransactionId(). But
that happens later, and we check that MyProc->xmin is invalid at the start.

But it still seems suspicious. This will e.g. influence whether
StartupDecodingContext() sets PROC_IN_LOGICAL_DECODING. Which probably is
fine, but...


Another theory: I dimly remember Thomas mentioning that there's some different
behaviour of xlogreader during shutdown as part of the v15 changes. I don't
quite remember what the scenario leading up to that was. Thomas?


It's certainly interesting that we see stuff like:

2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] ERROR:  could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [2012][walsender] [pg_16400_sync_16395_7163433409941550636][8/0:0] STATEMENT:  START_REPLICATION SLOT "pg_16400_sync_16395_7163433409941550636" LOGICAL 0/1D2B650 (proto_version '3', origin 'any', publication_names '"testpub"')
ERROR:  could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710
2022-11-08 00:20:23.255 GMT [248][logical replication worker] ERROR:  error while shutting down streaming COPY: ERROR:  could not find record while sending logically-decoded data: missing contrecord at 0/1D3B710

That could, though, be caused entirely by postmaster slowly killing processes
after the assertion failure, with that corrupting shared memory state. But
it might also be related.


Greetings,

Andres Freund

