Hi, Thomas has reported this failure in an email [1] and shared the following links offlist with me: https://cirrus-ci.com/task/5311549010083840 https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/testrun/build/testrun/subscription/100_bugs/log/100_bugs_twoways.log https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt
The call stack is as follows: 00000063`4edff670 00007ff7`1922fcdf postgres!ExceptionalCondition( char * conditionName = 0x00007ff7`198f8050 "TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", char * fileName = 0x00007ff7`198f8020 "../src/backend/replication/logical/snapbuild.c", int lineNumber = 0n600)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67] 00000063`4edff6b0 00007ff7`192106df postgres!SnapBuildInitialSnapshot( struct SnapBuild * builder = 0x00000251`5b95bce8)+0x20f [c:\cirrus\src\backend\replication\logical\snapbuild.c @ 600] 00000063`4edff730 00007ff7`1920d9f6 postgres!CreateReplicationSlot( struct CreateReplicationSlotCmd * cmd = 0x00000251`5b94d828)+0x40f [c:\cirrus\src\backend\replication\walsender.c @ 1152] 00000063`4edff870 00007ff7`192bc9c4 postgres!exec_replication_command( char * cmd_string = 0x00000251`5b94ac68 "CREATE_REPLICATION_SLOT "pg_16400_sync_16392_7163433409941550636" LOGICAL pgoutput (SNAPSHOT 'use')")+0x4a6 [c:\cirrus\src\backend\replication\walsender.c @ 1804] As per my investigation based on the above logs, the failed test is due to the following command in 100_bugs.pl: $node_twoways->safe_psql('d2', "CREATE SUBSCRIPTION testsub CONNECTION \$\$" . $node_twoways->connstr('d1') . "\$\$ PUBLICATION testpub WITH (create_slot=false, " . "slot_name='testslot')"); It failed while creating the table sync slot. The failure happens because the xmin computed by the snap builder is lesser than what is computed by GetOldestSafeDecodingTransactionId() during initial snapshot creation for the tablesync slot by SnapBuildInitialSnapshot. To investigate, I tried to study how the values of "safeXid" and "snap->xmin" are computed in SnapBuildInitialSnapshot(). There appear to be four places in the code where we assign value to xmin (builder-xmin) during the snapshot building process and then we assign the same to snap->xmin. Those places are: (a) Two places in SnapBuildFindSnapshot(), (b) One place in SnapBuildRestore(), and (c) One place in SnapBuildProcessRunningXacts() Seeing the LOGS, it appears to me that we find a consistent point from the below code in SnapBuildFindSnapshot() and the following line assigns builder->xmin. ... if (running->oldestRunningXid == running->nextXid) { ... builder->xmin = running->nextXid; The reason is we only see "logical decoding found consistent point at ..." in LOGs. If SnapBuildFindSnapshot() has to go through the entire machinery of snapshot building then, we should have seen "logical decoding found initial starting point at ..." and similar other LOGs. The builder->xmin can't be assigned from any other place in (b) or (c). From (c), it can't be assigned because we are building a full snapshot here which won't restore any serialized snapshot. It can't be assigned from (b) because while creating a slot we stop as soon as we find the consistent point, see DecodingContextFindStartpoint()->DecodingContextReady() In the above code snippet, the running->nextXid in the above code snippet has been assigned from ShmemVariableCache->nextXid in GetRunningTransactionData(). The safeXid computed from GetOldestSafeDecodingTransactionId() uses ShmemVariableCache->nextXid as the starting point and keeps the slot's xmin as the safe Xid limit. It seems to me due to SnapBuilder's initial_xmin_horizon, we won't set (SnapBuilder's) xmin lesser than slot's effective_xmin computed in CreateInitDecodingContext(). See SnapBuildFindSnapshot(). So, ideally, SnapBuildInitialSnapshot() should never compute safeXid which is based on the minimum of all slot's effective_xmin to be greater than SnapBuilder's xmin (or snapshot's xmin). In short, the code as written seems correct to me. I have also tried to analyze if any recent commit (7f13ac8123) could cause this problem but can't think of any reason because the changes are related to the restart of decoding and the failed test is related to creating a slot the very first time. I don't have any good ideas on how to proceed with this. Any thoughts on this would be helpful? Note: I have briefly discussed this issue with Sawada-San and Kuroda-San, so kept them in Cc. [1] - https://www.postgresql.org/message-id/CA%2BhUKG%2BA_LyW%3DFAOi2ebA9Vr-1%3Desu%2BeBSm0dsVhU%3DEgfpipkg%40mail.gmail.com -- With Regards, Amit Kapila.