The problem is that for some reason the consistent point is not so
consistent, and we get partially decoded transactions.
I.e. the transaction body consists of two UPDATEs, but reorder_buffer extracts
only one of them (the last) and sends this truncated transaction to the
destination, causing a consistency violation at the replica. I started
investigating the logical decoding code and found several things which I
do not understand.
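For reference, here is roughly how the decoded contents of a transaction
can be inspected; a minimal sketch using the stock test_decoding plugin and
hypothetical table/slot names (not our multimaster plugin, and it does not
exercise the PREPARE path used below):

-- throwaway logical slot with the text-output test_decoding plugin
SELECT pg_create_logical_replication_slot('debug_slot', 'test_decoding');

-- a transaction with two updates, analogous to the one shown below
BEGIN;
UPDATE t SET v = v + 1 WHERE k = 1;
UPDATE t SET v = v + 1 WHERE k = 2;
COMMIT;

-- peek at the decoded stream without consuming it;
-- both UPDATEs should appear between BEGIN and COMMIT
SELECT * FROM pg_logical_slot_peek_changes('debug_slot', NULL, NULL);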
Never seen this happen. Do you have more details about what exactly is
happening?
This is the transaction at the primary node:
root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
-ti node1 bash
postgres@9a04a0c9f246:/pg$ pg_xlogdump
data/pg_xlog/000000010000000000000001 | fgrep "tx: 6899"
rmgr: Heap len (rec/tot): 7/ 53, tx: 6899, lsn:
0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heap len (rec/tot): 14/ 74, tx: 6899, lsn:
0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off
224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heap len (rec/tot): 7/ 53, tx: 6899, lsn:
0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
rmgr: Heap len (rec/tot): 14/ 82, tx: 6899, lsn:
0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55
xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel
1663/12407/16421 blk 49
rmgr: Btree len (rec/tot): 2/ 64, tx: 6899, lsn:
0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot): 236/ 265, tx: 6899, lsn:
0/019EFD30, prev 0/019EFCF0, desc: PREPARE
pg_xlogdump: FATAL: error in WAL record at 0/1AC0E70: invalid record
length at 0/1AC0EA8: wanted 24, got 0
This is the replicated transaction at another node (it is not clear from
the trace, but believe me, it is the same transaction):
root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
-ti node2 bash
postgres@e5b16d82ce06:/pg$ pg_xlogdump
data/pg_xlog/000000010000000000000001 | fgrep "tx: 6882"
rmgr: Heap len (rec/tot): 14/ 74, tx: 6882, lsn:
0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new
off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
rmgr: Heap len (rec/tot): 14/ 82, tx: 6882, lsn:
0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4
xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel
1663/12407/16421 blk 51
rmgr: Btree len (rec/tot): 2/ 64, tx: 6882, lsn:
0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot): 248/ 274, tx: 6882, lsn:
0/019F61F8, prev 0/019F60E8, desc: PREPARE
pg_xlogdump: FATAL: error in WAL record at 0/1AD3AD8: invalid record
length at 0/1AD3B10: wanted 24, got 0
And "shorten" version of the same transaction at the third (recovered)
node:
root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec
-ti node3 bash
postgres@b4066d4211bc:/pg$ pg_xlogdump
data/pg_xlog/000000010000000000000001 | fgrep "tx: 6893"
rmgr: Heap len (rec/tot): 7/ 53, tx: 6893, lsn:
0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
rmgr: Heap len (rec/tot): 14/ 82, tx: 6893, lsn:
0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new
off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel
1663/12407/16421 blk 50
rmgr: Btree len (rec/tot): 2/ 64, tx: 6893, lsn:
0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot): 236/ 265, tx: 6893, lsn:
0/01A298F8, prev 0/01A298B8, desc: PREPARE
pg_xlogdump: FATAL: error in WAL record at 0/1ACBBF8: invalid record
length at 0/1ACBC30: wanted 24, got 0
You can see only one update instead of two.
Sorry, I have not saved the trace with the logical decoder's output. But it
really decodes just one update!
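If it is useful, the decoder output can be captured again with
pg_recvlogical; a sketch assuming a text-output plugin such as test_decoding
and a hypothetical slot name (our multimaster plugin speaks its own protocol,
so this is only an approximation of what I did):

pg_recvlogical -d postgres -S debug_slot -P test_decoding --create-slot
pg_recvlogical -d postgres -S debug_slot --start -f decoded.txt
# decoded.txt then lists every change emitted per transaction,
# so a truncated transaction is immediately visible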
What I have done:
DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_REPLICATION_SLOT "mtm_slot_1" LOGICAL "multimaster";
START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/0 ("startup_params_format"
'1', "max_proto_version" '1', "min_proto_version" '1',
"forward_changesets" '1', "mtm_replication_mode" 'recovered');
I have also tried to calculate the last origin LSN for this node and
explicitly specify it in START_REPLICATION.
But it doesn't help: the same problem persists.
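For completeness, this is the form I used when passing an explicit start
position (the LSN below is just a placeholder, not the value calculated for
this node); as far as I understand, the server will not replay anything from
before the slot's confirmed_flush position anyway:

START_REPLICATION SLOT "mtm_slot_1" LOGICAL 0/1A00000 ("startup_params_format"
'1', "max_proto_version" '1', "min_proto_version" '1',
"forward_changesets" '1', "mtm_replication_mode" 'recovered');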