On Thu, 21 Aug 2025 16:30:09 +0300
Dmitry <dsy....@yandex.ru> wrote:

> Hi hackers,
> 
> I noticed an inconsistent update when executing MERGE commands, which 
> looks more like a bug.
> In my test example, the value of 'val' should increase in an ascending 
> monotonous sequence.
> 
> Test system
> ===========
> - Architecture: x86_64
> - OS: Ubuntu 24.04.3 LTS (Noble Numbat)
> - Tested postgres version(s):
>       - latest 17 (17.6)
> 
> Steps to reproduce
> ==================
>    postgres=# create table t_merge (id int primary key, val int);
>    CREATE TABLE
>    postgres=# create table t_merge_chk (val int primary key);
>    CREATE TABLE
>    postgres=# insert into t_merge values (1,0);
>    INSERT 0 1
> 
>    pgbench --no-vacuum --exit-on-abort -c 10 --file=/dev/stdin <<'EOF'
>    begin;
>    merge into t_merge t
>    using (select 1 id) s on (t.id = s.id)
>    when matched then update set val = t.val + 1
>    returning val \gset
> 
>    -- Checking the uniqueness of a value
>    insert into t_merge_chk (val) values (:val);
>    commit;
>    EOF
> 
>    pgbench: error: client 3 script 0 aborted in command 2 query 0: 
> ERROR:  duplicate key value violates unique constraint "t_merge_chk_pkey"
>    DETAIL:  Key (val)=(2) already exists.
> 
> 
> What do you think about this?

I confirmed this issue by executing the following query concurrently
in three transactions. (With only two transactions, the issue does not occur.)

 merge into t_merge t
 using (select 1 id) s on (t.id = s.id)
 when matched then update set val = t.val + 1
 returning old.ctid,old.val,new.ctid,new.val;

Theoretically, each transaction should increment the val value
independently, so the total result should be an increment of three.
However, sometimes (not always, but often) the second and third committed
transactions end up producing the same result.

For example, when the first committed transaction returns:

  ctid   | val |  ctid   | val 
---------+-----+---------+-----
 (2,158) |   4 | (2,159) |   5
(1 row)

the second committed transaction could return:

  ctid   | val |  ctid   | val 
---------+-----+---------+-----
 (2,159) |   5 | (2,160) |   6
(1 row)


and the third committed transaction could return:

  ctid   | val |  ctid   | val 
---------+-----+---------+-----
 (2,159) |   5 | (2,161) |   6
(1 row)

This is wrong, since the old.val and new.val should be 6 and 7, respectively.

I don't completely understand how this race condition occurs,
but I believe the bug is due to the misuse of TM_FailureData
returned by table_tuple_lock in ExecMergeMatched().

Currently, TM_FailureData.ctid is used as a reference to the
latest version of oldtuple, but this is not always correct. 
Instead, the tupleid passed to table_tuple_lock should be used.

I've attached a patch to fix this.

Regards,
Yugo Nagata

-- 
Yugo Nagata <nag...@sraoss.co.jp>
>From 4eb0f045b918325bb20d23c34fa2ec007b25b379 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 25 Aug 2025 01:57:25 +0900
Subject: [PATCH] Fix misuse of TM_FailureData.ctid in ExecMergeMatched

Previously, the ctid field of TM_FailureData returned by table_tuple_lock()
was used as a reference to the latest version of oldtuple, but
this was not always correct. Instead, the tupleid passed to
table_tuple_lock() should be used.

This misuse caused incorrect UPDATE results when more than two
MERGE commands were executed concurrently.
---
 src/backend/executor/nodeModifyTable.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7c6c2c1f6e4..ea1457ce3bf 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -3450,12 +3450,13 @@ lmerge_matched:
 									if (ItemPointerIsValid(&lockedtid))
 										UnlockTuple(resultRelInfo->ri_RelationDesc, &lockedtid,
 													InplaceUpdateTupleLock);
-									LockTuple(resultRelInfo->ri_RelationDesc, &context->tmfd.ctid,
+									LockTuple(resultRelInfo->ri_RelationDesc, tupleid,
 											  InplaceUpdateTupleLock);
-									lockedtid = context->tmfd.ctid;
+									lockedtid = *tupleid;
 								}
+
 								if (!table_tuple_fetch_row_version(resultRelationDesc,
-																   &context->tmfd.ctid,
+																   tupleid,
 																   SnapshotAny,
 																   resultRelInfo->ri_oldTupleSlot))
 									elog(ERROR, "failed to fetch the target tuple");
-- 
2.43.0

Reply via email to