On 13/10/2025 10:48, Jasper Smit wrote:
We found a bug in heap_lock_tuple(). It seems to be unsafe to follow
the t_ctid pointer of updated tuples in the presence of aborted tuples.
VACUUM can mark aborted tuples LP_UNUSED even if there are still
visible tuples that have a t_ctid pointing to them.

We would like to address this issue in postgres. Attached is a patch
that modifies VACUUM behavior.
Could you advise us on the best way to proceed? How can we get this
patch included in postgres, or would you recommend
pursuing an alternative approach?

You're at the right place :-). Thanks for the script and the patch!

The following scenario leads to an assertion failure in debug builds,
or to wrong results in release builds.

1. (session 1) A tuple is being UPDATEd in an active transaction.
2. (session 2) Another session locks that tuple, for example, SELECT
.. FOR UPDATE.
3. heap_lock_tuple() calls HeapTupleSatisfiesUpdate(), which returns
TM_BeingModified.
    It will record the xmax and t_ctid of the tuple.
4. Since the tuple is actively being updated, we need to wait for
session 1 to complete.
5. The UPDATE of session 1 aborts, leaving the original tuple with an
aborted xmax and t_ctid pointing to a
     tuple with an aborted xmin.
6. VACUUM marks aborted tuples as unused, regardless of whether they
are still referenced by a visible tuple through t_ctid.
    As a result, the aborted tuple is marked unused.
7. An INSERT happens and the tuple is placed at the recycled location
of the aborted tuple.
8. The newly INSERTed tuple is UPDATEd.
9. heap_lock_tuple() of session 2 resumes, and will next call
heap_lock_updated_tuple() using the t_ctid which was recorded in step
3.
    Note that the code does not check the status of the transaction it
was waiting on.
    It will proceed regardless of whether that transaction committed
or aborted.
10. heap_lock_updated_tuple() will now operate on the tuple inserted in
step 7. It will see that this tuple was updated,
     and thus the return value of heap_lock_updated_tuple() will be TM_Updated.
11. This will eventually lead to the assertion (result ==
TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)

These concrete steps reproduce the problem:

Used version REL_18_0
Run with configuration:
io_combine_limit = '1'

session 1:
drop table if exists t;
create table t (id int);
insert into t (select generate_series(1, 452));
begin;
update t set id = 1000 where id = 1;

session 2:
gdb: b heapam.c:5033
gdb: continue
select * from t where id = 1 for update;

session 1:
abort;
select * from t;
VACUUM (TRUNCATE off);
insert into t values (453) returning ctid; -- Should be (2,1)
update t set id = 454 where id = 453 returning ctid;

session 2: (should now be at the breakpoint)
gdb: continue

We get the assertion:
(result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID).
Stacktrace included as attachment.

I can reproduce this.

We have also seen the error "t_xmin %u is uncommitted in tuple (%u,%u)
to be updated in table \"%s\"" in our testing. We believe that this has
a similar cause. However, we have not been able to come up with
reproduction steps for that assertion specifically.

Yeah, sounds quite plausible that you could get that too.

We think that this problem can be fixed in two different ways.
1. Check the commit status after waiting in heap_lock_updated_tuple(),
and don't proceed to check the updated tuple when the transaction has
aborted.

Hmm, I think heap_lock_updated_tuple() would still lock incorrect tuples in that case, which could lead to unrelated transactions failing, or at least waiting unnecessarily.

or check, like in heap_get_latest_tid(), whether the xmin of the updated
tuple matches the xmax of the old tuple.

+1 for that approach. heap_lock_updated_tuple() actually does that check too, but not for the first tuple.

2. Change vacuum, to delay marking aborted tuples dead until no
visible tuple will have a t_ctid pointing to that tuple.

I guess that works too, but the downside is that we can't vacuum aborted tuples as fast.

Attached is a fix that checks the prior xmax in heap_lock_updated_tuple() for the first tuple too. The first commit adds your repro script as a regression test using an injection point. It hits the assertion failure, or returns an incorrect result if assertions are disabled. The second commit fixes the bug and makes the test pass.

The comment on heap_lock_updated_tuple says:

 * The initial tuple is assumed to be already locked.

But AFAICS that's not true. The caller holds the heavy-weight tuple lock, to establish priority if there are multiple concurrent lockers, but it doesn't stamp the initial tuple's xmax until after the heap_lock_updated_tuple() call. I guess we just need to update that comment, but I would love to get another pair of eyes on this; this code is hairy.

- Heikki
From a7d7532a672129abf56f0819d431f8b6e6b3613f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Thu, 11 Dec 2025 18:20:37 +0200
Subject: [PATCH v1 1/2] Add repro for heap locking visibility bug as a tap
 test

Discussion: https://www.postgresql.org/message-id/CAOG+RQ74x0q=kgBBQ=mezuvoezbfsxm1qu_o0v28bwdz3dh...@mail.gmail.com
---
 src/backend/access/heap/heapam.c              |  2 +
 .../expected/heap_lock_update.out             | 33 +++++++++
 src/test/modules/injection_points/meson.build |  1 +
 .../specs/heap_lock_update.spec               | 72 +++++++++++++++++++
 4 files changed, 108 insertions(+)
 create mode 100644 src/test/modules/injection_points/expected/heap_lock_update.out
 create mode 100644 src/test/modules/injection_points/specs/heap_lock_update.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 225f9829f22..a69df8bd431 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6069,6 +6069,8 @@ static TM_Result
 heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
 						TransactionId xid, LockTupleMode mode)
 {
+	INJECTION_POINT("heap_lock_updated_tuple", NULL);
+
 	/*
 	 * If the tuple has not been updated, or has moved into another partition
 	 * (effectively a delete) stop here.
diff --git a/src/test/modules/injection_points/expected/heap_lock_update.out b/src/test/modules/injection_points/expected/heap_lock_update.out
new file mode 100644
index 00000000000..b6308dcd316
--- /dev/null
+++ b/src/test/modules/injection_points/expected/heap_lock_update.out
@@ -0,0 +1,33 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert: 
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8d6f662040d..e92e7348a80 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -51,6 +51,7 @@ tests += {
       'reindex-concurrently-upsert',
       'reindex-concurrently-upsert-on-constraint',
       'reindex-concurrently-upsert-partitioned',
+      'heap_lock_update',
     ],
     'runningcheck': false, # see syscache-update-pruned
     # Some tests wait for all snapshots, so avoid parallel execution
diff --git a/src/test/modules/injection_points/specs/heap_lock_update.spec b/src/test/modules/injection_points/specs/heap_lock_update.spec
new file mode 100644
index 00000000000..f85b11532e9
--- /dev/null
+++ b/src/test/modules/injection_points/specs/heap_lock_update.spec
@@ -0,0 +1,72 @@
+# XXX
+# Test race conditions involving:
+# - s1: heap_update($FROM_SYSCACHE), without a snapshot or pin
+# - s2: ALTER TABLE making $FROM_SYSCACHE a dead tuple
+# - s3: "VACUUM pg_class" making $FROM_SYSCACHE become LP_UNUSED
+
+# This is a derivative work of inplace.spec, which exercises the corresponding
+# race condition for inplace updates.
+
+# Despite local injection points, this is incompatible with runningcheck.
+# First, removable_cutoff() could move backward, per its header comment.
+# Second, other activity could trigger sinval queue overflow, negating our
+# efforts to delay inval.  Third, this deadlock emerges:
+#
+# - step at2 waits at an injection point, with interrupts held
+# - an unrelated backend waits for at2 to do PROCSIGNAL_BARRIER_SMGRRELEASE
+# - step waitprunable4 waits for the unrelated backend to release its xmin
+
+# The alternative expected output is for -DCATCACHE_FORCE_RELEASE, a setting
+# that thwarts testing the race conditions this spec seeks.
+
+
+# XXX Need s2 to make a non-HOT update.  Otherwise, "VACUUM pg_class" would leave
+# an LP_REDIRECT that persists.  To get non-HOT, make rels so the pg_class row
+# for vactest.orig50 is on a filled page (assuming BLCKSZ=8192).  Just to save
+# on filesystem syscalls, use relkind=c for every other rel.
+setup
+{
+    CREATE EXTENSION injection_points;
+
+    create table t (id int);
+    insert into t (select generate_series(1, 452));
+}
+teardown
+{
+    drop table t;
+    DROP EXTENSION injection_points;
+}
+
+# Wait during GRANT.  Disable debug_discard_caches, since we're here to
+# exercise an outcome that happens under permissible cache staleness.
+session s1
+step s1begin	{ begin; }
+step s1update   { update t set id = 1000 where id = 1; }
+step s1abort	{ abort; }
+step vacuum	{ VACUUM (TRUNCATE off); }
+step reinsert   {
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+step wake	{
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+}
+
+session s2
+setup	{
+	SET debug_discard_caches = 0;
+	SELECT FROM injection_points_set_local();
+	SELECT FROM injection_points_attach('heap_lock_updated_tuple', 'wait');
+}
+step s2lock	{ select * from t where id = 1 for update; }
+
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert
+	wake(s2lock)
-- 
2.47.3

From 138b11e1a787f0d2b3fc913eec3c8d28527ee76c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Thu, 11 Dec 2025 18:03:10 +0200
Subject: [PATCH v1 2/2] Fix the bug with priorXmax

---
 src/backend/access/heap/heapam.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a69df8bd431..5fb81af8d4f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -86,6 +86,7 @@ static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  TransactionId *result_xmax, uint16 *result_infomask,
 									  uint16 *result_infomask2);
 static TM_Result heap_lock_updated_tuple(Relation rel, HeapTuple tuple,
+										 TransactionId priorXmax,
 										 const ItemPointerData *ctid, TransactionId xid,
 										 LockTupleMode mode);
 static void GetMultiXactIdHintBits(MultiXactId multi, uint16 *new_infomask,
@@ -4818,9 +4819,16 @@ l3:
 				 */
 				if (follow_updates && updated)
 				{
+					TransactionId updateXid;
 					TM_Result	res;
 
-					res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+					if (infomask & HEAP_XMAX_IS_MULTI)
+						updateXid = MultiXactIdGetUpdateXid(xwait, infomask);
+					else
+						updateXid = xwait;
+
+					res = heap_lock_updated_tuple(relation, tuple, updateXid,
+												  &t_ctid,
 												  GetCurrentTransactionId(),
 												  mode);
 					if (res != TM_Ok)
@@ -5065,9 +5073,16 @@ l3:
 			/* if there are updates, follow the update chain */
 			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
 			{
+				TransactionId updateXid;
 				TM_Result	res;
 
-				res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+				if (infomask & HEAP_XMAX_IS_MULTI)
+					updateXid = MultiXactIdGetUpdateXid(xwait, infomask);
+				else
+					updateXid = xwait;
+
+				res = heap_lock_updated_tuple(relation, tuple, updateXid,
+											  &t_ctid,
 											  GetCurrentTransactionId(),
 											  mode);
 				if (res != TM_Ok)
@@ -5721,7 +5736,8 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
  * version as well.
  */
 static TM_Result
-heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, TransactionId xid,
+heap_lock_updated_tuple_rec(Relation rel, TransactionId priorXmax,
+							const ItemPointerData *tid, TransactionId xid,
 							LockTupleMode mode)
 {
 	TM_Result	result;
@@ -5734,7 +5750,6 @@ heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, Transactio
 				old_infomask2;
 	TransactionId xmax,
 				new_xmax;
-	TransactionId priorXmax = InvalidTransactionId;
 	bool		cleared_all_frozen = false;
 	bool		pinned_desired_page;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -6066,7 +6081,8 @@ out_unlocked:
  * levels, because that would lead to a serializability failure.
  */
 static TM_Result
-heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
+heap_lock_updated_tuple(Relation rel, HeapTuple tuple, TransactionId priorXmax,
+						const ItemPointerData *ctid,
 						TransactionId xid, LockTupleMode mode)
 {
 	INJECTION_POINT("heap_lock_updated_tuple", NULL);
@@ -6089,7 +6105,7 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 		 */
 		MultiXactIdSetOldestMember();
 
-		return heap_lock_updated_tuple_rec(rel, ctid, xid, mode);
+		return heap_lock_updated_tuple_rec(rel, priorXmax, ctid, xid, mode);
 	}
 
 	/* nothing to lock */
-- 
2.47.3
