Hi,

On 2023-03-21 01:25:11 +0300, Alexander Korotkov wrote:
> I'm going to push patchset v15 if no objections.

Just saw that this went in - didn't catch up with the thread before,
unfortunately. At the very least I'd like to see some more work on cleaning up
the lazy tuple slot stuff. It's replete with unnecessary multiple-evaluation
hazards - I realize there are some of those already, but I don't think we
should go further down that route. As far as I can tell there's no need for
any of this to be macros.


> From a8b4e8a7b27815e013ea07b8cc9ac68541a9ac07 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov <akorot...@postgresql.org>
> Date: Tue, 21 Mar 2023 00:34:15 +0300
> Subject: [PATCH 1/2] Evade extra table_tuple_fetch_row_version() in
>  ExecUpdate()/ExecDelete()
>
> When we lock a tuple using table_tuple_lock(), we fetch the locked tuple
> into the slot at the same time.  In this case we can skip the extra
> table_tuple_fetch_row_version(), since we've already fetched the 'old'
> tuple and nobody can change it concurrently while it's locked.
>
> Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
> Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
> Reviewed-by: Andres Freund, Chris Travers
> ---
>  src/backend/executor/nodeModifyTable.c | 48 +++++++++++++++++++-------
>  1 file changed, 35 insertions(+), 13 deletions(-)
>
> diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
> index 3a673895082..93ebfdbb0d8 100644
> --- a/src/backend/executor/nodeModifyTable.c
> +++ b/src/backend/executor/nodeModifyTable.c
> @@ -1559,6 +1559,22 @@ ldelete:
>  					{
>  						case TM_Ok:
>  							Assert(context->tmfd.traversed);
> +
> +							/*
> +							 * Save locked tuple for further processing of
> +							 * RETURNING clause.
> +							 */
> +							if (processReturning &&
> +								resultRelInfo->ri_projectReturning &&
> +								!resultRelInfo->ri_FdwRoutine)
> +							{
> +								TupleTableSlot *returningSlot;
> +
> +								returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
> +								ExecCopySlot(returningSlot, inputslot);
> +								ExecMaterializeSlot(returningSlot);
> +							}
> +
>  							epqslot = EvalPlanQual(context->epqstate,
>  												   resultRelationDesc,
>  												   resultRelInfo->ri_RangeTableIndex,

This seems a bit byzantine. We use inputslot = EvalPlanQualSlot(...) to make
EvalPlanQual() a bit cheaper, because that avoids a slot copy inside
EvalPlanQual().  But now we copy and materialize that slot anyway - and we do
so even if EPQ fails. And afaics we also do it when epqreturnslot is set, in
which case we'll never use the copied slot.

Read the next paragraph below before replying to the above - I don't think
this is right for other reasons:

> @@ -1673,12 +1689,17 @@ ldelete:
>  		}
>  		else
>  		{
> +			/*
> +			 * The tuple may already have been fetched into the returning
> +			 * slot if we previously locked it.  Fetch the tuple only if
> +			 * the slot is empty.
> +			 */
>  			slot = ExecGetReturningSlot(estate, resultRelInfo);
>  			if (oldtuple != NULL)
>  			{
>  				ExecForceStoreHeapTuple(oldtuple, slot, false);
>  			}
> -			else
> +			else if (TupIsNull(slot))
>  			{
>  				if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
>  												   SnapshotAny, slot))


I don't think this is correct as-is - what if ExecDelete() is called with some
older tuple in the returning slot? If we don't enter the TM_Updated path, it
won't get updated, and we'll return the wrong tuple.  It certainly looks
possible to me - consider what happens if a first tuple enters the TM_Updated
path but then fails EvalPlanQual(). If a second tuple is then deleted without
entering the TM_Updated path, the wrong tuple will be used for RETURNING.

<plays around with isolationtester>

Yes, indeed. The attached isolationtest breaks with 764da7710bf.


I think it's entirely sensible to avoid the tuple fetching in ExecDelete(),
but it needs a bit less localized work. Instead of using the presence of a
tuple in the returning slot, ExecDelete() should track whether it already has
fetched the deleted tuple.

Or alternatively, do the work to avoid refetching the tuple for the much more
common case of not needing EPQ at all.

I guess this really is part of my issue with this change - it optimizes the
rare case, while not addressing the same inefficiency in the common case.



> @@ -299,14 +305,46 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
>  static TM_Result
>  heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
>  					Snapshot snapshot, Snapshot crosscheck, bool wait,
> -					TM_FailureData *tmfd, bool changingPart)
> +					TM_FailureData *tmfd, bool changingPart,
> +					LazyTupleTableSlot *lockedSlot)
>  {
> +	TM_Result	result;
> +
>  	/*
>  	 * Currently Deleting of index tuples are handled at vacuum, in case if
>  	 * the storage itself is cleaning the dead tuples by itself, it is the
>  	 * time to call the index tuple deletion also.
>  	 */
> -	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> +	result = heap_delete(relation, tid, cid, crosscheck, wait,
> +						 tmfd, changingPart);
> +
> +	/*
> +	 * If the tuple has been concurrently updated, then get the lock on it.
> +	 * (Do this only if the caller asked for it by providing a 'lockedSlot'.)
> +	 * With the lock held, a retry of the delete should succeed even if there
> +	 * are more concurrent update attempts.
> +	 */
> +	if (result == TM_Updated && lockedSlot)
> +	{
> +		TupleTableSlot *evalSlot;
> +
> +		Assert(wait);
> +
> +		evalSlot = LAZY_TTS_EVAL(lockedSlot);
> +		result = heapam_tuple_lock_internal(relation, tid, snapshot,
> +											evalSlot, cid, LockTupleExclusive,
> +											LockWaitBlock,
> +											TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> +											tmfd, true);

Oh, huh? As I mentioned before, the unconditional use of LockWaitBlock means
the wait parameter is ignored.

I'm frankly getting annoyed here.


> +/*
> + * This routine does the work for heapam_tuple_lock(), but also supports an
> + * 'updated' argument to reuse the work done by heapam_tuple_update() or
> + * heapam_tuple_delete() when they have already determined that the tuple
> + * was concurrently updated.
> + */
> +static TM_Result
> +heapam_tuple_lock_internal(Relation relation, ItemPointer tid,
> +						   Snapshot snapshot, TupleTableSlot *slot,
> +						   CommandId cid, LockTupleMode mode,
> +						   LockWaitPolicy wait_policy, uint8 flags,
> +						   TM_FailureData *tmfd, bool updated)

Why is the new parameter named 'updated'?


>  {
>  	BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
>  	TM_Result	result;
> -	Buffer		buffer;
> +	Buffer		buffer = InvalidBuffer;
>  	HeapTuple	tuple = &bslot->base.tupdata;
>  	bool		follow_updates;
>
> @@ -374,16 +455,26 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
>
>  tuple_lock_retry:
>  	tuple->t_self = *tid;
> -	result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> -							 follow_updates, &buffer, tmfd);
> +	if (!updated)
> +		result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> +								 follow_updates, &buffer, tmfd);
> +	else
> +		result = TM_Updated;
>
>  	if (result == TM_Updated &&
>  		(flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
>  	{
> -		/* Should not encounter speculative tuple on recheck */
> -		Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
> +		if (!updated)
> +		{
> +			/* Should not encounter speculative tuple on recheck */
> +			Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));

Hm, why is it ok to encounter speculative tuples in the updated case? Oh, I
guess you got failures because the slot doesn't point anywhere at this point.


> -             ReleaseBuffer(buffer);
> +                     ReleaseBuffer(buffer);
> +             }
> +             else
> +             {
> +                     updated = false;
> +             }
>
>               if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
>               {

Which means this is completely bogus now?

        HeapTuple       tuple = &bslot->base.tupdata;

In the first iteration this just points to the newly created slot. Which
doesn't have a tuple stored in it. So the above checks some uninitialized
memory.


Giving up at this point.


This doesn't seem ready to have been committed.

Greetings,

Andres Freund
From 45e727dd4265baf338c641dd1460ed222be1a68a Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Wed, 22 Mar 2023 16:47:09 -0700
Subject: [PATCH] epq-delete-returning spec test

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .../isolation/expected/eval-plan-qual2.out    | 37 +++++++++++++++++++
 src/test/isolation/isolation_schedule         |  1 +
 src/test/isolation/specs/eval-plan-qual2.spec | 25 +++++++++++++
 3 files changed, 63 insertions(+)
 create mode 100644 src/test/isolation/expected/eval-plan-qual2.out
 create mode 100644 src/test/isolation/specs/eval-plan-qual2.spec

diff --git a/src/test/isolation/expected/eval-plan-qual2.out b/src/test/isolation/expected/eval-plan-qual2.out
new file mode 100644
index 00000000000..66ffdde3279
--- /dev/null
+++ b/src/test/isolation/expected/eval-plan-qual2.out
@@ -0,0 +1,37 @@
+Parsed test spec with 3 sessions
+
+starting permutation: read_u wx2 wb1 c2 c1 read_u read
+step read_u: SELECT * FROM accounts
+accountid|balance|balance2
+---------+-------+--------
+checking |    600|    1200
+savings  |    600|    1200
+(2 rows)
+
+step wx2: UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance;
+balance
+-------
+   1050
+(1 row)
+
+step wb1: DELETE FROM accounts WHERE balance = 600 RETURNING *; <waiting ...>
+step c2: COMMIT;
+step wb1: <... completed>
+accountid|balance|balance2
+---------+-------+--------
+savings  |    600|    1200
+(1 row)
+
+step c1: COMMIT;
+step read_u: SELECT * FROM accounts
+accountid|balance|balance2
+---------+-------+--------
+checking |   1050|    2100
+(1 row)
+
+step read: SELECT * FROM accounts ORDER BY accountid;
+accountid|balance|balance2
+---------+-------+--------
+checking |   1050|    2100
+(1 row)
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index 4fc56ae99c9..029f7da674f 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -1,3 +1,4 @@
+test: eval-plan-qual2
 test: read-only-anomaly
 test: read-only-anomaly-2
 test: read-only-anomaly-3
diff --git a/src/test/isolation/specs/eval-plan-qual2.spec b/src/test/isolation/specs/eval-plan-qual2.spec
new file mode 100644
index 00000000000..4dae3193ca0
--- /dev/null
+++ b/src/test/isolation/specs/eval-plan-qual2.spec
@@ -0,0 +1,25 @@
+setup
+{
+ CREATE TABLE accounts (accountid text PRIMARY KEY, balance numeric not null,
+   balance2 numeric GENERATED ALWAYS AS (balance * 2) STORED);
+ INSERT INTO accounts VALUES ('checking', 600), ('savings', 600);
+}
+
+
+session s1
+setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
+step wb1	{ DELETE FROM accounts WHERE balance = 600 RETURNING *; }
+step c1		{ COMMIT; }
+
+session s2
+setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
+step wx2	{ UPDATE accounts SET balance = balance + 450 WHERE accountid = 'checking' RETURNING balance; }
+step c2	{ COMMIT; }
+
+session s3
+setup		{ BEGIN ISOLATION LEVEL READ COMMITTED; }
+step read	{ SELECT * FROM accounts ORDER BY accountid; }
+step read_u	{ SELECT * FROM accounts }
+
+
+permutation read_u wx2 wb1 c2 c1 read_u read
-- 
2.38.0
