Re: table_tuple_lock's snapshot argument

Alexander Korotkov Tue, 21 Jan 2025 02:06:02 -0800

On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov
<[email protected]> wrote:
> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <[email protected]> wrote:
> >
> > > /*
> > >  * Lock a tuple in the specified mode.
> > >  *
> > >  * Input parameters:
> > >  *    relation: relation containing tuple (caller must hold suitable lock)
> > >  *    tid: TID of tuple to lock
> > >  *    snapshot: snapshot to use for visibility determinations
> > >  *    cid: current command ID (used for visibility test, and stored into
> > >  *            tuple's cmax if lock is successful)
> > >  *    mode: lock mode desired
> > >  *    wait_policy: what to do if tuple lock is not available
> > >  *    flags:
> > >  *            If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the 
> > > update chain to
> > >  *            also lock descendant tuples if lock modes don't conflict.
> > >  *            If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update 
> > > chain and lock
> > >  *            latest version.
> > >  *
> > >  * Output parameters:
> > >  *    *slot: contains the target tuple
> > >  *    *tmfd: filled in failure cases (see below)
> > >  *
> > >  * Function result may be:
> > >  *    TM_Ok: lock was successfully acquired
> > >  *    TM_Invisible: lock failed because tuple was never visible to us
> > >  *    TM_SelfModified: lock failed because tuple updated by self
> > >  *    TM_Updated: lock failed because tuple updated by other xact
> > >  *    TM_Deleted: lock failed because tuple deleted by other xact
> > >  *    TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
> > >  *
> > >  * In the failure cases other than TM_Invisible and TM_Deleted, the 
> > > routine
> > >  * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax. 
> > >  See
> > >  * comments for struct TM_FailureData for additional info.
> > >  */
> > > static inline TM_Result
> > > table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
> > >                                TupleTableSlot *slot, CommandId cid, 
> > > LockTupleMode mode,
> > >                                LockWaitPolicy wait_policy, uint8 flags,
> > >                                TM_FailureData *tmfd)
> >
> > What are the semantics of the 'snapshot' argument? In the heapam
> > implementation, it's not used for anything. What visibility checks the
> > function might do in a different implementation? I vaguely remember that
> > the idea was that the TID might not be sufficient to uniquely identify
> > the row version in something like zheap, which updates the row in place.
> > In that case, all the different row versions are represented by the same
> > TID, and the snapshot identifies the version.
> >
> > There are a few callers of table_tuple_lock:
> >
> > 1. trigger.c: GetTupleForTrigger
> > 2. nodeModifyTable.c
> > 3. nodeLockRows.c
> > 4. execReplication.c
> >
> > The first three callers pass the EState's snapshot, the same that was
> > used in a table or index scan that returned the TID. That makes sense.
> > But the calls in execReplication.c look fishy:
> >
> > >               PushActiveSnapshot(GetLatestSnapshot());
> > >
> > >               res = table_tuple_lock(rel, &(outslot->tts_tid), 
> > > GetLatestSnapshot(),
> > >                                                          outslot,
> > >                                                          
> > > GetCurrentCommandId(false),
> > >                                                          lockmode,
> > >                                                          LockWaitBlock,
> > >                                                          0 /* don't 
> > > follow updates */ ,
> > >                                                          &tmfd);
> > >
> > >               PopActiveSnapshot();
> > >
> > >               if (should_refetch_tuple(res, &tmfd))
> > >                       goto retry;
> >
> > Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
> > I think the idea was to push the latest snapshot and use the same
> > snapshot in the call to table_tuple_lock(). But because each call to
> > GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
> > the active snapshot and passes a *different* snapshot to
> > table_tuple_lock(). This went wrong in commit 5db6df0c01, which
> > introduced the update/delete/insert/lock table AM interface. The
> > argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
> >
> > However, I think GetLatestSnapshot() is wrong here anyway. None of this
> > matters for heapam which just ignores the 'snapshot' argument, but let's
> > assume a different AM that would use the snapshot to identify the tuple
> > version. The TID was fetched from an index using an index scan with
> > SnapshotDirty. There's no guarantee that the LatestSnapshot would match
> > the same tuple version that the index scan found. If an MVCC snapshot is
> > needed, surely it should be acquired before the index scan, and used for
> > the index scan as well.
> >
> > I see three options:
> >
> > 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> > get a table AM where a single TID represents multiple row versions, this
> > will need to be revisited.
> >
> > 2. Rewrite the recheck code in execReplication.c so that it uses the
> > snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
> > and use the same snapshot in the index scan and table_tuple_lock().
> > Acquiring a snapshot isn't free though, so it would be nice to avoid
> > doing that when the heapam is just going to ignore it anyway. If we go
> > with this option, I think we could reuse the snapshot that is already
> > active in most cases, and only take a new snapshot if the tuple was
> > concurrently updated.
> >
> > 3. Modify the tableam interface so that the index scan can return a more
> > unique identifier of the tuple version. In heapam, it could be the TID
> > like today, but a different AM could return some other token. Continue
> > to use SnapshotDirty in the index scan, but in the call to
> > table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
> > the token you got index_getnext_slot().
>
> Thank you for your finding and analysis.  I would vote for #3.
> Currently snapshot argument is not properly used, and the heapam code
> doesn't check it.  I think allowing flexible tuple identifier is
> essential for future of table AM API.  Therefore, any
> snapshot/visibility information could be put inside that tuple
> identifier if needed.


To be more specific, I vote for #1 + #3. Remove the snapshot argument
for now, then work on flexible tuple identifier.

------
Regards,
Alexander Korotkov
Supabase

Re: table_tuple_lock's snapshot argument

Reply via email to