Hi!

On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>
> > /*
> >  * Lock a tuple in the specified mode.
> >  *
> >  * Input parameters:
> >  *    relation: relation containing tuple (caller must hold suitable lock)
> >  *    tid: TID of tuple to lock
> >  *    snapshot: snapshot to use for visibility determinations
> >  *    cid: current command ID (used for visibility test, and stored into
> >  *            tuple's cmax if lock is successful)
> >  *    mode: lock mode desired
> >  *    wait_policy: what to do if tuple lock is not available
> >  *    flags:
> >  *            If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update 
> > chain to
> >  *            also lock descendant tuples if lock modes don't conflict.
> >  *            If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain 
> > and lock
> >  *            latest version.
> >  *
> >  * Output parameters:
> >  *    *slot: contains the target tuple
> >  *    *tmfd: filled in failure cases (see below)
> >  *
> >  * Function result may be:
> >  *    TM_Ok: lock was successfully acquired
> >  *    TM_Invisible: lock failed because tuple was never visible to us
> >  *    TM_SelfModified: lock failed because tuple updated by self
> >  *    TM_Updated: lock failed because tuple updated by other xact
> >  *    TM_Deleted: lock failed because tuple deleted by other xact
> >  *    TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
> >  *
> >  * In the failure cases other than TM_Invisible and TM_Deleted, the routine
> >  * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax.  
> > See
> >  * comments for struct TM_FailureData for additional info.
> >  */
> > static inline TM_Result
> > table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
> >                                TupleTableSlot *slot, CommandId cid, 
> > LockTupleMode mode,
> >                                LockWaitPolicy wait_policy, uint8 flags,
> >                                TM_FailureData *tmfd)
>
> What are the semantics of the 'snapshot' argument? In the heapam
> implementation, it's not used for anything. What visibility checks the
> function might do in a different implementation? I vaguely remember that
> the idea was that the TID might not be sufficient to uniquely identify
> the row version in something like zheap, which updates the row in place.
> In that case, all the different row versions are represented by the same
> TID, and the snapshot identifies the version.
>
> There are a few callers of table_tuple_lock:
>
> 1. trigger.c: GetTupleForTrigger
> 2. nodeModifyTable.c
> 3. nodeLockRows.c
> 4. execReplication.c
>
> The first three callers pass the EState's snapshot, the same that was
> used in a table or index scan that returned the TID. That makes sense.
> But the calls in execReplication.c look fishy:
>
> >               PushActiveSnapshot(GetLatestSnapshot());
> >
> >               res = table_tuple_lock(rel, &(outslot->tts_tid), 
> > GetLatestSnapshot(),
> >                                                          outslot,
> >                                                          
> > GetCurrentCommandId(false),
> >                                                          lockmode,
> >                                                          LockWaitBlock,
> >                                                          0 /* don't follow 
> > updates */ ,
> >                                                          &tmfd);
> >
> >               PopActiveSnapshot();
> >
> >               if (should_refetch_tuple(res, &tmfd))
> >                       goto retry;
>
> Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
> I think the idea was to push the latest snapshot and use the same
> snapshot in the call to table_tuple_lock(). But because each call to
> GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
> the active snapshot and passes a *different* snapshot to
> table_tuple_lock(). This went wrong in commit 5db6df0c01, which
> introduced the update/delete/insert/lock table AM interface. The
> argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
>
> However, I think GetLatestSnapshot() is wrong here anyway. None of this
> matters for heapam which just ignores the 'snapshot' argument, but let's
> assume a different AM that would use the snapshot to identify the tuple
> version. The TID was fetched from an index using an index scan with
> SnapshotDirty. There's no guarantee that the LatestSnapshot would match
> the same tuple version that the index scan found. If an MVCC snapshot is
> needed, surely it should be acquired before the index scan, and used for
> the index scan as well.
>
> I see three options:
>
> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> get a table AM where a single TID represents multiple row versions, this
> will need to be revisited.
>
> 2. Rewrite the recheck code in execReplication.c so that it uses the
> snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
> and use the same snapshot in the index scan and table_tuple_lock().
> Acquiring a snapshot isn't free though, so it would be nice to avoid
> doing that when the heapam is just going to ignore it anyway. If we go
> with this option, I think we could reuse the snapshot that is already
> active in most cases, and only take a new snapshot if the tuple was
> concurrently updated.
>
> 3. Modify the tableam interface so that the index scan can return a more
> unique identifier of the tuple version. In heapam, it could be the TID
> like today, but a different AM could return some other token. Continue
> to use SnapshotDirty in the index scan, but in the call to
> table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
> the token you got index_getnext_slot().

Thank you for your finding and analysis.  I would vote for #3.
Currently snapshot argument is not properly used, and the heapam code
doesn't check it.  I think allowing flexible tuple identifier is
essential for future of table AM API.  Therefore, any
snapshot/visibility information could be put inside that tuple
identifier if needed.

------
Regards,
Alexander Korotkov
Supabase


Reply via email to