On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov <aekorot...@gmail.com> wrote:
> On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> >
> > > /*
> > >  * Lock a tuple in the specified mode.
> > >  *
> > >  * Input parameters:
> > >  *    relation: relation containing tuple (caller must hold suitable lock)
> > >  *    tid: TID of tuple to lock
> > >  *    snapshot: snapshot to use for visibility determinations
> > >  *    cid: current command ID (used for visibility test, and stored into
> > >  *        tuple's cmax if lock is successful)
> > >  *    mode: lock mode desired
> > >  *    wait_policy: what to do if tuple lock is not available
> > >  *    flags:
> > >  *        If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update chain to
> > >  *        also lock descendant tuples if lock modes don't conflict.
> > >  *        If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock
> > >  *        latest version.
> > >  *
> > >  * Output parameters:
> > >  *    *slot: contains the target tuple
> > >  *    *tmfd: filled in failure cases (see below)
> > >  *
> > >  * Function result may be:
> > >  *    TM_Ok: lock was successfully acquired
> > >  *    TM_Invisible: lock failed because tuple was never visible to us
> > >  *    TM_SelfModified: lock failed because tuple updated by self
> > >  *    TM_Updated: lock failed because tuple updated by other xact
> > >  *    TM_Deleted: lock failed because tuple deleted by other xact
> > >  *    TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
> > >  *
> > >  * In the failure cases other than TM_Invisible and TM_Deleted, the routine
> > >  * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax.  See
> > >  * comments for struct TM_FailureData for additional info.
> > >  */
> > > static inline TM_Result
> > > table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
> > >                  TupleTableSlot *slot, CommandId cid,
> > >                  LockTupleMode mode,
> > >                  LockWaitPolicy wait_policy, uint8 flags,
> > >                  TM_FailureData *tmfd)
> >
> > What are the semantics of the 'snapshot' argument? In the heapam
> > implementation, it's not used for anything. What visibility checks might
> > the function do in a different implementation? I vaguely remember that
> > the idea was that the TID might not be sufficient to uniquely identify
> > the row version in something like zheap, which updates the row in place.
> > In that case, all the different row versions are represented by the same
> > TID, and the snapshot identifies the version.
> >
> > There are a few callers of table_tuple_lock:
> >
> > 1. trigger.c: GetTupleForTrigger
> > 2. nodeModifyTable.c
> > 3. nodeLockRows.c
> > 4. execReplication.c
> >
> > The first three callers pass the EState's snapshot, the same snapshot
> > that was used in the table or index scan that returned the TID. That
> > makes sense. But the calls in execReplication.c look fishy:
> >
> > >     PushActiveSnapshot(GetLatestSnapshot());
> > >
> > >     res = table_tuple_lock(rel, &(outslot->tts_tid),
> > >                            GetLatestSnapshot(),
> > >                            outslot,
> > >                            GetCurrentCommandId(false),
> > >                            lockmode,
> > >                            LockWaitBlock,
> > >                            0 /* don't follow updates */ ,
> > >                            &tmfd);
> > >
> > >     PopActiveSnapshot();
> > >
> > >     if (should_refetch_tuple(res, &tmfd))
> > >         goto retry;
> >
> > Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
> > I think the idea was to push the latest snapshot and use the same
> > snapshot in the call to table_tuple_lock(). But because each call to
> > GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
> > the active snapshot and passes a *different* snapshot to
> > table_tuple_lock(). This went wrong in commit 5db6df0c01, which
> > introduced the update/delete/insert/lock table AM interface. The
> > argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
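
For what it's worth, the double call itself is straightforward to fix
locally, whatever we decide about the argument: take the snapshot once and
pass the same one to both PushActiveSnapshot() and table_tuple_lock().
An untested sketch of that call site:

    Snapshot    snapshot = GetLatestSnapshot();

    PushActiveSnapshot(snapshot);

    res = table_tuple_lock(rel, &(outslot->tts_tid),
                           snapshot,    /* same snapshot we just pushed */
                           outslot,
                           GetCurrentCommandId(false),
                           lockmode,
                           LockWaitBlock,
                           0 /* don't follow updates */ ,
                           &tmfd);

    PopActiveSnapshot();
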
> > However, I think GetLatestSnapshot() is wrong here anyway. None of this
> > matters for heapam, which just ignores the 'snapshot' argument, but
> > let's assume a different AM that would use the snapshot to identify the
> > tuple version. The TID was fetched from an index using an index scan
> > with SnapshotDirty. There's no guarantee that the latest snapshot would
> > match the same tuple version that the index scan found. If an MVCC
> > snapshot is needed, surely it should be acquired before the index scan,
> > and used for the index scan as well.
> >
> > I see three options:
> >
> > 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> > get a table AM where a single TID represents multiple row versions,
> > this will need to be revisited.
> >
> > 2. Rewrite the recheck code in execReplication.c so that it uses the
> > snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
> > and use the same snapshot in the index scan and table_tuple_lock().
> > Acquiring a snapshot isn't free though, so it would be nice to avoid
> > doing that when heapam is just going to ignore it anyway. If we go with
> > this option, I think we could reuse the snapshot that is already active
> > in most cases, and only take a new snapshot if the tuple was
> > concurrently updated.
> >
> > 3. Modify the tableam interface so that the index scan can return a
> > more unique identifier of the tuple version. In heapam, it could be the
> > TID like today, but a different AM could return some other token.
> > Continue to use SnapshotDirty in the index scan, but in the call to
> > table_tuple_lock(), instead of passing GetLatestSnapshot() and the TID,
> > pass the token you got from index_getnext_slot().
>
> Thank you for your finding and analysis.  I would vote for #3.
> Currently the snapshot argument is not properly used, and the heapam
> code doesn't check it.  I think allowing a flexible tuple identifier is
> essential for the future of the table AM API.  Therefore, any
> snapshot/visibility information could be put inside that tuple
> identifier if needed.
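
To make #3 a bit more concrete: the scan could hand back an opaque
per-version token next to the slot, and the lock call would take that
token instead of the (tid, snapshot) pair. A very rough sketch, with all
names hypothetical:

    /* Hypothetical AM-defined locator for a single row version. */
    typedef struct TupleVersionToken
    {
        ItemPointerData tid;    /* heapam would need only this */
        Datum           amdata; /* other AMs: version/visibility info */
    } TupleVersionToken;

    /*
     * Hypothetical variant of table_tuple_lock(), taking the token
     * instead of ItemPointer tid and Snapshot snapshot.
     */
    extern TM_Result table_tuple_lock_version(Relation rel,
                                              TupleVersionToken *token,
                                              TupleTableSlot *slot,
                                              CommandId cid,
                                              LockTupleMode mode,
                                              LockWaitPolicy wait_policy,
                                              uint8 flags,
                                              TM_FailureData *tmfd);

With something like this, execReplication.c could keep using SnapshotDirty
for the index scan and simply pass the token it got back to the lock call.
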
To be more specific, I vote for #1 + #3.  Remove the snapshot argument
for now, then work on a flexible tuple identifier.

------
Regards,
Alexander Korotkov
Supabase