Hi!

On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> > /*
> >  * Lock a tuple in the specified mode.
> >  *
> >  * Input parameters:
> >  *	relation: relation containing tuple (caller must hold suitable lock)
> >  *	tid: TID of tuple to lock
> >  *	snapshot: snapshot to use for visibility determinations
> >  *	cid: current command ID (used for visibility test, and stored into
> >  *		tuple's cmax if lock is successful)
> >  *	mode: lock mode desired
> >  *	wait_policy: what to do if tuple lock is not available
> >  *	flags:
> >  *		If TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS, follow the update chain to
> >  *		also lock descendant tuples if lock modes don't conflict.
> >  *		If TUPLE_LOCK_FLAG_FIND_LAST_VERSION, follow the update chain and lock
> >  *		latest version.
> >  *
> >  * Output parameters:
> >  *	*slot: contains the target tuple
> >  *	*tmfd: filled in failure cases (see below)
> >  *
> >  * Function result may be:
> >  *	TM_Ok: lock was successfully acquired
> >  *	TM_Invisible: lock failed because tuple was never visible to us
> >  *	TM_SelfModified: lock failed because tuple updated by self
> >  *	TM_Updated: lock failed because tuple updated by other xact
> >  *	TM_Deleted: lock failed because tuple deleted by other xact
> >  *	TM_WouldBlock: lock couldn't be acquired and wait_policy is skip
> >  *
> >  * In the failure cases other than TM_Invisible and TM_Deleted, the routine
> >  * fills *tmfd with the tuple's t_ctid, t_xmax, and, if possible, t_cmax.  See
> >  * comments for struct TM_FailureData for additional info.
> >  */
> > static inline TM_Result
> > table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot,
> > 				 TupleTableSlot *slot, CommandId cid,
> > 				 LockTupleMode mode,
> > 				 LockWaitPolicy wait_policy, uint8 flags,
> > 				 TM_FailureData *tmfd)
>
> What are the semantics of the 'snapshot' argument? In the heapam
> implementation, it's not used for anything. What visibility checks might
> the function do in a different implementation? I vaguely remember that
> the idea was that the TID might not be sufficient to uniquely identify
> the row version in something like zheap, which updates the row in place.
> In that case, all the different row versions are represented by the same
> TID, and the snapshot identifies the version.
>
> There are a few callers of table_tuple_lock:
>
> 1. trigger.c: GetTupleForTrigger
> 2. nodeModifyTable.c
> 3. nodeLockRows.c
> 4. execReplication.c
>
> The first three callers pass the EState's snapshot, the same one that was
> used in the table or index scan that returned the TID. That makes sense.
> But the calls in execReplication.c look fishy:
>
> > 		PushActiveSnapshot(GetLatestSnapshot());
> >
> > 		res = table_tuple_lock(rel, &(outslot->tts_tid),
> > 							   GetLatestSnapshot(),
> > 							   outslot,
> > 							   GetCurrentCommandId(false),
> > 							   lockmode,
> > 							   LockWaitBlock,
> > 							   0 /* don't follow updates */ ,
> > 							   &tmfd);
> >
> > 		PopActiveSnapshot();
> >
> > 		if (should_refetch_tuple(res, &tmfd))
> > 			goto retry;
>
> Firstly, this calls GetLatestSnapshot() twice, which seems plain wrong.
> I think the idea was to push the latest snapshot and use the same
> snapshot in the call to table_tuple_lock(). But because each call to
> GetLatestSnapshot() takes a new snapshot, this pushes one snapshot as
> the active snapshot and passes a *different* snapshot to
> table_tuple_lock(). This went wrong in commit 5db6df0c01, which
> introduced the update/delete/insert/lock table AM interface. The
> argument to table_tuple_lock() was supposed to be GetActiveSnapshot().
>
> However, I think GetLatestSnapshot() is wrong here anyway. None of this
> matters for heapam, which just ignores the 'snapshot' argument, but let's
> assume a different AM that would use the snapshot to identify the tuple
> version. The TID was fetched from an index using an index scan with
> SnapshotDirty. There's no guarantee that the latest snapshot would match
> the same tuple version that the index scan found. If an MVCC snapshot is
> needed, surely it should be acquired before the index scan, and used for
> the index scan as well.
>
> I see three options:
>
> 1. Remove the 'snapshot' argument, since it's not used by heapam. If we
> get a table AM where a single TID represents multiple row versions, this
> will need to be revisited.
>
> 2. Rewrite the recheck code in execReplication.c so that it uses the
> snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
> and use the same snapshot in the index scan and table_tuple_lock().
> Acquiring a snapshot isn't free though, so it would be nice to avoid
> doing that when heapam is just going to ignore it anyway. If we go
> with this option, I think we could reuse the snapshot that is already
> active in most cases, and only take a new snapshot if the tuple was
> concurrently updated.
>
> 3. Modify the tableam interface so that the index scan can return a more
> unique identifier of the tuple version. In heapam, it could be the TID
> like today, but a different AM could return some other token. Continue
> to use SnapshotDirty in the index scan, but in the call to
> table_tuple_lock(), instead of passing GetLatestSnapshot() and the TID,
> pass the token you got from index_getnext_slot().
Thank you for your finding and analysis. I would vote for #3. Currently,
the snapshot argument is not used properly, and the heapam code doesn't
check it at all. I think allowing a flexible tuple identifier is essential
for the future of the table AM API; any snapshot/visibility information
could then be put inside that tuple identifier if needed.
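To illustrate what I have in mind for #3, here is a rough sketch only; the
struct and field names below are made up for illustration, this is not a
concrete proposal:

/*
 * Hypothetical opaque row version token.  An index scan would fill it in,
 * and table_tuple_lock() would take it instead of the separate tid and
 * snapshot arguments.  heapam would store only the TID; an AM that updates
 * rows in place could add whatever it needs to identify the row version
 * (e.g. an undo pointer or a version counter).
 */
typedef struct TupleVersionToken
{
	ItemPointerData tid;		/* physical location, as today */
	uint64		version;	/* AM-defined; heapam would leave it 0 */
} TupleVersionToken;

static inline TM_Result
table_tuple_lock(Relation rel, TupleVersionToken *token,
				 TupleTableSlot *slot, CommandId cid,
				 LockTupleMode mode, LockWaitPolicy wait_policy,
				 uint8 flags, TM_FailureData *tmfd);

With something like this, execReplication.c could keep using SnapshotDirty
in the index scan and simply pass the token through to table_tuple_lock(),
without calling GetLatestSnapshot() at all.

------
Regards,
Alexander Korotkov
Supabase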