On 21/01/2025 12:05, Alexander Korotkov wrote:
On Tue, Jan 21, 2025 at 12:03 PM Alexander Korotkov
<aekorot...@gmail.com> wrote:
On Sun, Dec 29, 2024 at 3:59 PM Heikki Linnakangas <hlinn...@iki.fi> wrote:
However, I think GetLatestSnapshot() is wrong here anyway. None of this
matters for heapam which just ignores the 'snapshot' argument, but let's
assume a different AM that would use the snapshot to identify the tuple
version. The TID was fetched from an index using an index scan with
SnapshotDirty. There's no guarantee that the LatestSnapshot would match
the same tuple version that the index scan found. If an MVCC snapshot is
needed, surely it should be acquired before the index scan, and used for
the index scan as well.
I see three options:
1. Remove the 'snapshot' argument, since it's not used by heapam. If we
get a table AM where a single TID represents multiple row versions, this
will need to be revisited.
2. Rewrite the recheck code in execReplication.c so that it uses the
snapshot in a more consistent fashion. Call GetLatestSnapshot() first,
and use the same snapshot in the index scan and table_tuple_lock().
Acquiring a snapshot isn't free though, so it would be nice to avoid
doing that when the heapam is just going to ignore it anyway. If we go
with this option, I think we could reuse the snapshot that is already
active in most cases, and only take a new snapshot if the tuple was
concurrently updated.
3. Modify the tableam interface so that the index scan can return a more
unique identifier of the tuple version. In heapam, it could be the TID
like today, but a different AM could return some other token. Continue
to use SnapshotDirty in the index scan, but in the call to
table_tuple_lock(), instead of passing GetLatestSnapshot() and TID, pass
the token you got index_getnext_slot().
Thank you for your finding and analysis. I would vote for #3.
Currently snapshot argument is not properly used, and the heapam code
doesn't check it. I think allowing flexible tuple identifier is
essential for future of table AM API. Therefore, any
snapshot/visibility information could be put inside that tuple
identifier if needed.
To be more specific, I vote for #1 + #3. Remove the snapshot argument
for now, then work on flexible tuple identifier.
I committed and backpatched the trivial fix for not, just replacing
GetLatestSnapshot() with GetActiveSnapshot(). I got cold feet on just
removing the argument without providing a replacement, because it's not
100% clear to me if the current situation is broken or not.
To summarize the issue, RelationFindReplTupleByIndex() performs an index
scan with DirtySnapshot, and then calls table_tuple_lock() with a fresh
MVCC snapshot to lock the row. There's no guarantee that the index scan
found the same row version that table_tuple_lock() will lock, if the TID
alone doesn't uniquely identify it. But that's still OK as long as the
key column hasn't changed, like with heapam's HOT updates. I couldn't
convince myself that it's broken nor that it's guaranteed to be correct
with other AMs.
Thanks!
--
Heikki Linnakangas
Neon (https://neon.tech)