Hello, everyone and Peter! Peter, I have added you because you may be interested in (or already know about) this btree-related issue.
Short description of the problem: I noticed a concurrency issue in btree index scans that affects the SnapshotDirty and SnapshotSelf scan types. When these non-MVCC snapshot types are used, a scan can miss tuples if a concurrent transaction deletes an existing tuple and inserts a new one with a different TID on the same page.

The problem occurs because:
1. The scan reads a page and caches its tuples in backend-local storage.
2. A concurrent transaction deletes a tuple and inserts a new one with a different TID.
3. The cached (old) tuple fails the visibility check, because it has already been deleted by a committed transaction.
4. The new version on the page is missed as well, because it is not among the cached tuples.

This may cause issues with:
- logical replication (RelationFindReplTupleByIndex fails);
- an invalid conflict message (MISSING instead of ORIGIN_DIFFERS), and probably other issues with the upcoming conflict resolution for logical replication;
- false negatives in check_exclusion_or_unique_constraint (although, as far as I can see, this does not currently cause any real issues).

The fix implemented in this version of the patch:
- retains the read lock on a page for SnapshotDirty and SnapshotSelf scans until we are completely done with all tuples from that page;
- introduces a new 'extra_unlock' field in BTScanPos to track when a lock is held longer than usual;
- updates the documentation to explain this special locking behavior.

Yes, this may cause some performance degradation because of the additional lock. Another possible idea is to use a fresh MVCC snapshot in such cases (but I think it is still better to fix, or at least document, this issue anyway).

Best regards,
Mikhail.
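The interleaving in steps 1-4 can be illustrated with a small standalone C model. This is only a toy sketch, not PostgreSQL code: ToyRelation, toy_init, toy_update and toy_dirty_scan are hypothetical names, the heap is reduced to a per-TID "deleted by a committed transaction" flag, and the concurrent delete + insert + commit is assumed to happen as one atomic step. With hold_lock = false the writer commits between caching and checking, so the scan finds nothing. With hold_lock = true the writer is assumed to block on the page lock (it needs it to insert its index entry), so it cannot commit until the scan is done with the page; this models only the reasoning behind the fix, not its implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are PostgreSQL APIs. */
#define MAX_ITEMS 16

typedef struct { int key; int tid; } IndexEntry;

typedef struct {
    IndexEntry items[MAX_ITEMS];   /* entries on a single index leaf page */
    int nitems;
    bool heap_deleted[MAX_ITEMS];  /* indexed by TID: deleted by a committed xact */
} ToyRelation;

/* Start with one live tuple (key, tid) on the page. */
static void toy_init(ToyRelation *rel, int key, int tid)
{
    assert(tid >= 0 && tid < MAX_ITEMS);
    for (int i = 0; i < MAX_ITEMS; i++)
        rel->heap_deleted[i] = false;
    rel->items[0].key = key;
    rel->items[0].tid = tid;
    rel->nitems = 1;
}

/* Concurrent writer: delete the live version of `key`, insert a new version
 * with a different TID, and commit (modeled as one atomic step). */
static void toy_update(ToyRelation *rel, int key, int new_tid)
{
    assert(rel->nitems < MAX_ITEMS && new_tid >= 0 && new_tid < MAX_ITEMS);
    for (int i = 0; i < rel->nitems; i++)
        if (rel->items[i].key == key && !rel->heap_deleted[rel->items[i].tid])
            rel->heap_deleted[rel->items[i].tid] = true;
    rel->heap_deleted[new_tid] = false;
    rel->items[rel->nitems].key = key;
    rel->items[rel->nitems].tid = new_tid;
    rel->nitems++;
}

/* Dirty-snapshot-like scan for `key`; returns the number of visible matches. */
static int toy_dirty_scan(ToyRelation *rel, int key, int new_tid, bool hold_lock)
{
    IndexEntry cache[MAX_ITEMS];
    int ncached = rel->nitems;
    int found = 0;

    /* Step 1: copy the page's entries into backend-local storage. */
    for (int i = 0; i < ncached; i++)
        cache[i] = rel->items[i];

    /* Step 2: without the lock, the writer can commit right now. */
    if (!hold_lock)
        toy_update(rel, key, new_tid);

    /* Steps 3-4: only cached entries are checked against the current heap state. */
    for (int i = 0; i < ncached; i++)
        if (cache[i].key == key && !rel->heap_deleted[cache[i].tid])
            found++;

    /* With the lock held, the writer proceeds only after the page is released. */
    if (hold_lock)
        toy_update(rel, key, new_tid);

    return found;
}
```

For example, after toy_init(&rel, 42, 1), toy_dirty_scan(&rel, 42, 2, false) returns 0 (the row is missed entirely), while the same call on a freshly initialized relation with hold_lock = true returns 1.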
v5-0002-Fix-btree-index-scan-concurrency-issues-with-dirt.patch
v5-0001-Add-isolation-test-to-reproduce-dirty-snapshot-sc.patch