On Mon, Jun 9, 2025 at 10:47 AM Peter Geoghegan <p...@bowt.ie> wrote:
> I've always thought that the whole idiom of testing
> BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if
> it makes sense to totally replace those calls/tests with similar tests
> of the new so->dropPin field.
It's definitely possible to avoid testing BTScanPosIsPinned like this
entirely. The attached v2 patch shows how. v2 fully removes
BTScanPosUnpinIfPinned, and limits the use of BTScanPosIsPinned to
assertions. I think that this approach makes our buffer pin resource
management strategy much clearer, particularly in cases involving
mark/restore. This might help with Tomas Vondra's index prefetching
patch -- I know that Tomas found the mark/restore aspects of his
latest approach challenging.

In v2, I added assertions at the top of _bt_steppage that
independently verify the same conditions as the ones that are already
tested whenever _bt_steppage calls _bt_killitems (these include the
assertion that Alexander showed could fail):

static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    BlockNumber blkno,
                lastcurrblkno;

    Assert(BTScanPosIsValid(so->currPos));
    if (!so->dropPin)
        Assert(BTScanPosIsPinned(so->currPos));
    else
        Assert(!BTScanPosIsPinned(so->currPos));
    ...

That way, if there are remaining issues like the ones that Alexander
reported, we don't rely on reaching _bt_killitems to get an assertion
failure.

(Note that adding these assertions necessitated making _bt_firstpage
call _bt_readnextpage instead of its _bt_steppage wrapper.
_bt_steppage is now "only called when _bt_drop_lock_and_maybe_pin was
called for so->currPos", which doesn't apply to the case where
_bt_firstpage previously called _bt_steppage. This brings
_bt_firstpage in line with _bt_readnextpage itself -- now we only call
_bt_steppage for a page that we already returned to the scan via
btgettuple, i.e. a page that might have had its pin released by
_bt_drop_lock_and_maybe_pin.)

v2 also completely removes all IncrBufferRefCount() calls from nbtree.
These were never really necessary: we don't actually need to hold more
than one pin on the same page at any point, not in cases involving
mark/restore, nor in any other case. We can always get by with a
single buffer pin per page. There were 2 IncrBufferRefCount() calls in
total, both of which v2 removes (rough sketches of both changes appear
at the end of this email):

The first IncrBufferRefCount call removed was in _bt_steppage. We copy
so->currPos into so->markPos when leaving the page in _bt_steppage.
We're invalidating so->currPos in passing here, so we don't *need* to
play games with IncrBufferRefCount -- we can just not release the
original pin, on the understanding that the pin is "transferred" to
so->markPos (so->currPos can therefore be invalidated without its pin
ever being released through it).

The second IncrBufferRefCount call removed was in btrestrpos. When we
restore a mark, we need to keep the original mark around, in case it
needs to be restored again a little later -- the so->markPos mark must
not go away. But that in itself doesn't necessitate holding multiple
buffer pins on the same page at once. We can keep the old mark around
without keeping a separate buffer pin for so->markPos: we can just set
so->markItemIndex instead, which is an alternative (and more
efficient) way of representing the mark (the so->markPos
representation isn't truly needed). This is okay because our new
so->currPos is for the same page as the mark -- that's exactly the
case that so->markItemIndex is supposed to be used for.

-- 
Peter Geoghegan
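PS: To make the _bt_steppage side of this a bit more concrete, here is
a rough sketch of the relevant mark handling (simplified, and not
lifted verbatim from v2 -- the fields and macros are the real
BTScanOpaqueData/BTScanPosData ones, but the surrounding details are
elided):

/*
 * Sketch of _bt_steppage's mark handling, without IncrBufferRefCount.
 * so->currPos is about to be invalidated anyway, so the pin that it
 * holds (in the !so->dropPin case) can simply be handed over to
 * so->markPos.
 */
if (so->markItemIndex >= 0)
{
    /* Copy currPos (including currPos.buf) into markPos */
    memcpy(&so->markPos, &so->currPos,
           offsetof(BTScanPosData, items[1]) +
           so->currPos.lastItem * sizeof(BTScanPosItem));
    if (so->markTuples)
        memcpy(so->markTuples, so->currTuples,
               so->currPos.nextTupleOffset);
    so->markPos.itemIndex = so->markItemIndex;
    so->markItemIndex = -1;

    /*
     * markPos now owns the pin (if there is one).  Clear currPos.buf so
     * that nothing later releases that pin through currPos.
     */
    so->currPos.buf = InvalidBuffer;
}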
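And a similar sketch of the btrestrpos side, for the case where the
scan has moved off of the marked page since btmarkpos was called
(again, hand-waving over details that the patch itself deals with; the
old so->currPos is assumed to have already been dealt with at this
point):

/*
 * Restoring a mark that isn't on the scan's current page.  The pin
 * held for so->markPos (if any) is transferred to so->currPos, rather
 * than bumping the buffer's reference count.  The mark itself lives
 * on as so->markItemIndex, which works because currPos is now the
 * marked page.
 */
if (BTScanPosIsValid(so->markPos))
{
    /* markPos becomes currPos -- no IncrBufferRefCount on markPos.buf */
    memcpy(&so->currPos, &so->markPos,
           offsetof(BTScanPosData, items[1]) +
           so->markPos.lastItem * sizeof(BTScanPosItem));
    if (so->currTuples)
        memcpy(so->currTuples, so->markTuples,
               so->markPos.nextTupleOffset);

    /* Remember the mark via markItemIndex instead of markPos */
    so->markItemIndex = so->markPos.itemIndex;
    BTScanPosInvalidate(so->markPos);   /* doesn't release the pin */
}
else
    BTScanPosInvalidate(so->currPos);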
v2-0001-Fix-issue-with-mark-restore-nbtree-pins.patch