On Mon, Jun 9, 2025 at 10:47 AM Peter Geoghegan <p...@bowt.ie> wrote:
> I've always thought that the whole idiom of testing
> BTScanPosIsPinned() in a bunch of places was very brittle. I wonder if
> it makes sense to totally replace those calls/tests with similar tests
> of the new so->dropPin field.

It's definitely possible to avoid testing BTScanPosIsPinned like this
entirely. The attached v2 patch shows how.

v2 fully removes BTScanPosUnpinIfPinned, and limits the use of
BTScanPosIsPinned to assertions. I think that this approach makes our
buffer pin resource management strategy much clearer, particularly in
cases involving mark/restore. This might help with Tomas Vondra's
index prefetching patch -- I know that Tomas found the mark/restore
aspects of his latest approach challenging.

In v2, I added assertions at the top of _bt_steppage that
independently verify the same conditions as the ones that are already
tested whenever _bt_steppage calls _bt_killitems (these include the
assertion that Alexander showed could fail):

static bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
    BlockNumber blkno,
                lastcurrblkno;

    Assert(BTScanPosIsValid(so->currPos));
    if (!so->dropPin)
        Assert(BTScanPosIsPinned(so->currPos));
    else
        Assert(!BTScanPosIsPinned(so->currPos));
...

That way if there are remaining issues like the ones that Alexander
reported, we don't rely on reaching _bt_killitems to get an assertion
failure.

(Note that adding these assertions necessitated making _bt_firstpage
call _bt_readnextpage instead of its _bt_steppage wrapper.
_bt_steppage is now "Only called when _bt_drop_lock_and_maybe_pin was
called for so->currPos", which doesn't apply to the case where
_bt_firstpage previously called _bt_steppage. This brings
_bt_firstpage in line with _bt_readnextpage itself -- now we only call
_bt_steppage for a page that we returned to the scan via btgettuple,
i.e. a page that might have had its pin released by
_bt_drop_lock_and_maybe_pin.)

v2 also completely removes all IncrBufferRefCount() calls from nbtree.
These were never really necessary, since we never actually need to
hold more than one pin at a time on the same page -- not in cases
involving mark/restore, nor in any other case. We can always get by
with a single buffer pin per page.

There were 2 IncrBufferRefCount() calls total, both of which v2 removes:

The first IncrBufferRefCount call removed was in _bt_steppage. We copy
so->currPos into so->markPos when leaving the page in _bt_steppage.
Since we're invalidating so->currPos in passing here, we don't *need*
to play games with IncrBufferRefCount -- we can just not release the
original pin, on the understanding that the pin is "transferred" to
so->markPos (so->currPos can therefore be invalidated without its pin
being released).

The second IncrBufferRefCount call removed was in btrestrpos. When we
restore a mark, we need to keep around the original mark, in case it
needs to be restored a little later -- the so->markPos mark must not
go away. But that in itself doesn't necessitate holding multiple
buffer pins on the same page at once. We can keep the old mark around,
without keeping a separate buffer pin for so->markPos. We can just set
so->markItemIndex instead -- that's an alternative (and more
efficient) way of representing the mark (the so->markPos
representation isn't truly needed). This is okay, since our new
so->currPos is for the same page as the mark -- that's how
so->markItemIndex  is supposed to be used.

-- 
Peter Geoghegan

Attachment: v2-0001-Fix-issue-with-mark-restore-nbtree-pins.patch