On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
<florisvan...@optiver.com> wrote:
> Anyone please correct me if I'm wrong, but I think one case where the current 
> patch relies on some data from the page it has locked before it in checking 
> this hi/lo key. I think it's possible for the following sequence to happen. 
> Suppose we have a very simple one leaf-page btree containing four elements: 
> leaf page 1 = [2,4,6,8]
> We do a backwards index skip scan on this and have just returned our first 
> tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in 
> and inserts a tuple (value 5) into this page, but suppose the page happens to 
> be full. So a page split occurs. As far as I know, a page split could happen 
> at any random element in the page. One of the situations we could be left 
> with is:
> Leaf page 1 = [2,4]
> Leaf page 2 = [5,6,8]
> However, our scan is still pointing to leaf page 1. For non-skip scans this 
> is not a problem, as we already read all matching elements in our local 
> buffer and we'll return those. But the skip scan currently:
> a) checks the lo-key of the page to see if the next prefix can be found on 
> the leaf page 1
> b) finds out that this is actually true
> c) does a search on the page and returns value=4 (while it should have 
> returned value=6)
>
> Peter, is my understanding about the btree internals correct so far?

This is a good summary. This is the kind of scenario I had in mind
when I expressed a general concern about "stopping between pages".
Processing a whole page at a time is a crucial part of how
_bt_readpage() currently deals with concurrent page splits.

Holding a buffer pin on a leaf page is only effective as an interlock
against VACUUM completely removing a tuple, which could matter with
non-MVCC scans.

> Now that I look at the patch again, I fear there currently may also be such a 
> dependency in the "Advance forward but read backward"-case. It saves the 
> offset number of a tuple in a variable, then does a _bt_search (releasing the 
> lock and pin on the page). At this point, anything can happen to the tuples 
> on this page - the page may be compacted by vacuum such that the offset 
> number you have in your variable does not match the actual offset number of 
> the tuple on the page anymore. Then, at the check for (nextOffset == 
> startOffset) later, there's a possibility the offsets are different even 
> though they relate to the same tuple.

If skip scan is restricted to heapkeyspace indexes (i.e. those created
on Postgres 12+), then it might be reasonable to save an index tuple,
and relocate it within the same page using a fresh binary search that
uses a scankey derived from the same index tuple -- without unsetting
scantid/the heap TID scankey attribute. I suppose that you'll need to
"find your place again" after releasing the buffer lock on a leaf page
for a time. Also, I think that this will only be safe with MVCC scans,
because otherwise the page could be concurrently deleted by VACUUM.

-- 
Peter Geoghegan


Reply via email to