Hi Dmitry,

> > On Wed, Jan 22, 2020 at 07:50:30AM +0000, Floris Van Nee wrote:
> >
> > Anyone please correct me if I'm wrong, but I think one case where the 
> > current patch relies on some data from the page it has locked before it in 
> > checking this hi/lo key. I think it's possible for the following sequence 
> > to happen. Suppose we have a very simple one leaf-page btree containing 
> > four elements: leaf page 1 = [2,4,6,8]
> > We do a backwards index skip scan on this and have just returned our first 
> > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes 
> > in and inserts a tuple (value 5) into this page, but suppose the page 
> > happens to be full. So a page split occurs. As far as I know, a page split 
> > could happen at any random element in the page. One of the situations we 
> > could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1.

> In case if we just returned a tuple, the next action would be either
>> check the next page for another key or search down to the tree. Maybe

But it won't look at the 'next page for another key', but rather at the 'same 
page or another key', right? In the _bt_scankey_within_page shortcut we're 
taking, there's no stepping to a next page involved. It just locks the page 
again that it previously also locked.

> I'm missing something in your scenario, but the latter will land us on a
> required page (we do not point to any leaf here), and before the former
> there is a check for high/low key. Is there anything else missing?

Let me try to clarify. After we return the first tuple, so->currPos.buf is 
pointing to page=1 in my example (it's the only page after all). We've returned 
item=8. Then the split happens and the items get rearranged as in my example. 
We're still pointing with so->currPos.buf to page=1, but the page now contains 
[2,4]. The split happened to the right, so there's a page=2 with [5,6,8], 
however the ongoing index scan is unaware of that.
Now _bt_skip gets called to fetch the next tuple. It starts by checking 
_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf, dir), the 
result of which will be 'true': we're comparing the skip key to the low key of 
the page. So it thinks the next key can be found on the current page. It locks 
the page and does a _binsrch, finding item=4 to be returned.

The problem here is that _bt_scankey_within_page mistakenly returns true, 
thereby limiting the search to just the page that it's pointing to already.
It may be fine to just fix this function to return the proper value (I guess 
it'd also need to look at the high key in this example). It could also be fixed 
by not looking at the lo/hi key of the page, but to use the local tuple buffer 
instead. We already did a _read_page once, so if we have any matching tuples on 
that specific page, we have them locally in the buffer already. That way we 
never need to lock the same page twice.


Reply via email to