Re: Index Skip Scan

Peter Geoghegan Wed, 22 Jan 2020 16:31:03 -0800

On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <p...@bowt.ie> wrote:
> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
> <florisvan...@optiver.com> wrote:
> > As far as I know, a page split could happen at any random element in the 
> > page. One of the situations we could be left with is:
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this 
> > is not a problem, as we already read all matching elements in our local 
> > buffer and we'll return those. But the skip scan currently:
> > a) checks the lo-key of the page to see if the next prefix can be found on 
> > the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have 
> > returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.


I want to be clear about what it means that the page doesn't have a
"low key". Let us once again start with a very simple one leaf-page
btree containing four elements: leaf page 1 = [2,4,6,8] -- just like
in Floris' original page split scenario.

Let us also say that page 1 has a left sibling page -- page 0. Page 0
happens to have a high key with the integer value 0. So you could
*kind of* claim that the "low key" of page 1 is the integer value 0
(page 1 values must be > 0) -- *not* the integer value 2 (the
so-called "low key" here is neither > 2, nor >= 2). More formally, an
invariant exists that says that all values on page 1 must be
*strictly* greater than the integer value 0. However, this formal
invariant thing is hard or impossible to rely on when we actually
reach page 1 and want to know about its lower bound -- since there is
no "low key" pivot tuple on page 1 (we can only speak of a "low key"
as an abstract concept, or something that works transitively from the
parent -- there is only a physical high key pivot tuple on page 1
itself).

Suppose further that Page 0 is now empty, apart from its "all values
on page are <= 0" high key (page 0 must have had a few negative
integer values in its tuples at some point, but not anymore). VACUUM
will delete the page, *changing the effective low key* of Page 0 in
the process. The lower bound from the shared parent page will move
lower/left as a consequence of the deletion of page 0. nbtree page
deletion makes the "keyspace move right, not left". So the "conceptual
low key" of page 1 just went down from 0 to -5 (say), without there
being any practical way of a skip scan reading page 1 noticing the
change (the left sibling of page 0, page -1, has a high key of <= -5,
say).

Not only is it possible for somebody to insert the value 1 in page 1
-- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed
to page 0 is still there -- all that page deletion changed about this
tuple is its downlink, which now points to page 1 instead or page 0.
Confusingly, page deletion removes the pivot tuple of the right
sibling page from the parent -- *not* the pivot tuple of the empty
page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in
truncated pivot tuples, and the heap TID tiebreaker column -- in
reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has
a huge comment that reasons about a race involving page deletion. In
general, page deletion is by far the biggest source of complexity when
reasoning about the key space.
-- 
Peter Geoghegan

Re: Index Skip Scan

Reply via email to