On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <p...@bowt.ie> wrote: > On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee > <florisvan...@optiver.com> wrote: > > As far as I know, a page split could happen at any random element in the > > page. One of the situations we could be left with is: > > Leaf page 1 = [2,4] > > Leaf page 2 = [5,6,8] > > However, our scan is still pointing to leaf page 1. For non-skip scans this > > is not a problem, as we already read all matching elements in our local > > buffer and we'll return those. But the skip scan currently: > > a) checks the lo-key of the page to see if the next prefix can be found on > > the leaf page 1 > > b) finds out that this is actually true > > c) does a search on the page and returns value=4 (while it should have > > returned value=6) > > > > Peter, is my understanding about the btree internals correct so far? > > This is a good summary. This is the kind of scenario I had in mind > when I expressed a general concern about "stopping between pages". > Processing a whole page at a time is a crucial part of how > _bt_readpage() currently deals with concurrent page splits.
I want to be clear about what it means that the page doesn't have a "low key". Let us once again start with a very simple one leaf-page btree containing four elements: leaf page 1 = [2,4,6,8] -- just like in Floris' original page split scenario. Let us also say that page 1 has a left sibling page -- page 0. Page 0 happens to have a high key with the integer value 0. So you could *kind of* claim that the "low key" of page 1 is the integer value 0 (page 1 values must be > 0) -- *not* the integer value 2 (the so-called "low key" here is neither > 2, nor >= 2). More formally, an invariant exists that says that all values on page 1 must be *strictly* greater than the integer value 0. However, this formal invariant thing is hard or impossible to rely on when we actually reach page 1 and want to know about its lower bound -- since there is no "low key" pivot tuple on page 1 (we can only speak of a "low key" as an abstract concept, or something that works transitively from the parent -- there is only a physical high key pivot tuple on page 1 itself). Suppose further that Page 0 is now empty, apart from its "all values on page are <= 0" high key (page 0 must have had a few negative integer values in its tuples at some point, but not anymore). VACUUM will delete the page, *changing the effective low key* of Page 0 in the process. The lower bound from the shared parent page will move lower/left as a consequence of the deletion of page 0. nbtree page deletion makes the "keyspace move right, not left". So the "conceptual low key" of page 1 just went down from 0 to -5 (say), without there being any practical way of a skip scan reading page 1 noticing the change (the left sibling of page 0, page -1, has a high key of <= -5, say). Not only is it possible for somebody to insert the value 1 in page 1 -- now they can insert the value -3 or -4! More concretely, the pivot tuple in the parent that originally pointed to page 0 is still there -- all that page deletion changed about this tuple is its downlink, which now points to page 1 instead or page 0. Confusingly, page deletion removes the pivot tuple of the right sibling page from the parent -- *not* the pivot tuple of the empty page that gets deleted (in this case page 0) itself. Note: this example ignores things like negative infinity values in truncated pivot tuples, and the heap TID tiebreaker column -- in reality this would look a bit different because of those factors. See also: amcheck's bt_right_page_check_scankey() function, which has a huge comment that reasons about a race involving page deletion. In general, page deletion is by far the biggest source of complexity when reasoning about the key space. -- Peter Geoghegan