On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee <florisvan...@optiver.com> wrote: > Anyone please correct me if I'm wrong, but I think one case where the current > patch relies on some data from the page it has locked before it in checking > this hi/lo key. I think it's possible for the following sequence to happen. > Suppose we have a very simple one leaf-page btree containing four elements: > leaf page 1 = [2,4,6,8] > We do a backwards index skip scan on this and have just returned our first > tuple (8). The buffer is left pinned but unlocked. Now, someone else comes in > and inserts a tuple (value 5) into this page, but suppose the page happens to > be full. So a page split occurs. As far as I know, a page split could happen > at any random element in the page. One of the situations we could be left > with is: > Leaf page 1 = [2,4] > Leaf page 2 = [5,6,8] > However, our scan is still pointing to leaf page 1. For non-skip scans this > is not a problem, as we already read all matching elements in our local > buffer and we'll return those. But the skip scan currently: > a) checks the lo-key of the page to see if the next prefix can be found on > the leaf page 1 > b) finds out that this is actually true > c) does a search on the page and returns value=4 (while it should have > returned value=6) > > Peter, is my understanding about the btree internals correct so far?
This is a good summary. This is the kind of scenario I had in mind when I expressed a general concern about "stopping between pages". Processing a whole page at a time is a crucial part of how _bt_readpage() currently deals with concurrent page splits. Holding a buffer pin on a leaf page is only effective as an interlock against VACUUM completely removing a tuple, which could matter with non-MVCC scans. > Now that I look at the patch again, I fear there currently may also be such a > dependency in the "Advance forward but read backward"-case. It saves the > offset number of a tuple in a variable, then does a _bt_search (releasing the > lock and pin on the page). At this point, anything can happen to the tuples > on this page - the page may be compacted by vacuum such that the offset > number you have in your variable does not match the actual offset number of > the tuple on the page anymore. Then, at the check for (nextOffset == > startOffset) later, there's a possibility the offsets are different even > though they relate to the same tuple. If skip scan is restricted to heapkeyspace indexes (i.e. those created on Postgres 12+), then it might be reasonable to save an index tuple, and relocate it within the same page using a fresh binary search that uses a scankey derived from the same index tuple -- without unsetting scantid/the heap TID scankey attribute. I suppose that you'll need to "find your place again" after releasing the buffer lock on a leaf page for a time. Also, I think that this will only be safe with MVCC scans, because otherwise the page could be concurrently deleted by VACUUM. -- Peter Geoghegan