On Tue, Jul 22, 2014 at 10:49 PM, Peter Geoghegan <p...@heroku.com> wrote: >> Basically I think it will be better if you can explain in bit more detail >> that >> how does "right-links at all levels and high-key" helps to detect and >> recover from concurrent page splits. > > You might be right about that - perhaps I should go into more detail.
I've gone into more detail on what a high key is, and how we arguably do not follow Lehman & Yao to the letter because we still have read locks. L&Y's assumption of atomic page reads/writes, and cursory handling of deletion kind of make it inevitable that read locks are used, which I now imply. So nbtree isn't a substandard L&Y implementation - it's a realistic one, which only needs to hold a single read lock at a time when servicing index scans (or when finding a place for insertion). I guess Lehman & Yao preferred to put forward the claim "no read locks" rather than "only one read lock at a time on internal pages, even for insertion" because there might be some uses of their algorithm where that is actually realistic. It is really a sympathetic way of spinning things to say that *no* read locks are used, though. -- Peter Geoghegan
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README new file mode 100644 index 4820f76..8285050 *** a/src/backend/access/nbtree/README --- b/src/backend/access/nbtree/README *************** use a simplified version of the deletion *** 11,16 **** --- 11,50 ---- Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). + Lehman and Yao don't require read locks, but assume that in-memory + copies of tree pages are unshared. Postgres shares in-memory buffers + among backends. As a result, we do page-level read locking on btree + pages in order to guarantee that no record is modified while we are + examining it. This reduces concurrency but guarantees correct + behavior. An advantage is that when trading in a read lock for a + write lock, we need not re-read the page after getting the write lock. + Since we're also holding a pin on the shared buffer containing the + page, we know that buffer still contains the page and is up-to-date. + + Although it could be argued that Lehman and Yao isn't followed to the + letter because single pages are read locked as the tree is descended, + this is at least necessary to support deletion, a common requirement + which L&Y hardly acknowledge. Read locks also ensure that B-tree + pages are self-consistent (L&Y appear to assume atomic page reads and + writes). Even with these read locks, following L&Y obviates the need + of earlier schemes to hold multiple locks concurrently when descending + the tree as part of servicing index scans (pessimistic lock coupling). + The addition of right-links at all levels, as well as the addition of + a page "high key" allows detection and dynamic recovery from + concurrent page splits (that is, splits between unlocking an internal + page, and subsequently locking its child page during a descent). When + a page is first locked (at every level of a descent servicing an index + scan), we consider the need to "move right": if the scankey value is + less than (or sometimes less than or equal to) the page's existing + highkey value, a value which serves as an upper bound for values on + the page generally, then it must be necessary to move the scan to the + right-hand page on the same level. It's even possible that the scan + needs to move right more than once. Once the other session's + concurrent page split finishes, a downlink will be inserted into the + parent, and so assuming there are no further page splits, future index + scans using the same scankey value will not need to move right. L&Y + Trees are sometimes referred to as "B-Link trees" in the literature. + The Lehman and Yao Algorithm and Insertions ------------------------------------------- *************** to be inserted has a choice whether or n *** 42,57 **** key could go on either page. (Currently, we try to find a page where there is room for the new key without a split.) - Lehman and Yao don't require read locks, but assume that in-memory - copies of tree pages are unshared. Postgres shares in-memory buffers - among backends. As a result, we do page-level read locking on btree - pages in order to guarantee that no record is modified while we are - examining it. This reduces concurrency but guarantees correct - behavior. An advantage is that when trading in a read lock for a - write lock, we need not re-read the page after getting the write lock. - Since we're also holding a pin on the shared buffer containing the - page, we know that buffer still contains the page and is up-to-date. - We support the notion of an ordered "scan" of an index as well as insertions, deletions, and simple lookups. A scan in the forward direction is no problem, we just use the right-sibling pointers that --- 76,81 ----
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers