On Wed, Nov 16, 2016 at 5:06 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote: > + * Ideally, we'd compare every item in the index against every other > + * item in the index, and not trust opclass obedience of the transitive > + * law to bridge the gap between children and their grandparents (as > + * well as great-grandparents, and so on). We don't go to those > + * lengths because that would be prohibitively expensive, and probably > + * not markedly more effective in practice. > + */ > > I skimmed the Modern B-Tree Techniques paper recently, and there was a > section on "the cousin problem" when verifying btrees, which this > comment reminded me of.
I'm impressed that you made the connection. This comment reminded you of that section because it was directly influenced by it. I actually use the term "cousin page" elsewhere, because it's really useful to distinguish cousins from "true siblings" when talking about page deletion in particular...and I have a lot to say about page deletion. > I tried to come up with an example of a pair > of characters being switched in a collation that would introduce > undetectable corruption: > > T1. Order = a < b < c < d > > Btree = [a|c] > / \ > / \ > / \ > [a]-------[c] > | | > | | > [b]-------[d] > > T2. Order = a < c < b < d So, the Btree you show is supposed to be good? You've left it up to me to draw T2, the corrupt version? If so, here is a drawing of T2 for the sake of clarity: Btree = [a|b] / \ / \ / \ [a]-------[b] | | | | [c]-------[d] > Now c and b have been switched around. Won't amcheck still pass since > we only verify immediate downlinks and siblings? Yet searches for b > will take a wrong turn at the root. Do I have this right? I wonder > if there is a way to verify that each page's high key < the 'next' key > for all ancestors. I don't think you have it right, because your argument is predicated on classic L&Y, not the Postgres variant. The high key is literally guaranteed to be *equal* to the initial first value of the right half of a split page post-split (at least initially), which is where the new downlink for the new right half comes from in turn, so it will be exactly equal as well. There is a comment which says all this within _bt_insert_parent(), actually. In general, I think it's pretty bad that we're so loose about representing the key space in downlinks compared to classic L&Y, because it probably makes index bloat a lot worse in internal pages, messing up the key space long term. As long as we're doing that, though, we clearly cannot usefully "verify that each page's high key < the 'next' key for all ancestors", because we deviate from L&Y in a way that makes that check always fail. It's okay that it always fails for us (it doesn't actually indicate corruption), because of our duplicate-aware index scan logic (change to invariant). In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the cousin problem a non-issue, because we don't follow downlinks that are equal to our scankey -- we go to their immediate left and start *there*, in order to support duplicates in leaf pages. Index scans for 'b' work for both T1 and T2 in Postgres, because we refuse to follow the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I would actually like to change things to make the invariant the classic L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make suffix truncation in internal pages work. So, we don't have the cousin problem, but since I wish for us to adopt the stricter classic L&Y invariant at some point, you could say that I also wished that we had the cousin problem. Confused yet? :-) [1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers