On Tue, Feb 18, 2020 at 2:16 AM Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> Great, thank you very much!
No problem! My remarks here are based on
"amcheck-btree-improve-missing-parent-downlinks-check-6.patch". I have
found a false positive corruption report bug in this latest version --
see note below about incomplete page splits.

> > Don't need the "!P_ISLEAF()" here.
>
> Why don't I need. bt_downlink_connectivity_check() checks one level
> down to the target level. But there is no one level down to leaf...

Because offset_is_negative_infinity() checks P_ISLEAF() for you. Maybe
it's better your way, though -- apparently it's clearer.

> > > + * Alternatively
> > > + * we might find child with high key while traversing from
> > > + * previous downlink to current one. Then matching key resides
> > > + * the same offset number as current downlink.
> > > + */
> >
> > Not sure what "traversing from previous downlink to current one" means at
> > all.
>
> I've rephrased this comment, please check.

> Agree, replaced _bt_compare() with bt_pivot_tuple_identical(). It
> becomes even simpler now, thanks!

There was actually an even better reason to invent
bt_pivot_tuple_identical(): a call to _bt_compare() in amcheck needs to
do something like the extra steps that you see in routines like
invariant_l_offset(). _bt_compare() will return 0 when the compared
prefix of scankey/column values is equal, even though there may be
additional columns in the index tuple that are not compared. So, you
could have a truncated multi-column high key that is "equal" to a pivot
tuple in the parent that is actually to the right of it in the key
space.

This blind spot would often occur with low cardinality indexes, where
we often have something like this in pivot tuples on internal pages:

'foo, -inf'
'foo, (1,24)'
'food, -inf'   <-- This pivot tuple's downlink points to the final leaf
                   page that's filled with duplicates of the value 'foo'
'food, (3,19)' <-- This pivot tuple's downlink points to the *first*
                   leaf page that's filled with duplicates of the value
                   'food'
...
The situation is really common in low cardinality indexes because
nbtsplitloc.c hates splitting a leaf page between two duplicates -- it
is avoided whenever possible. You reliably get a '-inf' value for the
TID in the first pivot tuple for the duplicate, followed by a real heap
TID for later pivot tuples for pages with the same duplicate value.
(Anyway, it's not important now.)

> > I think bt_downlink_check() and bt_downlink_connectivity_check()
> > should be renamed to something broader. In my mind, downlink is
> > basically a block number. We have been sloppy about using the term
> > downlink when we really mean "pivot tuple with a downlink" -- I am
> > guilty of this myself. But it seems more important, now that you have
> > the new high key check.
>
> Hmm... Names are hard for me. I didn't do any renaming for now. What
> about this?
> bt_downlink_check() => bt_child_check()
> bt_downlink_connectivity_check() => bt_child_connectivity_highkey_check()

I suggest:

bt_downlink_check() => bt_child_check()
bt_downlink_connectivity_check() => bt_child_highkey_check()

While bt_downlink_connectivity_check() moves right on the target's
child level, this isn't the common case. Moving right like that doesn't
need to be suggested by the name of the function. Most of the time, we
just check the high key -- right?

> I've revised finding the matching pivot key for high key. Now, code
> assumes we should always find a matching pivot key. It could use both
> target high key or left sibling high key (which is memorized as "low
> key").
>
> I've checked this on normal indexes, but I didn't try to exercise this
> with broken indexes. I would appreciate if you do.

I can confirm that checking the high key in the target's child page
against the high key in the target (when appropriate) removes the
"cousin page verification blind spot" that I noticed in the last
version, as expected. Great!
* You should say "target page lsn" here instead:

pg@tpce:5432 [19852]=# select bt_index_parent_check(:'idxrelation',true, true);
ERROR:  mismatch between parent key and child high key in index "i_holding2"
DETAIL:  Target block=1570 child block=1690 parent page lsn=0/0.
Time: 12.509 ms

* Maybe say "Move to the right on the child level" in a comment above
the bt_downlink_connectivity_check() "while (true)" loop here:

> +
> +	while (true)
> +	{
> +		/*
> +		 * Did we traverse the whole tree level and this is check for pages to
> +		 * the right of rightmost downlink?
> +		 */

* If you are going to save a low key for the target page in memory,
then you only need to do so for "state->readonly"/parent verification.

* You should s/lokey/lowkey/ -- I prefer the spelling "lowkey" or "low
key". This is a term that nbtsort.c now uses, in case you didn't know.

* The reason for saving a low key for each target page is very unclear.
Can we fix this? The low key hardly ever gets used -- I can comment out
the code that uses a low key within bt_downlink_connectivity_check(),
and indexes that I tested appear fine. So I imagine this is for
incomplete split cases -- right?

I tested incomplete split cases using this simple hack:

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4e5849ab8e..5811338584 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1821,7 +1821,7 @@ _bt_insert_parent(Relation rel,
 	 */
 	_bt_relbuf(rel, rbuf);
 
-	if (pbuf == InvalidBuffer)
+	if (random() <= (MAX_RANDOM_VALUE / 100))
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg_internal("failed to re-find parent key in index \"%s\" for split pages %u/%u",

Now, 1% of all page splits will fail after the first phase finishes,
but before the second phase begins/finishes. The incomplete split flag
will be found set by other, future inserts, though, at which point the
split will be finished (unless we're unlucky again).
An INSERT that inserts many rows is bound to fail with this hack
applied, but the INSERT shouldn't leave the index in a state that looks
like corruption to amcheck. I can see false positive reports of
corruption like this, though. Consider the following sample session.
The session inserts data from one table into another table with an
equivalent schema. As you can see, I have to get my INSERT to fail
three times before I see the problem:

pg@tpce:5432 [13026]=# truncate holding2;
TRUNCATE TABLE
pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
ERROR:  failed to re-find parent key in index "pk_holding2" for split pages 83/158
pg@tpce:5432 [13026]=# select bt_index_parent_check('i_holding2',true, true);
DEBUG:  verifying level 1 (true root level)
DEBUG:  verifying 200 items on internal block 3
DEBUG:  verifying level 0 (leaf level)
DEBUG:  verifying 262 items on leaf block 1
DEBUG:  verifying 258 items on leaf block 2
*** SNIP ***
DEBUG:  verifying 123 items on leaf block 201
DEBUG:  verifying that tuples from index "i_holding2" are present in "holding2"
DEBUG:  finished verifying presence of 0 tuples from table "holding2" with bitset 4.83% set
┌───────────────────────┐
│ bt_index_parent_check │ <-- No false positive yet
├───────────────────────┤
│                       │
└───────────────────────┘
(1 row)

pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
DEBUG:  finishing incomplete split of 83/158
ERROR:  failed to re-find parent key in index "i_holding2" for split pages 435/436
pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
DEBUG:  verifying level 2 (true root level)
DEBUG:  verifying 2 items on internal block 303
*** SNIP ***
DEBUG:  verifying 74 items on leaf block 436
DEBUG:  verifying that tuples from index "i_holding2" are present in "holding2"
DEBUG:  finished verifying presence of 0 tuples from table "holding2" with bitset 9.97% set
┌───────────────────────┐
│ bt_index_parent_check │ <-- No false positive yet
├───────────────────────┤
│                       │
└───────────────────────┘
(1 row)

pg@tpce:5432 [13026]=# insert into holding2 select * from holding;
ERROR:  failed to re-find parent key in index "i_holding2" for split pages 331/577
pg@tpce:5432 [13026]=# select bt_index_parent_check(:'idxrelation',true, true);
DEBUG:  verifying level 2 (true root level)
DEBUG:  verifying 3 items on internal block 303
DEBUG:  verifying level 1
DEBUG:  verifying 151 items on internal block 3
DEBUG:  verifying 189 items on internal block 521
DEBUG:  verifying 233 items on internal block 302
WARNING:  577 rightsplit
ERROR:  leaf index block lacks downlink in index "i_holding2". <-- False positive "corruption"
DETAIL:  Block=127 page lsn=0/0.

Notice I added a custom WARNING here, and we get the false positive
error just after the point that the WARNING is seen. This WARNING is
triggered by the temporary instrumentation change that I added to
amcheck, and can be seen in some cases that don't have an accompanying
false positive report of corruption:

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index d6cb10e263..dcfee9e4fa 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1654,6 +1654,8 @@ bt_downlink_connectivity_check(BtreeCheckState *state,
 							 errmsg("circular link chain found in block %u of index \"%s\"",
 									blkno, RelationGetRelationName(state->rel))));
 
+		if (rightsplit)
+			elog(WARNING, "%u rightsplit", blkno);
 		if (!first && !P_IGNORE(opaque))
 		{
 			/* blkno probably has missing parent downlink */

*** END DIFF ***

Anyway, I didn't take the time to debug this today, but I suspect that
the problem is that the lowkey thing is kind of brittle. If you can't
figure it out, let me know and I'll help with it.

* Can we reverse the order here, so that the common case (i.e.
the offset_is_negative_infinity() case) comes first?:

> +		if (offset_is_negative_infinity(topaque, pivotkey_offset))
> +		{
> +			/*
> +			 * We're going to try match child high key to "negative
> +			 * infinity key". That means we should match to high key of
> +			 * left sibling of target page.
> +			 */
*** SNIP ***
> +		}
> +		else
> +		{
*** SNIP ***
> +		}

* Can the offset_is_negative_infinity() branch explain what's going on
at a much higher level than this?

--
Peter Geoghegan