Re: amcheck's verify_heapam(), and HOT chain verification

Peter Geoghegan Sat, 06 Nov 2021 13:03:58 -0700

On Fri, Nov 5, 2021 at 8:18 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote:
> Thanks, Peter, for this analysis.

It's strategically important work. What you've done so far has every
chance of catching corruption involving storage issues, which is
great. But we ought to do more to catch certain known general patterns
of corruption. The so-called "freeze the dead" bug (see commit
9c2f0a6c for the final fix) is the single weirdest and scariest bug
that I can recall (there were a couple of incorrect bug fixes that had
to be reverted), and it is very much the kind of thing that I'd like
to see verify_heapam.c handle exhaustively. That would de-risk a lot
of things. Note that the same variety of HOT chain corruption bugs
besides that one, so it's really a general class of corruption, not
one specific bug.

The heapallindexed verification option actually detected the "freeze
the dead" bug [1], although that was a total accident -- that checking
option happens to use CREATE INDEX infrastructure, which happens to
sanitize HOT chains by calling heap_get_root_tuples() and detecting
contradictions -- heap_get_root_tuples() is code that is right next to
the pruning code I mentioned. That level of detection is certainly
better than nothing, but it's still not good enough -- we need this in
verify_heapam, too.

Here's why I think that we need very thorough HOT chain verification
that lives directly in verify_heapam:

First of all, the heapallindexed option is rather expensive, whereas
the additional overhead for verify_heapam would be minimal -- you
probably wouldn't even need to make it optional. Second of all, why
should you need to use bt_index_check() to detect what is after all
heap corruption? And third of all, the approach of detecting
corruption by outsourcing detection to heapam_index_build_range_scan()
(which calls heap_get_root_tuples() itself) probably risks missing
corrupt HOT chains where the corruption doesn't look a certain way --
it was not designed with amcheck in mind.

heapam_index_build_range_scan() + heap_get_root_tuples() will notice a
heap-only tuple without a parent, which is a good start. But what
about an LP_REDIRECT item whose lp_off (i.e. page offset number
redirect) somehow points to any item on the page other than a valid
heap-only tuple? Also doesn't catch contradictory commit states for
heap-only tuples that form a hot chain through their t_ctid links. I'm
sure that there are other things that I haven't thought of, too --
there is a lot going on here.

This HOT chain verification business is a little like the checking
that we do in verify_nbtree.c, actually. The individual tuples (in
this case heap tuples) may well not be corrupt (or demonstrably
corrupt) in isolation. But taken together, in a particular page offset
number sequence order, they can nevertheless *imply* corruption --
it's implied if the tuples tell a contradictory story about what's
going on at the level of the whole page. A HOT chain LP_REDIRECT is
after all a little like a "mini index" stored on a heap page.
Sequential scans don't care about LP_REDIRECTs at all. But index scans
expect a great deal from LP_REDIRECTs.

[1]
https://www.postgresql.org/message-id/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com
--
Peter Geoghegan

Re: amcheck's verify_heapam(), and HOT chain verification

Reply via email to