Re: new heapcheck contrib module

Amul Sul Mon, 20 Jul 2020 22:29:43 -0700

Hi Mark,

I think new structures should be listed in src/tools/pgindent/typedefs.list,
otherwise, pgindent might disturb its indentation.


Regards,
Amul


On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
<[email protected]> wrote:
>
>
>
> > On Jul 16, 2020, at 12:38 PM, Robert Haas <[email protected]> wrote:
> >
> > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <[email protected]> 
> > wrote:
> >> The v10 patch without these ideas is here:
> >
> > Along the lines of what Alvaro was saying before, I think this
> > definitely needs to be split up into a series of patches. The commit
> > message for v10 describes it doing three pretty separate things, and I
> > think that argues for splitting it into a series of three patches. I'd
> > argue for this ordering:
> >
> > 0001 Refactoring existing amcheck btree checking functions to optionally
> > return corruption information rather than ereport'ing it.  This is
> > used by the new pg_amcheck command line tool for reporting back to
> > the caller.
> >
> > 0002 Adding new function verify_heapam for checking a heap relation and
> > associated toast relation, if any, to contrib/amcheck.
> >
> > 0003 Adding new contrib module pg_amcheck, which is a command line
> > interface for running amcheck's verifications against tables and
> > indexes.
> >
> > It's too hard to review things like this when it's all mixed together.
>
> The v11 patch series is broken up as you suggest.
>
> > +++ b/contrib/amcheck/t/skipping.pl
> >
> > The name of this file is inconsistent with the tree's usual
> > convention, which is all stuff like 001_whatever.pl, except for
> > src/test/modules/brin, which randomly decided to use two digits
> > instead of three. There's no precedent for a test file with no leading
> > numeric digits. Also, what does "skipping" even have to do with what
> > the test is checking? Maybe it's intended to refer to the new error
> > handling "skipping" the actual error in favor of just reporting it
> > without stopping, but that's not really what the word "skipping"
> > normally means. Finally, it seems a bit over-engineered: do we really
> > need 183 test cases to check that detecting a problem doesn't lead to
> > an abort? Like, if that's the purpose of the test, I'd expect it to
> > check one corrupt relation and one non-corrupt relation, each with and
> > without the no-error behavior. And that's about it. Or maybe it's
> > talking about skipping pages during the checks, because those pages
> > are all-visible or all-frozen? It's not very clear to me what's going
> > on here.
>
> The "skipping" did originally refer to testing verify_heapam()'s option to 
> skip all-visible or all-frozen blocks.  I have renamed it 
> 001_verify_heapam.pl, since it tests that function.
>
> >
> > + TransactionId nextKnownValidXid;
> > + TransactionId oldestValidXid;
> >
> > Please add explanatory comments indicating what these are intended to
> > mean.
>
> Done.
>
> > For most of the the structure members, the brief comments
> > already present seem sufficient; but here, more explanation looks
> > necessary and less is provided. The "Values for returning tuples"
> > could possibly also use some more detail.
>
> Ok, I've expanded the comments for these.
>
> > +#define HEAPCHECK_RELATION_COLS 8
> >
> > I think this should really be at the top of the file someplace.
> > Sometimes people have adopted this style when the #define is only used
> > within the function that contains it, but that's not the case here.
>
> Done.
>
> >
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("unrecognized parameter for 'skip': %s", skip),
> > + errhint("please choose from 'all visible', 'all frozen', "
> > + "or NULL")));
> >
> > I think it would be better if we had three string values selecting the
> > different behaviors, and made the parameter NOT NULL but with a
> > default. It seems like that would be easier to understand. Right now,
> > I can tell that my options for what to skip are "all visible", "all
> > frozen", and, uh, some other thing that I don't know what it is. I'm
> > gonna guess the third option is to skip nothing, but it seems best to
> > make that explicit. Also, should we maybe consider spelling this
> > 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> > Spaces in an option value seems a little icky to me somehow.
>
> I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults 
> to 'none'.  I did not mark the function as strict, as I think NULL is a 
> reasonable value (and the default) for startblock and endblock.
>
> > + int64 startblock = -1;
> > + int64 endblock = -1;
> > ...
> > + if (!PG_ARGISNULL(3))
> > + startblock = PG_GETARG_INT64(3);
> > + if (!PG_ARGISNULL(4))
> > + endblock = PG_GETARG_INT64(4);
> > ...
> > + if (startblock < 0)
> > + startblock = 0;
> > + if (endblock < 0 || endblock > ctx.nblocks)
> > + endblock = ctx.nblocks;
> > +
> > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> >
> > So, the user can specify a negative value explicitly and it will be
> > treated as the default, and an endblock value that's larger than the
> > relation size will be treated as the relation size. The way pg_prewarm
> > does the corresponding checks seems superior: null indicates the
> > default value, and any non-null value must be within range or you get
> > an error. Also, you seem to be treating endblock as the first block
> > that should not be checked, whereas pg_prewarm takes what seems to me
> > to be the more natural interpretation: the end block is the last block
> > that IS checked. If you do it this way, then someone who specifies the
> > same start and end block will check no blocks -- silently, I think.
>
> Under that regime, for relations with one block of data, (startblock=0, 
> endblock=0) means "check the zero'th block", and for relations with no blocks 
> of data, specifying any non-null (startblock,endblock) pair raises an 
> exception.  I don't like that too much, but I'm happy to defer to precedent.  
> Since you say pg_prewarm works this way (I did not check), I have changed 
> verify_heapam to do likewise.
>
> > +               if (skip_all_frozen || skip_all_visible)
> >
> > Since you can't skip all frozen without skipping all visible, this
> > test could be simplified. Or you could introduce a three-valued enum
> > and test that skip_pages != SKIP_PAGES_NONE, which might be even
> > better.
>
> It works now with a three-valued enum.
>
> > + /* We must unlock the page from the prior iteration, if any */
> > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);
> >
> > I don't understand this assertion, and I don't understand the comment,
> > either. I think ctx.blkno can never be equal to InvalidBlockNumber
> > because we never set it to anything outside the range of 0..(endblock
> > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer
> > because we just initialized it by calling ReadBufferExtended(). So I
> > think this assertion would still pass if we wrote && rather than ||.
> > But even then, I don't know what that has to do with the comment or
> > why it even makes sense to have an assertion for that in the first
> > place.
>
> Yes, it is vestigial.  Removed.
>
> > +       /*
> > +        * Open the relation.  We use ShareUpdateExclusive to prevent 
> > concurrent
> > +        * vacuums from changing the relfrozenxid, relminmxid, or advancing 
> > the
> > +        * global oldestXid to be newer than those.  This protection
> > saves us from
> > +        * having to reacquire the locks and recheck those minimums for 
> > every
> > +        * tuple, which would be expensive.
> > +        */
> > +       ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> >
> > I don't think we'd need to recheck for every tuple, would we? Just for
> > cases where there's an apparent violation of the rules.
>
> It's a bit fuzzy what an "apparent violation" might be if both ends of the 
> range of valid xids may be moving, and arbitrarily much.  It's also not clear 
> how often to recheck, since you'd be dealing with a race condition no matter 
> how often you check.  Perhaps the comments shouldn't mention how often you'd 
> have to recheck, since there is no really defensible choice for that.  I 
> removed the offending sentence.
>
> > I guess that
> > could still be expensive if there's a lot of them, but needing
> > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > unfortunate.
>
> I welcome strategies that would allow for taking a lesser lock.
>
> > It's also unclear to me why this concerns itself with relfrozenxid and
> > the cluster-wide oldestXid value but not with datfrozenxid. It seems
> > like if we're going to sanity-check the relfrozenxid against the
> > cluster-wide value, we ought to also check it against the
> > database-wide value. Checking neither would also seem like a plausible
> > choice. But it seems very strange to only check against the
> > cluster-wide value.
>
> If the relation has a normal relfrozenxid, then the oldest valid xid we can 
> encounter in the table is relfrozenxid.  Otherwise, each row needs to be 
> compared against some other minimum xid value.
>
> Logically, that other minimum xid value should be the oldest valid xid for 
> the database, which must logically be at least as old as any valid row in the 
> table and no older than the oldest valid xid for the cluster.
>
> Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be 
> believed, and if I am reading them correctly, the stored value for the oldest 
> valid xid in the database has been known to be corrupted by bugs in 
> pg_upgrade.  This is awful.  If I compare the xid of a row in a table against 
> the oldest xid value for the database, and the xid of the row is older, what 
> can I do?  I don't have a principled basis for determining which one of them 
> is wrong.
>
> The logic in verify_heapam is conservative; it makes no guarantees about 
> finding and reporting all corruption, but if it does report a row as corrupt, 
> you can bank on that, bugs in verify_heapam itself not withstanding.  I think 
> this is a good choice; a tool with only false negatives is much more useful 
> than one with both false positives and false negatives.
>
> I have added a comment about my reasoning to verify_heapam.c.  I'm happy to 
> be convinced of a better strategy for handling this situation.
>
> >
> > +               StaticAssertStmt(InvalidOffsetNumber + 1 == 
> > FirstOffsetNumber,
> > +                                                "InvalidOffsetNumber
> > increments to FirstOffsetNumber");
> >
> > If you are going to rely on this property, I agree that it is good to
> > check it. But it would be better to NOT rely on this property, and I
> > suspect the code can be written quite cleanly without relying on it.
> > And actually, that's what you did, because you first set ctx.offnum =
> > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > the loop initializer. So AFAICS the first initializer, and the static
> > assert, are pointless.
>
> Ah, right you are.  Removed.
>
> >
> > +                       if (ItemIdIsRedirected(ctx.itemid))
> > +                       {
> > +                               uint16 redirect = 
> > ItemIdGetRedirect(ctx.itemid);
> > +                               if (redirect <= SizeOfPageHeaderData
> > || redirect >= ph->pd_lower)
> > ...
> > +                               if ((redirect - SizeOfPageHeaderData)
> > % sizeof(uint16))
> >
> > I think that ItemIdGetRedirect() returns an offset, not a byte
> > position. So the expectation that I would have is that it would be any
> > integer >= 0 and <= maxoff. Am I confused?
>
> I think you are right about it returning an offset, which should be between 
> FirstOffsetNumber and maxoff, inclusive.  I have updated the checks.
>
> > BTW, it seems like it might
> > be good to complain if the item to which it points is LP_UNUSED...
> > AFAIK that shouldn't happen.
>
> Thanks for mentioning that.  It now checks for that.
>
> > +                                errmsg("\"%s\" is not a heap AM",
> >
> > I think the correct wording would be just "is not a heap." The "heap
> > AM" is the thing in pg_am, not a specific table.
>
> Fixed.
>
> > +confess(HeapCheckContext * ctx, char *msg)
> > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> >
> > This is what happens when you pgindent without adding all the right
> > things to typedefs.list first ... or when you don't pgindent and have
> > odd ideas about how to indent things.
>
> Hmm.  I don't see the three lines of code you are quoting.  Which patch is 
> that from?
>
> >
> > +       /*
> > +        * In principle, there is nothing to prevent a scan over a large, 
> > highly
> > +        * corrupted table from using workmem worth of memory building up 
> > the
> > +        * tuplestore.  Don't leak the msg argument memory.
> > +        */
> > +       pfree(msg);
> >
> > Maybe change the second sentence to something like: "That should be
> > OK, else the user can lower work_mem, but we'd better not leak any
> > additional memory."
>
> It may be a little wordy, but I went with
>
>     /*
>      * In principle, there is nothing to prevent a scan over a large, highly
>      * corrupted table from using workmem worth of memory building up the
>      * tuplestore.  That's ok, but if we also leak the msg argument memory
>      * until the end of the query, we could exceed workmem by more than a
>      * trivial amount.  Therefore, free the msg argument each time we are
>      * called rather than waiting for our current memory context to be freed.
>      */
>
> > +/*
> > + * check_tuphdr_xids
> > + *
> > + *     Determine whether tuples are visible for verification.  Similar to
> > + *  HeapTupleSatisfiesVacuum, but with critical differences.
> > + *
> > + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> > + *     to a table during a corruption check.
> > + *  2) Only makes a boolean determination of whether verification should
> > + *     see the tuple, rather than doing extra work for vacuum-related
> > + *     categorization.
> > + *
> > + *  The caller should already have checked that xmin and xmax are not out 
> > of
> > + *  bounds for the relation.
> > + */
> >
> > First, check_tuphdr_xids() doesn't seem like a very good name. If you
> > have a function with that name and, like this one, it returns Boolean,
> > what does true mean? What does false mean? Kinda hard to tell. And
> > also, check the tuple header XIDs *for what*? If you called it, say,
> > tuple_is_visible(), that would be self-evident.
>
> Changed.
>
> > Second, consider that we hold at least AccessShareLock on the relation
> > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> > cannot be a concurrent modification to the tuple descriptor in
> > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> > potentially using a non-current schema. If the tuple is
> > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> > inserting transaction, or that transaction committed before we got our
> > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> > before we got our lock. Or if it's both inserted and deleted in the
> > same transaction, say, then that transaction committed before we got
> > our lock or else contains no relevant DDL. IOW, I think you can check
> > everything but dead tuples here.
>
> Ok, I have changed tuple_is_visible to return true rather than false for 
> those other cases.
>
> > Capitalization and punctuation for messages complaining about problems
> > need to be consistent. verify_heapam() has "Invalid redirect line
> > pointer offset %u out of bounds" which starts with a capital letter,
> > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> > case, but in any event it should be the same.
>
> I standardized on all lowercase text, though I left embedded symbols and 
> constants such as LOCKED_ONLY alone.
>
> > Also,
> > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> > debugging leftover or a very unclear complaint.
>
> Right.  That has been changed to "old-style VACUUM FULL transaction ID %u is 
> invalid in this relation".
>
> > I think some real work
> > needs to be put into the phrasing of these messages so that it's more
> > clear exactly what is going on and why it's bad. For example the first
> > example in this paragraph is clearly a problem of some kind, but it's
> > not very clear exactly what is happening: is %u the offset of the
> > invalid line redirect or the value to which it points? I don't think
> > the phrasing is very grammatical, which makes it hard to tell which is
> > meant, and I actually think it would be a good idea to include both
> > things.
>
> Beware that every row returned from amcheck has more fields than just the 
> error message.
>
>     blkno OUT bigint,
>     offnum OUT integer,
>     lp_off OUT smallint,
>     lp_flags OUT smallint,
>     lp_len OUT smallint,
>     attnum OUT integer,
>     chunk OUT integer,
>     msg OUT text
>
> Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or 
> chunk in the message, it would be better to remove these things from messages 
> that include them.  For the specific message under consideration, I've 
> converted the text to "line pointer redirection to item at offset number %u 
> is outside valid bounds %u .. %u".  That avoids duplicating the offset 
> information of the referring item, while reporting to offset of the referred 
> item.
>
> > Project policy is generally against splitting a string across multiple
> > lines to fit within 80 characters. We like to fit within 80
> > characters, but we like to be able to grep for strings more, and
> > breaking them up like this makes that harder.
>
> Thanks for clarifying the project policy.  I joined these message strings 
> back together.
>
> > +               confess(ctx,
> > +                               pstrdup("corrupt toast chunk va_header"));
> >
> > This is another message that I don't think is very clear. There's two
> > elements to that. One is that the phrasing is not very good, and the
> > other is that there are no % escapes
>
> Changed to "corrupt extended toast chunk with sequence number %d has invalid 
> varlena header %0x".  I think all the other information about where the 
> corruption was found is already present in the other returned columns.
>
> > What's somebody going to do when
> > they see this message? First, they're probably going to have to look
> > at the code to figure out in which circumstances it gets generated;
> > that's a sign that the message isn't phrased clearly enough. That will
> > tell them that an unexpected bit pattern has been found, but not what
> > that unexpected bit pattern actually was. So then, they're going to
> > have to try to find the relevant va_header by some other means and
> > fish out the relevant bit so that they can see what actually went
> > wrong.
>
> Right.
>
> >
> > + *   Checks the current attribute as tracked in ctx for corruption.  
> > Records
> > + *   any corruption found in ctx->corruption.
> > + *
> > + *
> >
> > Extra blank line.
>
> Fixed.
>
> > +       Form_pg_attribute thisatt = 
> > TupleDescAttr(RelationGetDescr(ctx->rel),
> > +
> >                   ctx->attnum);
> >
> > Maybe you could avoid the line wrap by declaring this without
> > initializing it, and then initializing it as a separate statement.
>
> Yes, I like that better.  I did not need to do the same with infomask, but it 
> looks better to me to break the declaration and initialization for both, so I 
> did that.
>
> >
> > +               confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > 
> > %u)",
> > +
> > ctx->tuphdr->t_hoff, ctx->offset,
> > +                                                         ctx->lp_len));
> >
> > Uggh! This isn't even remotely an English sentence. I don't think
> > formulas are the way to go here, but I like the idea of formulas in
> > some places and written-out messages in others even less. I guess the
> > complaint here in English is something like "tuple attribute %d should
> > start at offset %u, but tuple length is only %u" or something of that
> > sort. Also, it seems like this complaint really ought to have been
> > reported on the *preceding* loop iteration, either complaining that
> > (1) the fixed length attribute is more than the number of remaining
> > bytes in the tuple or (2) the varlena header for the tuple specifies
> > an excessively high length. It seems like you're blaming the wrong
> > attribute for the problem.
>
> Yeah, and it wouldn't complain if the final attribute of a tuple was 
> overlong, as there wouldn't be a next attribute to blame it on.  I've changed 
> it to report as you suggest, although it also still complains if the first 
> attribute starts outside the bounds of the tuple.  The two error messages now 
> read as "tuple attribute should start at offset %u, but tuple length is only 
> %u" and "tuple attribute of length %u ends at offset %u, but tuple length is 
> only %u".
>
> > BTW, the header comments for this function (check_tuple_attribute)
> > neglect to document the meaning of the return value.
>
> Fixed.
>
> > +                       confess(ctx, psprintf("tuple xmax = %u
> > precedes relation "
> > +
> > "relfrozenxid = %u",
> >
> > This is another example of these messages needing  work. The
> > corresponding message from heap_prepare_freeze_tuple() is "found
> > update xid %u from before relfrozenxid %u". That's better, because we
> > don't normally include equals signs in our messages like this, and
> > also because "relation relfrozenxid" is redundant. I think this should
> > say something like "tuple xmax %u precedes relfrozenxid %u".
> >
> > +                       confess(ctx, psprintf("tuple xmax = %u is in
> > the future",
> > +                                                                 xmax));
> >
> > And then this could be something like "tuple xmax %u follows
> > last-assigned xid %u". That would be more symmetric and more
> > informative.
>
> Both of these have been changed.
>
> > +               if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
> > ctx->tuphdr->t_hoff)
> >
> > I think we should be able to predict the exact value of t_hoff and
> > complain if it isn't precisely equal to the expected value. Or is that
> > not possible for some reason?
>
> That is possible, and I've updated the error message to match.  There are 
> cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff 
> value is wrong, but I've changed the code to just compute the length based on 
> the HEAP_HASNULL setting and use that as the expected value, and complain 
> when the actual value does not match the expected.  That sidesteps the 
> problem of not knowing exactly which value to blame.
>
> > Is there some place that's checking that lp_len >=
> > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
> > the header? If not, there should be.
>
> Good catch.  check_tuple() now does that before reading the header.
>
> > +$node->command_ok(
> >
> > +       [
> > +               'pg_amcheck', '-p', $port, 'postgres'
> > +       ],
> > +       'pg_amcheck all schemas and tables implicitly');
> > +
> > +$node->command_ok(
> > +       [
> > +               'pg_amcheck', '-i', '-p', $port, 'postgres'
> > +       ],
> > +       'pg_amcheck all schemas, tables and indexes');
> >
> > I haven't really looked through the btree-checking and pg_amcheck
> > parts of this much yet, but this caught my eye. Why would the default
> > be to check tables but not indexes? I think the default ought to be to
> > check everything we know how to check.
>
> I have changed the default to match your expectations.
>
>
>
> —
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>

Re: new heapcheck contrib module

Reply via email to