Re: new heapcheck contrib module

Amul Sul Mon, 20 Jul 2020 23:52:00 -0700

On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sula...@gmail.com> wrote:
>
> Hi Mark,
>
> I think new structures should be listed in src/tools/pgindent/typedefs.list,
> otherwise, pgindent might disturb its indentation.
>
> Regards,
> Amul
>
>
> On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
> <mark.dil...@enterprisedb.com> wrote:
> >
> >
> >
> > > On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dil...@enterprisedb.com> 
> > > wrote:
> > >> The v10 patch without these ideas is here:
> > >
> > > Along the lines of what Alvaro was saying before, I think this
> > > definitely needs to be split up into a series of patches. The commit
> > > message for v10 describes it doing three pretty separate things, and I
> > > think that argues for splitting it into a series of three patches. I'd
> > > argue for this ordering:
> > >
> > > 0001 Refactoring existing amcheck btree checking functions to optionally
> > > return corruption information rather than ereport'ing it.  This is
> > > used by the new pg_amcheck command line tool for reporting back to
> > > the caller.
> > >
> > > 0002 Adding new function verify_heapam for checking a heap relation and
> > > associated toast relation, if any, to contrib/amcheck.
> > >
> > > 0003 Adding new contrib module pg_amcheck, which is a command line
> > > interface for running amcheck's verifications against tables and
> > > indexes.
> > >
> > > It's too hard to review things like this when it's all mixed together.
> >
> > The v11 patch series is broken up as you suggest.
> >
> > > +++ b/contrib/amcheck/t/skipping.pl
> > >
> > > The name of this file is inconsistent with the tree's usual
> > > convention, which is all stuff like 001_whatever.pl, except for
> > > src/test/modules/brin, which randomly decided to use two digits
> > > instead of three. There's no precedent for a test file with no leading
> > > numeric digits. Also, what does "skipping" even have to do with what
> > > the test is checking? Maybe it's intended to refer to the new error
> > > handling "skipping" the actual error in favor of just reporting it
> > > without stopping, but that's not really what the word "skipping"
> > > normally means. Finally, it seems a bit over-engineered: do we really
> > > need 183 test cases to check that detecting a problem doesn't lead to
> > > an abort? Like, if that's the purpose of the test, I'd expect it to
> > > check one corrupt relation and one non-corrupt relation, each with and
> > > without the no-error behavior. And that's about it. Or maybe it's
> > > talking about skipping pages during the checks, because those pages
> > > are all-visible or all-frozen? It's not very clear to me what's going
> > > on here.
> >
> > The "skipping" did originally refer to testing verify_heapam()'s option to 
> > skip all-visible or all-frozen blocks.  I have renamed it 
> > 001_verify_heapam.pl, since it tests that function.
> >
> > >
> > > + TransactionId nextKnownValidXid;
> > > + TransactionId oldestValidXid;
> > >
> > > Please add explanatory comments indicating what these are intended to
> > > mean.
> >
> > Done.
> >
> > > For most of the the structure members, the brief comments
> > > already present seem sufficient; but here, more explanation looks
> > > necessary and less is provided. The "Values for returning tuples"
> > > could possibly also use some more detail.
> >
> > Ok, I've expanded the comments for these.
> >
> > > +#define HEAPCHECK_RELATION_COLS 8
> > >
> > > I think this should really be at the top of the file someplace.
> > > Sometimes people have adopted this style when the #define is only used
> > > within the function that contains it, but that's not the case here.
> >
> > Done.
> >
> > >
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > + errmsg("unrecognized parameter for 'skip': %s", skip),
> > > + errhint("please choose from 'all visible', 'all frozen', "
> > > + "or NULL")));
> > >
> > > I think it would be better if we had three string values selecting the
> > > different behaviors, and made the parameter NOT NULL but with a
> > > default. It seems like that would be easier to understand. Right now,
> > > I can tell that my options for what to skip are "all visible", "all
> > > frozen", and, uh, some other thing that I don't know what it is. I'm
> > > gonna guess the third option is to skip nothing, but it seems best to
> > > make that explicit. Also, should we maybe consider spelling this
> > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> > > Spaces in an option value seems a little icky to me somehow.
> >
> > I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults 
> > to 'none'.  I did not mark the function as strict, as I think NULL is a 
> > reasonable value (and the default) for startblock and endblock.
> >
> > > + int64 startblock = -1;
> > > + int64 endblock = -1;
> > > ...
> > > + if (!PG_ARGISNULL(3))
> > > + startblock = PG_GETARG_INT64(3);
> > > + if (!PG_ARGISNULL(4))
> > > + endblock = PG_GETARG_INT64(4);
> > > ...
> > > + if (startblock < 0)
> > > + startblock = 0;
> > > + if (endblock < 0 || endblock > ctx.nblocks)
> > > + endblock = ctx.nblocks;
> > > +
> > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> > >
> > > So, the user can specify a negative value explicitly and it will be
> > > treated as the default, and an endblock value that's larger than the
> > > relation size will be treated as the relation size. The way pg_prewarm
> > > does the corresponding checks seems superior: null indicates the
> > > default value, and any non-null value must be within range or you get
> > > an error. Also, you seem to be treating endblock as the first block
> > > that should not be checked, whereas pg_prewarm takes what seems to me
> > > to be the more natural interpretation: the end block is the last block
> > > that IS checked. If you do it this way, then someone who specifies the
> > > same start and end block will check no blocks -- silently, I think.
> >
> > Under that regime, for relations with one block of data, (startblock=0, 
> > endblock=0) means "check the zero'th block", and for relations with no 
> > blocks of data, specifying any non-null (startblock,endblock) pair raises 
> > an exception.  I don't like that too much, but I'm happy to defer to 
> > precedent.  Since you say pg_prewarm works this way (I did not check), I 
> > have changed verify_heapam to do likewise.
> >
> > > +               if (skip_all_frozen || skip_all_visible)
> > >
> > > Since you can't skip all frozen without skipping all visible, this
> > > test could be simplified. Or you could introduce a three-valued enum
> > > and test that skip_pages != SKIP_PAGES_NONE, which might be even
> > > better.
> >
> > It works now with a three-valued enum.
> >
> > > + /* We must unlock the page from the prior iteration, if any */
> > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);
> > >
> > > I don't understand this assertion, and I don't understand the comment,
> > > either. I think ctx.blkno can never be equal to InvalidBlockNumber
> > > because we never set it to anything outside the range of 0..(endblock
> > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer
> > > because we just initialized it by calling ReadBufferExtended(). So I
> > > think this assertion would still pass if we wrote && rather than ||.
> > > But even then, I don't know what that has to do with the comment or
> > > why it even makes sense to have an assertion for that in the first
> > > place.
> >
> > Yes, it is vestigial.  Removed.
> >
> > > +       /*
> > > +        * Open the relation.  We use ShareUpdateExclusive to prevent 
> > > concurrent
> > > +        * vacuums from changing the relfrozenxid, relminmxid, or 
> > > advancing the
> > > +        * global oldestXid to be newer than those.  This protection
> > > saves us from
> > > +        * having to reacquire the locks and recheck those minimums for 
> > > every
> > > +        * tuple, which would be expensive.
> > > +        */
> > > +       ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> > >
> > > I don't think we'd need to recheck for every tuple, would we? Just for
> > > cases where there's an apparent violation of the rules.
> >
> > It's a bit fuzzy what an "apparent violation" might be if both ends of the 
> > range of valid xids may be moving, and arbitrarily much.  It's also not 
> > clear how often to recheck, since you'd be dealing with a race condition no 
> > matter how often you check.  Perhaps the comments shouldn't mention how 
> > often you'd have to recheck, since there is no really defensible choice for 
> > that.  I removed the offending sentence.
> >
> > > I guess that
> > > could still be expensive if there's a lot of them, but needing
> > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > > unfortunate.
> >
> > I welcome strategies that would allow for taking a lesser lock.
> >
> > > It's also unclear to me why this concerns itself with relfrozenxid and
> > > the cluster-wide oldestXid value but not with datfrozenxid. It seems
> > > like if we're going to sanity-check the relfrozenxid against the
> > > cluster-wide value, we ought to also check it against the
> > > database-wide value. Checking neither would also seem like a plausible
> > > choice. But it seems very strange to only check against the
> > > cluster-wide value.
> >
> > If the relation has a normal relfrozenxid, then the oldest valid xid we can 
> > encounter in the table is relfrozenxid.  Otherwise, each row needs to be 
> > compared against some other minimum xid value.
> >
> > Logically, that other minimum xid value should be the oldest valid xid for 
> > the database, which must logically be at least as old as any valid row in 
> > the table and no older than the oldest valid xid for the cluster.
> >
> > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be 
> > believed, and if I am reading them correctly, the stored value for the 
> > oldest valid xid in the database has been known to be corrupted by bugs in 
> > pg_upgrade.  This is awful.  If I compare the xid of a row in a table 
> > against the oldest xid value for the database, and the xid of the row is 
> > older, what can I do?  I don't have a principled basis for determining 
> > which one of them is wrong.
> >
> > The logic in verify_heapam is conservative; it makes no guarantees about 
> > finding and reporting all corruption, but if it does report a row as 
> > corrupt, you can bank on that, bugs in verify_heapam itself not 
> > withstanding.  I think this is a good choice; a tool with only false 
> > negatives is much more useful than one with both false positives and false 
> > negatives.
> >
> > I have added a comment about my reasoning to verify_heapam.c.  I'm happy to 
> > be convinced of a better strategy for handling this situation.
> >
> > >
> > > +               StaticAssertStmt(InvalidOffsetNumber + 1 == 
> > > FirstOffsetNumber,
> > > +                                                "InvalidOffsetNumber
> > > increments to FirstOffsetNumber");
> > >
> > > If you are going to rely on this property, I agree that it is good to
> > > check it. But it would be better to NOT rely on this property, and I
> > > suspect the code can be written quite cleanly without relying on it.
> > > And actually, that's what you did, because you first set ctx.offnum =
> > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > > the loop initializer. So AFAICS the first initializer, and the static
> > > assert, are pointless.
> >
> > Ah, right you are.  Removed.
> >
> > >
> > > +                       if (ItemIdIsRedirected(ctx.itemid))
> > > +                       {
> > > +                               uint16 redirect = 
> > > ItemIdGetRedirect(ctx.itemid);
> > > +                               if (redirect <= SizeOfPageHeaderData
> > > || redirect >= ph->pd_lower)
> > > ...
> > > +                               if ((redirect - SizeOfPageHeaderData)
> > > % sizeof(uint16))
> > >
> > > I think that ItemIdGetRedirect() returns an offset, not a byte
> > > position. So the expectation that I would have is that it would be any
> > > integer >= 0 and <= maxoff. Am I confused?
> >
> > I think you are right about it returning an offset, which should be between 
> > FirstOffsetNumber and maxoff, inclusive.  I have updated the checks.
> >
> > > BTW, it seems like it might
> > > be good to complain if the item to which it points is LP_UNUSED...
> > > AFAIK that shouldn't happen.
> >
> > Thanks for mentioning that.  It now checks for that.
> >
> > > +                                errmsg("\"%s\" is not a heap AM",
> > >
> > > I think the correct wording would be just "is not a heap." The "heap
> > > AM" is the thing in pg_am, not a specific table.
> >
> > Fixed.
> >
> > > +confess(HeapCheckContext * ctx, char *msg)
> > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> > >
> > > This is what happens when you pgindent without adding all the right
> > > things to typedefs.list first ... or when you don't pgindent and have
> > > odd ideas about how to indent things.
> >
> > Hmm.  I don't see the three lines of code you are quoting.  Which patch is 
> > that from?
> >
> > >
> > > +       /*
> > > +        * In principle, there is nothing to prevent a scan over a large, 
> > > highly
> > > +        * corrupted table from using workmem worth of memory building up 
> > > the
> > > +        * tuplestore.  Don't leak the msg argument memory.
> > > +        */
> > > +       pfree(msg);
> > >
> > > Maybe change the second sentence to something like: "That should be
> > > OK, else the user can lower work_mem, but we'd better not leak any
> > > additional memory."
> >
> > It may be a little wordy, but I went with
> >
> >     /*
> >      * In principle, there is nothing to prevent a scan over a large, highly
> >      * corrupted table from using workmem worth of memory building up the
> >      * tuplestore.  That's ok, but if we also leak the msg argument memory
> >      * until the end of the query, we could exceed workmem by more than a
> >      * trivial amount.  Therefore, free the msg argument each time we are
> >      * called rather than waiting for our current memory context to be 
> > freed.
> >      */
> >
> > > +/*
> > > + * check_tuphdr_xids
> > > + *
> > > + *     Determine whether tuples are visible for verification.  Similar to
> > > + *  HeapTupleSatisfiesVacuum, but with critical differences.
> > > + *
> > > + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> > > + *     to a table during a corruption check.
> > > + *  2) Only makes a boolean determination of whether verification should
> > > + *     see the tuple, rather than doing extra work for vacuum-related
> > > + *     categorization.
> > > + *
> > > + *  The caller should already have checked that xmin and xmax are not 
> > > out of
> > > + *  bounds for the relation.
> > > + */
> > >
> > > First, check_tuphdr_xids() doesn't seem like a very good name. If you
> > > have a function with that name and, like this one, it returns Boolean,
> > > what does true mean? What does false mean? Kinda hard to tell. And
> > > also, check the tuple header XIDs *for what*? If you called it, say,
> > > tuple_is_visible(), that would be self-evident.
> >
> > Changed.
> >
> > > Second, consider that we hold at least AccessShareLock on the relation
> > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> > > cannot be a concurrent modification to the tuple descriptor in
> > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> > > potentially using a non-current schema. If the tuple is
> > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> > > inserting transaction, or that transaction committed before we got our
> > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> > > before we got our lock. Or if it's both inserted and deleted in the
> > > same transaction, say, then that transaction committed before we got
> > > our lock or else contains no relevant DDL. IOW, I think you can check
> > > everything but dead tuples here.
> >
> > Ok, I have changed tuple_is_visible to return true rather than false for 
> > those other cases.
> >
> > > Capitalization and punctuation for messages complaining about problems
> > > need to be consistent. verify_heapam() has "Invalid redirect line
> > > pointer offset %u out of bounds" which starts with a capital letter,
> > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> > > case, but in any event it should be the same.
> >
> > I standardized on all lowercase text, though I left embedded symbols and 
> > constants such as LOCKED_ONLY alone.
> >
> > > Also,
> > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> > > debugging leftover or a very unclear complaint.
> >
> > Right.  That has been changed to "old-style VACUUM FULL transaction ID %u 
> > is invalid in this relation".
> >
> > > I think some real work
> > > needs to be put into the phrasing of these messages so that it's more
> > > clear exactly what is going on and why it's bad. For example the first
> > > example in this paragraph is clearly a problem of some kind, but it's
> > > not very clear exactly what is happening: is %u the offset of the
> > > invalid line redirect or the value to which it points? I don't think
> > > the phrasing is very grammatical, which makes it hard to tell which is
> > > meant, and I actually think it would be a good idea to include both
> > > things.
> >
> > Beware that every row returned from amcheck has more fields than just the 
> > error message.
> >
> >     blkno OUT bigint,
> >     offnum OUT integer,
> >     lp_off OUT smallint,
> >     lp_flags OUT smallint,
> >     lp_len OUT smallint,
> >     attnum OUT integer,
> >     chunk OUT integer,
> >     msg OUT text
> >
> > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or 
> > chunk in the message, it would be better to remove these things from 
> > messages that include them.  For the specific message under consideration, 
> > I've converted the text to "line pointer redirection to item at offset 
> > number %u is outside valid bounds %u .. %u".  That avoids duplicating the 
> > offset information of the referring item, while reporting to offset of the 
> > referred item.
> >
> > > Project policy is generally against splitting a string across multiple
> > > lines to fit within 80 characters. We like to fit within 80
> > > characters, but we like to be able to grep for strings more, and
> > > breaking them up like this makes that harder.
> >
> > Thanks for clarifying the project policy.  I joined these message strings 
> > back together.


In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to
be joined.

e.g:

+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot "
+ "accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("materialize mode required, but it is not allowed "
+ "in this context")));

> >
> > > +               confess(ctx,
> > > +                               pstrdup("corrupt toast chunk va_header"));
> > >
> > > This is another message that I don't think is very clear. There's two
> > > elements to that. One is that the phrasing is not very good, and the
> > > other is that there are no % escapes
> >
> > Changed to "corrupt extended toast chunk with sequence number %d has 
> > invalid varlena header %0x".  I think all the other information about where 
> > the corruption was found is already present in the other returned columns.
> >
> > > What's somebody going to do when
> > > they see this message? First, they're probably going to have to look
> > > at the code to figure out in which circumstances it gets generated;
> > > that's a sign that the message isn't phrased clearly enough. That will
> > > tell them that an unexpected bit pattern has been found, but not what
> > > that unexpected bit pattern actually was. So then, they're going to
> > > have to try to find the relevant va_header by some other means and
> > > fish out the relevant bit so that they can see what actually went
> > > wrong.
> >
> > Right.
> >
> > >
> > > + *   Checks the current attribute as tracked in ctx for corruption.  
> > > Records
> > > + *   any corruption found in ctx->corruption.
> > > + *
> > > + *
> > >
> > > Extra blank line.
> >
> > Fixed.
> >
> > > +       Form_pg_attribute thisatt = 
> > > TupleDescAttr(RelationGetDescr(ctx->rel),
> > > +
> > >                   ctx->attnum);
> > >
> > > Maybe you could avoid the line wrap by declaring this without
> > > initializing it, and then initializing it as a separate statement.
> >
> > Yes, I like that better.  I did not need to do the same with infomask, but 
> > it looks better to me to break the declaration and initialization for both, 
> > so I did that.
> >
> > >
> > > +               confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u 
> > > > %u)",
> > > +
> > > ctx->tuphdr->t_hoff, ctx->offset,
> > > +                                                         ctx->lp_len));
> > >
> > > Uggh! This isn't even remotely an English sentence. I don't think
> > > formulas are the way to go here, but I like the idea of formulas in
> > > some places and written-out messages in others even less. I guess the
> > > complaint here in English is something like "tuple attribute %d should
> > > start at offset %u, but tuple length is only %u" or something of that
> > > sort. Also, it seems like this complaint really ought to have been
> > > reported on the *preceding* loop iteration, either complaining that
> > > (1) the fixed length attribute is more than the number of remaining
> > > bytes in the tuple or (2) the varlena header for the tuple specifies
> > > an excessively high length. It seems like you're blaming the wrong
> > > attribute for the problem.
> >
> > Yeah, and it wouldn't complain if the final attribute of a tuple was 
> > overlong, as there wouldn't be a next attribute to blame it on.  I've 
> > changed it to report as you suggest, although it also still complains if 
> > the first attribute starts outside the bounds of the tuple.  The two error 
> > messages now read as "tuple attribute should start at offset %u, but tuple 
> > length is only %u" and "tuple attribute of length %u ends at offset %u, but 
> > tuple length is only %u".
> >
> > > BTW, the header comments for this function (check_tuple_attribute)
> > > neglect to document the meaning of the return value.
> >
> > Fixed.
> >
> > > +                       confess(ctx, psprintf("tuple xmax = %u
> > > precedes relation "
> > > +
> > > "relfrozenxid = %u",
> > >
> > > This is another example of these messages needing  work. The
> > > corresponding message from heap_prepare_freeze_tuple() is "found
> > > update xid %u from before relfrozenxid %u". That's better, because we
> > > don't normally include equals signs in our messages like this, and
> > > also because "relation relfrozenxid" is redundant. I think this should
> > > say something like "tuple xmax %u precedes relfrozenxid %u".
> > >
> > > +                       confess(ctx, psprintf("tuple xmax = %u is in
> > > the future",
> > > +                                                                 xmax));
> > >
> > > And then this could be something like "tuple xmax %u follows
> > > last-assigned xid %u". That would be more symmetric and more
> > > informative.
> >
> > Both of these have been changed.
> >
> > > +               if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
> > > ctx->tuphdr->t_hoff)
> > >
> > > I think we should be able to predict the exact value of t_hoff and
> > > complain if it isn't precisely equal to the expected value. Or is that
> > > not possible for some reason?
> >
> > That is possible, and I've updated the error message to match.  There are 
> > cases where you can't know if the HEAP_HASNULL bit is wrong or if the 
> > t_hoff value is wrong, but I've changed the code to just compute the length 
> > based on the HEAP_HASNULL setting and use that as the expected value, and 
> > complain when the actual value does not match the expected.  That sidesteps 
> > the problem of not knowing exactly which value to blame.
> >
> > > Is there some place that's checking that lp_len >=
> > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
> > > the header? If not, there should be.
> >
> > Good catch.  check_tuple() now does that before reading the header.
> >
> > > +$node->command_ok(
> > >
> > > +       [
> > > +               'pg_amcheck', '-p', $port, 'postgres'
> > > +       ],
> > > +       'pg_amcheck all schemas and tables implicitly');
> > > +
> > > +$node->command_ok(
> > > +       [
> > > +               'pg_amcheck', '-i', '-p', $port, 'postgres'
> > > +       ],
> > > +       'pg_amcheck all schemas, tables and indexes');
> > >
> > > I haven't really looked through the btree-checking and pg_amcheck
> > > parts of this much yet, but this caught my eye. Why would the default
> > > be to check tables but not indexes? I think the default ought to be to
> > > check everything we know how to check.
> >
> > I have changed the default to match your expectations.
> >
> >
> >
> > —
> > Mark Dilger
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >
> >
> >

Re: new heapcheck contrib module

Reply via email to