On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote: > The v10 patch without these ideas is here:
Along the lines of what Alvaro was saying before, I think this definitely needs to be split up into a series of patches. The commit message for v10 describes it doing three pretty separate things, and I think that argues for splitting it into a series of three patches. I'd argue for this ordering: 0001 Refactoring existing amcheck btree checking functions to optionally return corruption information rather than ereport'ing it. This is used by the new pg_amcheck command line tool for reporting back to the caller. 0002 Adding new function verify_heapam for checking a heap relation and associated toast relation, if any, to contrib/amcheck. 0003 Adding new contrib module pg_amcheck, which is a command line interface for running amcheck's verifications against tables and indexes. It's too hard to review things like this when it's all mixed together. +++ b/contrib/amcheck/t/skipping.pl The name of this file is inconsistent with the tree's usual convention, which is all stuff like 001_whatever.pl, except for src/test/modules/brin, which randomly decided to use two digits instead of three. There's no precedent for a test file with no leading numeric digits. Also, what does "skipping" even have to do with what the test is checking? Maybe it's intended to refer to the new error handling "skipping" the actual error in favor of just reporting it without stopping, but that's not really what the word "skipping" normally means. Finally, it seems a bit over-engineered: do we really need 183 test cases to check that detecting a problem doesn't lead to an abort? Like, if that's the purpose of the test, I'd expect it to check one corrupt relation and one non-corrupt relation, each with and without the no-error behavior. And that's about it. Or maybe it's talking about skipping pages during the checks, because those pages are all-visible or all-frozen? It's not very clear to me what's going on here. + TransactionId nextKnownValidXid; + TransactionId oldestValidXid; Please add explanatory comments indicating what these are intended to mean. For most of the the structure members, the brief comments already present seem sufficient; but here, more explanation looks necessary and less is provided. The "Values for returning tuples" could possibly also use some more detail. +#define HEAPCHECK_RELATION_COLS 8 I think this should really be at the top of the file someplace. Sometimes people have adopted this style when the #define is only used within the function that contains it, but that's not the case here. + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unrecognized parameter for 'skip': %s", skip), + errhint("please choose from 'all visible', 'all frozen', " + "or NULL"))); I think it would be better if we had three string values selecting the different behaviors, and made the parameter NOT NULL but with a default. It seems like that would be easier to understand. Right now, I can tell that my options for what to skip are "all visible", "all frozen", and, uh, some other thing that I don't know what it is. I'm gonna guess the third option is to skip nothing, but it seems best to make that explicit. Also, should we maybe consider spelling this 'all-visible' and 'all-frozen' with dashes, instead of using spaces? Spaces in an option value seems a little icky to me somehow. + int64 startblock = -1; + int64 endblock = -1; ... + if (!PG_ARGISNULL(3)) + startblock = PG_GETARG_INT64(3); + if (!PG_ARGISNULL(4)) + endblock = PG_GETARG_INT64(4); ... + if (startblock < 0) + startblock = 0; + if (endblock < 0 || endblock > ctx.nblocks) + endblock = ctx.nblocks; + + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) So, the user can specify a negative value explicitly and it will be treated as the default, and an endblock value that's larger than the relation size will be treated as the relation size. The way pg_prewarm does the corresponding checks seems superior: null indicates the default value, and any non-null value must be within range or you get an error. Also, you seem to be treating endblock as the first block that should not be checked, whereas pg_prewarm takes what seems to me to be the more natural interpretation: the end block is the last block that IS checked. If you do it this way, then someone who specifies the same start and end block will check no blocks -- silently, I think. + if (skip_all_frozen || skip_all_visible) Since you can't skip all frozen without skipping all visible, this test could be simplified. Or you could introduce a three-valued enum and test that skip_pages != SKIP_PAGES_NONE, which might be even better. + /* We must unlock the page from the prior iteration, if any */ + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); I don't understand this assertion, and I don't understand the comment, either. I think ctx.blkno can never be equal to InvalidBlockNumber because we never set it to anything outside the range of 0..(endblock - 1), and I think ctx.buffer must always be unequal to InvalidBuffer because we just initialized it by calling ReadBufferExtended(). So I think this assertion would still pass if we wrote && rather than ||. But even then, I don't know what that has to do with the comment or why it even makes sense to have an assertion for that in the first place. + /* + * Open the relation. We use ShareUpdateExclusive to prevent concurrent + * vacuums from changing the relfrozenxid, relminmxid, or advancing the + * global oldestXid to be newer than those. This protection saves us from + * having to reacquire the locks and recheck those minimums for every + * tuple, which would be expensive. + */ + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); I don't think we'd need to recheck for every tuple, would we? Just for cases where there's an apparent violation of the rules. I guess that could still be expensive if there's a lot of them, but needing ShareUpdateExclusiveLock rather than only AccessShareLock is a little unfortunate. It's also unclear to me why this concerns itself with relfrozenxid and the cluster-wide oldestXid value but not with datfrozenxid. It seems like if we're going to sanity-check the relfrozenxid against the cluster-wide value, we ought to also check it against the database-wide value. Checking neither would also seem like a plausible choice. But it seems very strange to only check against the cluster-wide value. + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, + "InvalidOffsetNumber increments to FirstOffsetNumber"); If you are going to rely on this property, I agree that it is good to check it. But it would be better to NOT rely on this property, and I suspect the code can be written quite cleanly without relying on it. And actually, that's what you did, because you first set ctx.offnum = InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in the loop initializer. So AFAICS the first initializer, and the static assert, are pointless. + if (ItemIdIsRedirected(ctx.itemid)) + { + uint16 redirect = ItemIdGetRedirect(ctx.itemid); + if (redirect <= SizeOfPageHeaderData || redirect >= ph->pd_lower) ... + if ((redirect - SizeOfPageHeaderData) % sizeof(uint16)) I think that ItemIdGetRedirect() returns an offset, not a byte position. So the expectation that I would have is that it would be any integer >= 0 and <= maxoff. Am I confused? BTW, it seems like it might be good to complain if the item to which it points is LP_UNUSED... AFAIK that shouldn't happen. + errmsg("\"%s\" is not a heap AM", I think the correct wording would be just "is not a heap." The "heap AM" is the thing in pg_am, not a specific table. +confess(HeapCheckContext * ctx, char *msg) +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) This is what happens when you pgindent without adding all the right things to typedefs.list first ... or when you don't pgindent and have odd ideas about how to indent things. + /* + * In principle, there is nothing to prevent a scan over a large, highly + * corrupted table from using workmem worth of memory building up the + * tuplestore. Don't leak the msg argument memory. + */ + pfree(msg); Maybe change the second sentence to something like: "That should be OK, else the user can lower work_mem, but we'd better not leak any additional memory." +/* + * check_tuphdr_xids + * + * Determine whether tuples are visible for verification. Similar to + * HeapTupleSatisfiesVacuum, but with critical differences. + * + * 1) Does not touch hint bits. It seems imprudent to write hint bits + * to a table during a corruption check. + * 2) Only makes a boolean determination of whether verification should + * see the tuple, rather than doing extra work for vacuum-related + * categorization. + * + * The caller should already have checked that xmin and xmax are not out of + * bounds for the relation. + */ First, check_tuphdr_xids() doesn't seem like a very good name. If you have a function with that name and, like this one, it returns Boolean, what does true mean? What does false mean? Kinda hard to tell. And also, check the tuple header XIDs *for what*? If you called it, say, tuple_is_visible(), that would be self-evident. Second, consider that we hold at least AccessShareLock on the relation - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there cannot be a concurrent modification to the tuple descriptor in progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is potentially using a non-current schema. If the tuple is HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the inserting transaction, or that transaction committed before we got our lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed before we got our lock. Or if it's both inserted and deleted in the same transaction, say, then that transaction committed before we got our lock or else contains no relevant DDL. IOW, I think you can check everything but dead tuples here. Capitalization and punctuation for messages complaining about problems need to be consistent. verify_heapam() has "Invalid redirect line pointer offset %u out of bounds" which starts with a capital letter, but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower case, but in any event it should be the same. Also, check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a debugging leftover or a very unclear complaint. I think some real work needs to be put into the phrasing of these messages so that it's more clear exactly what is going on and why it's bad. For example the first example in this paragraph is clearly a problem of some kind, but it's not very clear exactly what is happening: is %u the offset of the invalid line redirect or the value to which it points? I don't think the phrasing is very grammatical, which makes it hard to tell which is meant, and I actually think it would be a good idea to include both things. Project policy is generally against splitting a string across multiple lines to fit within 80 characters. We like to fit within 80 characters, but we like to be able to grep for strings more, and breaking them up like this makes that harder. + confess(ctx, + pstrdup("corrupt toast chunk va_header")); This is another message that I don't think is very clear. There's two elements to that. One is that the phrasing is not very good, and the other is that there are no % escapes. What's somebody going to do when they see this message? First, they're probably going to have to look at the code to figure out in which circumstances it gets generated; that's a sign that the message isn't phrased clearly enough. That will tell them that an unexpected bit pattern has been found, but not what that unexpected bit pattern actually was. So then, they're going to have to try to find the relevant va_header by some other means and fish out the relevant bit so that they can see what actually went wrong. + * Checks the current attribute as tracked in ctx for corruption. Records + * any corruption found in ctx->corruption. + * + * Extra blank line. + Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), + ctx->attnum); Maybe you could avoid the line wrap by declaring this without initializing it, and then initializing it as a separate statement. + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)", + ctx->tuphdr->t_hoff, ctx->offset, + ctx->lp_len)); Uggh! This isn't even remotely an English sentence. I don't think formulas are the way to go here, but I like the idea of formulas in some places and written-out messages in others even less. I guess the complaint here in English is something like "tuple attribute %d should start at offset %u, but tuple length is only %u" or something of that sort. Also, it seems like this complaint really ought to have been reported on the *preceding* loop iteration, either complaining that (1) the fixed length attribute is more than the number of remaining bytes in the tuple or (2) the varlena header for the tuple specifies an excessively high length. It seems like you're blaming the wrong attribute for the problem. BTW, the header comments for this function (check_tuple_attribute) neglect to document the meaning of the return value. + confess(ctx, psprintf("tuple xmax = %u precedes relation " + "relfrozenxid = %u", This is another example of these messages needing work. The corresponding message from heap_prepare_freeze_tuple() is "found update xid %u from before relfrozenxid %u". That's better, because we don't normally include equals signs in our messages like this, and also because "relation relfrozenxid" is redundant. I think this should say something like "tuple xmax %u precedes relfrozenxid %u". + confess(ctx, psprintf("tuple xmax = %u is in the future", + xmax)); And then this could be something like "tuple xmax %u follows last-assigned xid %u". That would be more symmetric and more informative. + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > ctx->tuphdr->t_hoff) I think we should be able to predict the exact value of t_hoff and complain if it isn't precisely equal to the expected value. Or is that not possible for some reason? Is there some place that's checking that lp_len >= SizeOfHeapTupleHeader before check_tuple() goes and starts poking into the header? If not, there should be. +$node->command_ok( + [ + 'pg_amcheck', '-p', $port, 'postgres' + ], + 'pg_amcheck all schemas and tables implicitly'); + +$node->command_ok( + [ + 'pg_amcheck', '-i', '-p', $port, 'postgres' + ], + 'pg_amcheck all schemas, tables and indexes'); I haven't really looked through the btree-checking and pg_amcheck parts of this much yet, but this caught my eye. Why would the default be to check tables but not indexes? I think the default ought to be to check everything we know how to check. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company