Hi Mark, I think new structures should be listed in src/tools/pgindent/typedefs.list, otherwise, pgindent might disturb its indentation.
Regards, Amul On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger <mark.dil...@enterprisedb.com> wrote: > > > > > On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmh...@gmail.com> wrote: > > > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dil...@enterprisedb.com> > > wrote: > >> The v10 patch without these ideas is here: > > > > Along the lines of what Alvaro was saying before, I think this > > definitely needs to be split up into a series of patches. The commit > > message for v10 describes it doing three pretty separate things, and I > > think that argues for splitting it into a series of three patches. I'd > > argue for this ordering: > > > > 0001 Refactoring existing amcheck btree checking functions to optionally > > return corruption information rather than ereport'ing it. This is > > used by the new pg_amcheck command line tool for reporting back to > > the caller. > > > > 0002 Adding new function verify_heapam for checking a heap relation and > > associated toast relation, if any, to contrib/amcheck. > > > > 0003 Adding new contrib module pg_amcheck, which is a command line > > interface for running amcheck's verifications against tables and > > indexes. > > > > It's too hard to review things like this when it's all mixed together. > > The v11 patch series is broken up as you suggest. > > > +++ b/contrib/amcheck/t/skipping.pl > > > > The name of this file is inconsistent with the tree's usual > > convention, which is all stuff like 001_whatever.pl, except for > > src/test/modules/brin, which randomly decided to use two digits > > instead of three. There's no precedent for a test file with no leading > > numeric digits. Also, what does "skipping" even have to do with what > > the test is checking? Maybe it's intended to refer to the new error > > handling "skipping" the actual error in favor of just reporting it > > without stopping, but that's not really what the word "skipping" > > normally means. Finally, it seems a bit over-engineered: do we really > > need 183 test cases to check that detecting a problem doesn't lead to > > an abort? Like, if that's the purpose of the test, I'd expect it to > > check one corrupt relation and one non-corrupt relation, each with and > > without the no-error behavior. And that's about it. Or maybe it's > > talking about skipping pages during the checks, because those pages > > are all-visible or all-frozen? It's not very clear to me what's going > > on here. > > The "skipping" did originally refer to testing verify_heapam()'s option to > skip all-visible or all-frozen blocks. I have renamed it > 001_verify_heapam.pl, since it tests that function. > > > > > + TransactionId nextKnownValidXid; > > + TransactionId oldestValidXid; > > > > Please add explanatory comments indicating what these are intended to > > mean. > > Done. > > > For most of the the structure members, the brief comments > > already present seem sufficient; but here, more explanation looks > > necessary and less is provided. The "Values for returning tuples" > > could possibly also use some more detail. > > Ok, I've expanded the comments for these. > > > +#define HEAPCHECK_RELATION_COLS 8 > > > > I think this should really be at the top of the file someplace. > > Sometimes people have adopted this style when the #define is only used > > within the function that contains it, but that's not the case here. > > Done. > > > > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("unrecognized parameter for 'skip': %s", skip), > > + errhint("please choose from 'all visible', 'all frozen', " > > + "or NULL"))); > > > > I think it would be better if we had three string values selecting the > > different behaviors, and made the parameter NOT NULL but with a > > default. It seems like that would be easier to understand. Right now, > > I can tell that my options for what to skip are "all visible", "all > > frozen", and, uh, some other thing that I don't know what it is. I'm > > gonna guess the third option is to skip nothing, but it seems best to > > make that explicit. Also, should we maybe consider spelling this > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces? > > Spaces in an option value seems a little icky to me somehow. > > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults > to 'none'. I did not mark the function as strict, as I think NULL is a > reasonable value (and the default) for startblock and endblock. > > > + int64 startblock = -1; > > + int64 endblock = -1; > > ... > > + if (!PG_ARGISNULL(3)) > > + startblock = PG_GETARG_INT64(3); > > + if (!PG_ARGISNULL(4)) > > + endblock = PG_GETARG_INT64(4); > > ... > > + if (startblock < 0) > > + startblock = 0; > > + if (endblock < 0 || endblock > ctx.nblocks) > > + endblock = ctx.nblocks; > > + > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > > > > So, the user can specify a negative value explicitly and it will be > > treated as the default, and an endblock value that's larger than the > > relation size will be treated as the relation size. The way pg_prewarm > > does the corresponding checks seems superior: null indicates the > > default value, and any non-null value must be within range or you get > > an error. Also, you seem to be treating endblock as the first block > > that should not be checked, whereas pg_prewarm takes what seems to me > > to be the more natural interpretation: the end block is the last block > > that IS checked. If you do it this way, then someone who specifies the > > same start and end block will check no blocks -- silently, I think. > > Under that regime, for relations with one block of data, (startblock=0, > endblock=0) means "check the zero'th block", and for relations with no blocks > of data, specifying any non-null (startblock,endblock) pair raises an > exception. I don't like that too much, but I'm happy to defer to precedent. > Since you say pg_prewarm works this way (I did not check), I have changed > verify_heapam to do likewise. > > > + if (skip_all_frozen || skip_all_visible) > > > > Since you can't skip all frozen without skipping all visible, this > > test could be simplified. Or you could introduce a three-valued enum > > and test that skip_pages != SKIP_PAGES_NONE, which might be even > > better. > > It works now with a three-valued enum. > > > + /* We must unlock the page from the prior iteration, if any */ > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); > > > > I don't understand this assertion, and I don't understand the comment, > > either. I think ctx.blkno can never be equal to InvalidBlockNumber > > because we never set it to anything outside the range of 0..(endblock > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer > > because we just initialized it by calling ReadBufferExtended(). So I > > think this assertion would still pass if we wrote && rather than ||. > > But even then, I don't know what that has to do with the comment or > > why it even makes sense to have an assertion for that in the first > > place. > > Yes, it is vestigial. Removed. > > > + /* > > + * Open the relation. We use ShareUpdateExclusive to prevent > > concurrent > > + * vacuums from changing the relfrozenxid, relminmxid, or advancing > > the > > + * global oldestXid to be newer than those. This protection > > saves us from > > + * having to reacquire the locks and recheck those minimums for > > every > > + * tuple, which would be expensive. > > + */ > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > > > I don't think we'd need to recheck for every tuple, would we? Just for > > cases where there's an apparent violation of the rules. > > It's a bit fuzzy what an "apparent violation" might be if both ends of the > range of valid xids may be moving, and arbitrarily much. It's also not clear > how often to recheck, since you'd be dealing with a race condition no matter > how often you check. Perhaps the comments shouldn't mention how often you'd > have to recheck, since there is no really defensible choice for that. I > removed the offending sentence. > > > I guess that > > could still be expensive if there's a lot of them, but needing > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little > > unfortunate. > > I welcome strategies that would allow for taking a lesser lock. > > > It's also unclear to me why this concerns itself with relfrozenxid and > > the cluster-wide oldestXid value but not with datfrozenxid. It seems > > like if we're going to sanity-check the relfrozenxid against the > > cluster-wide value, we ought to also check it against the > > database-wide value. Checking neither would also seem like a plausible > > choice. But it seems very strange to only check against the > > cluster-wide value. > > If the relation has a normal relfrozenxid, then the oldest valid xid we can > encounter in the table is relfrozenxid. Otherwise, each row needs to be > compared against some other minimum xid value. > > Logically, that other minimum xid value should be the oldest valid xid for > the database, which must logically be at least as old as any valid row in the > table and no older than the oldest valid xid for the cluster. > > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be > believed, and if I am reading them correctly, the stored value for the oldest > valid xid in the database has been known to be corrupted by bugs in > pg_upgrade. This is awful. If I compare the xid of a row in a table against > the oldest xid value for the database, and the xid of the row is older, what > can I do? I don't have a principled basis for determining which one of them > is wrong. > > The logic in verify_heapam is conservative; it makes no guarantees about > finding and reporting all corruption, but if it does report a row as corrupt, > you can bank on that, bugs in verify_heapam itself not withstanding. I think > this is a good choice; a tool with only false negatives is much more useful > than one with both false positives and false negatives. > > I have added a comment about my reasoning to verify_heapam.c. I'm happy to > be convinced of a better strategy for handling this situation. > > > > > + StaticAssertStmt(InvalidOffsetNumber + 1 == > > FirstOffsetNumber, > > + "InvalidOffsetNumber > > increments to FirstOffsetNumber"); > > > > If you are going to rely on this property, I agree that it is good to > > check it. But it would be better to NOT rely on this property, and I > > suspect the code can be written quite cleanly without relying on it. > > And actually, that's what you did, because you first set ctx.offnum = > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in > > the loop initializer. So AFAICS the first initializer, and the static > > assert, are pointless. > > Ah, right you are. Removed. > > > > > + if (ItemIdIsRedirected(ctx.itemid)) > > + { > > + uint16 redirect = > > ItemIdGetRedirect(ctx.itemid); > > + if (redirect <= SizeOfPageHeaderData > > || redirect >= ph->pd_lower) > > ... > > + if ((redirect - SizeOfPageHeaderData) > > % sizeof(uint16)) > > > > I think that ItemIdGetRedirect() returns an offset, not a byte > > position. So the expectation that I would have is that it would be any > > integer >= 0 and <= maxoff. Am I confused? > > I think you are right about it returning an offset, which should be between > FirstOffsetNumber and maxoff, inclusive. I have updated the checks. > > > BTW, it seems like it might > > be good to complain if the item to which it points is LP_UNUSED... > > AFAIK that shouldn't happen. > > Thanks for mentioning that. It now checks for that. > > > + errmsg("\"%s\" is not a heap AM", > > > > I think the correct wording would be just "is not a heap." The "heap > > AM" is the thing in pg_am, not a specific table. > > Fixed. > > > +confess(HeapCheckContext * ctx, char *msg) > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > > > > This is what happens when you pgindent without adding all the right > > things to typedefs.list first ... or when you don't pgindent and have > > odd ideas about how to indent things. > > Hmm. I don't see the three lines of code you are quoting. Which patch is > that from? > > > > > + /* > > + * In principle, there is nothing to prevent a scan over a large, > > highly > > + * corrupted table from using workmem worth of memory building up > > the > > + * tuplestore. Don't leak the msg argument memory. > > + */ > > + pfree(msg); > > > > Maybe change the second sentence to something like: "That should be > > OK, else the user can lower work_mem, but we'd better not leak any > > additional memory." > > It may be a little wordy, but I went with > > /* > * In principle, there is nothing to prevent a scan over a large, highly > * corrupted table from using workmem worth of memory building up the > * tuplestore. That's ok, but if we also leak the msg argument memory > * until the end of the query, we could exceed workmem by more than a > * trivial amount. Therefore, free the msg argument each time we are > * called rather than waiting for our current memory context to be freed. > */ > > > +/* > > + * check_tuphdr_xids > > + * > > + * Determine whether tuples are visible for verification. Similar to > > + * HeapTupleSatisfiesVacuum, but with critical differences. > > + * > > + * 1) Does not touch hint bits. It seems imprudent to write hint bits > > + * to a table during a corruption check. > > + * 2) Only makes a boolean determination of whether verification should > > + * see the tuple, rather than doing extra work for vacuum-related > > + * categorization. > > + * > > + * The caller should already have checked that xmin and xmax are not out > > of > > + * bounds for the relation. > > + */ > > > > First, check_tuphdr_xids() doesn't seem like a very good name. If you > > have a function with that name and, like this one, it returns Boolean, > > what does true mean? What does false mean? Kinda hard to tell. And > > also, check the tuple header XIDs *for what*? If you called it, say, > > tuple_is_visible(), that would be self-evident. > > Changed. > > > Second, consider that we hold at least AccessShareLock on the relation > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there > > cannot be a concurrent modification to the tuple descriptor in > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is > > potentially using a non-current schema. If the tuple is > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the > > inserting transaction, or that transaction committed before we got our > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed > > before we got our lock. Or if it's both inserted and deleted in the > > same transaction, say, then that transaction committed before we got > > our lock or else contains no relevant DDL. IOW, I think you can check > > everything but dead tuples here. > > Ok, I have changed tuple_is_visible to return true rather than false for > those other cases. > > > Capitalization and punctuation for messages complaining about problems > > need to be consistent. verify_heapam() has "Invalid redirect line > > pointer offset %u out of bounds" which starts with a capital letter, > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower > > case, but in any event it should be the same. > > I standardized on all lowercase text, though I left embedded symbols and > constants such as LOCKED_ONLY alone. > > > Also, > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a > > debugging leftover or a very unclear complaint. > > Right. That has been changed to "old-style VACUUM FULL transaction ID %u is > invalid in this relation". > > > I think some real work > > needs to be put into the phrasing of these messages so that it's more > > clear exactly what is going on and why it's bad. For example the first > > example in this paragraph is clearly a problem of some kind, but it's > > not very clear exactly what is happening: is %u the offset of the > > invalid line redirect or the value to which it points? I don't think > > the phrasing is very grammatical, which makes it hard to tell which is > > meant, and I actually think it would be a good idea to include both > > things. > > Beware that every row returned from amcheck has more fields than just the > error message. > > blkno OUT bigint, > offnum OUT integer, > lp_off OUT smallint, > lp_flags OUT smallint, > lp_len OUT smallint, > attnum OUT integer, > chunk OUT integer, > msg OUT text > > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or > chunk in the message, it would be better to remove these things from messages > that include them. For the specific message under consideration, I've > converted the text to "line pointer redirection to item at offset number %u > is outside valid bounds %u .. %u". That avoids duplicating the offset > information of the referring item, while reporting to offset of the referred > item. > > > Project policy is generally against splitting a string across multiple > > lines to fit within 80 characters. We like to fit within 80 > > characters, but we like to be able to grep for strings more, and > > breaking them up like this makes that harder. > > Thanks for clarifying the project policy. I joined these message strings > back together. > > > + confess(ctx, > > + pstrdup("corrupt toast chunk va_header")); > > > > This is another message that I don't think is very clear. There's two > > elements to that. One is that the phrasing is not very good, and the > > other is that there are no % escapes > > Changed to "corrupt extended toast chunk with sequence number %d has invalid > varlena header %0x". I think all the other information about where the > corruption was found is already present in the other returned columns. > > > What's somebody going to do when > > they see this message? First, they're probably going to have to look > > at the code to figure out in which circumstances it gets generated; > > that's a sign that the message isn't phrased clearly enough. That will > > tell them that an unexpected bit pattern has been found, but not what > > that unexpected bit pattern actually was. So then, they're going to > > have to try to find the relevant va_header by some other means and > > fish out the relevant bit so that they can see what actually went > > wrong. > > Right. > > > > > + * Checks the current attribute as tracked in ctx for corruption. > > Records > > + * any corruption found in ctx->corruption. > > + * > > + * > > > > Extra blank line. > > Fixed. > > > + Form_pg_attribute thisatt = > > TupleDescAttr(RelationGetDescr(ctx->rel), > > + > > ctx->attnum); > > > > Maybe you could avoid the line wrap by declaring this without > > initializing it, and then initializing it as a separate statement. > > Yes, I like that better. I did not need to do the same with infomask, but it > looks better to me to break the declaration and initialization for both, so I > did that. > > > > > + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > > > %u)", > > + > > ctx->tuphdr->t_hoff, ctx->offset, > > + ctx->lp_len)); > > > > Uggh! This isn't even remotely an English sentence. I don't think > > formulas are the way to go here, but I like the idea of formulas in > > some places and written-out messages in others even less. I guess the > > complaint here in English is something like "tuple attribute %d should > > start at offset %u, but tuple length is only %u" or something of that > > sort. Also, it seems like this complaint really ought to have been > > reported on the *preceding* loop iteration, either complaining that > > (1) the fixed length attribute is more than the number of remaining > > bytes in the tuple or (2) the varlena header for the tuple specifies > > an excessively high length. It seems like you're blaming the wrong > > attribute for the problem. > > Yeah, and it wouldn't complain if the final attribute of a tuple was > overlong, as there wouldn't be a next attribute to blame it on. I've changed > it to report as you suggest, although it also still complains if the first > attribute starts outside the bounds of the tuple. The two error messages now > read as "tuple attribute should start at offset %u, but tuple length is only > %u" and "tuple attribute of length %u ends at offset %u, but tuple length is > only %u". > > > BTW, the header comments for this function (check_tuple_attribute) > > neglect to document the meaning of the return value. > > Fixed. > > > + confess(ctx, psprintf("tuple xmax = %u > > precedes relation " > > + > > "relfrozenxid = %u", > > > > This is another example of these messages needing work. The > > corresponding message from heap_prepare_freeze_tuple() is "found > > update xid %u from before relfrozenxid %u". That's better, because we > > don't normally include equals signs in our messages like this, and > > also because "relation relfrozenxid" is redundant. I think this should > > say something like "tuple xmax %u precedes relfrozenxid %u". > > > > + confess(ctx, psprintf("tuple xmax = %u is in > > the future", > > + xmax)); > > > > And then this could be something like "tuple xmax %u follows > > last-assigned xid %u". That would be more symmetric and more > > informative. > > Both of these have been changed. > > > + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > > > ctx->tuphdr->t_hoff) > > > > I think we should be able to predict the exact value of t_hoff and > > complain if it isn't precisely equal to the expected value. Or is that > > not possible for some reason? > > That is possible, and I've updated the error message to match. There are > cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff > value is wrong, but I've changed the code to just compute the length based on > the HEAP_HASNULL setting and use that as the expected value, and complain > when the actual value does not match the expected. That sidesteps the > problem of not knowing exactly which value to blame. > > > Is there some place that's checking that lp_len >= > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into > > the header? If not, there should be. > > Good catch. check_tuple() now does that before reading the header. > > > +$node->command_ok( > > > > + [ > > + 'pg_amcheck', '-p', $port, 'postgres' > > + ], > > + 'pg_amcheck all schemas and tables implicitly'); > > + > > +$node->command_ok( > > + [ > > + 'pg_amcheck', '-i', '-p', $port, 'postgres' > > + ], > > + 'pg_amcheck all schemas, tables and indexes'); > > > > I haven't really looked through the btree-checking and pg_amcheck > > parts of this much yet, but this caught my eye. Why would the default > > be to check tables but not indexes? I think the default ought to be to > > check everything we know how to check. > > I have changed the default to match your expectations. > > > > — > Mark Dilger > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > >