On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sula...@gmail.com> wrote: > > Hi Mark, > > I think new structures should be listed in src/tools/pgindent/typedefs.list, > otherwise, pgindent might disturb its indentation. > > Regards, > Amul > > > On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger > <mark.dil...@enterprisedb.com> wrote: > > > > > > > > > On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmh...@gmail.com> wrote: > > > > > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dil...@enterprisedb.com> > > > wrote: > > >> The v10 patch without these ideas is here: > > > > > > Along the lines of what Alvaro was saying before, I think this > > > definitely needs to be split up into a series of patches. The commit > > > message for v10 describes it doing three pretty separate things, and I > > > think that argues for splitting it into a series of three patches. I'd > > > argue for this ordering: > > > > > > 0001 Refactoring existing amcheck btree checking functions to optionally > > > return corruption information rather than ereport'ing it. This is > > > used by the new pg_amcheck command line tool for reporting back to > > > the caller. > > > > > > 0002 Adding new function verify_heapam for checking a heap relation and > > > associated toast relation, if any, to contrib/amcheck. > > > > > > 0003 Adding new contrib module pg_amcheck, which is a command line > > > interface for running amcheck's verifications against tables and > > > indexes. > > > > > > It's too hard to review things like this when it's all mixed together. > > > > The v11 patch series is broken up as you suggest. > > > > > +++ b/contrib/amcheck/t/skipping.pl > > > > > > The name of this file is inconsistent with the tree's usual > > > convention, which is all stuff like 001_whatever.pl, except for > > > src/test/modules/brin, which randomly decided to use two digits > > > instead of three. There's no precedent for a test file with no leading > > > numeric digits. Also, what does "skipping" even have to do with what > > > the test is checking? Maybe it's intended to refer to the new error > > > handling "skipping" the actual error in favor of just reporting it > > > without stopping, but that's not really what the word "skipping" > > > normally means. Finally, it seems a bit over-engineered: do we really > > > need 183 test cases to check that detecting a problem doesn't lead to > > > an abort? Like, if that's the purpose of the test, I'd expect it to > > > check one corrupt relation and one non-corrupt relation, each with and > > > without the no-error behavior. And that's about it. Or maybe it's > > > talking about skipping pages during the checks, because those pages > > > are all-visible or all-frozen? It's not very clear to me what's going > > > on here. > > > > The "skipping" did originally refer to testing verify_heapam()'s option to > > skip all-visible or all-frozen blocks. I have renamed it > > 001_verify_heapam.pl, since it tests that function. > > > > > > > > + TransactionId nextKnownValidXid; > > > + TransactionId oldestValidXid; > > > > > > Please add explanatory comments indicating what these are intended to > > > mean. > > > > Done. > > > > > For most of the the structure members, the brief comments > > > already present seem sufficient; but here, more explanation looks > > > necessary and less is provided. The "Values for returning tuples" > > > could possibly also use some more detail. > > > > Ok, I've expanded the comments for these. > > > > > +#define HEAPCHECK_RELATION_COLS 8 > > > > > > I think this should really be at the top of the file someplace. > > > Sometimes people have adopted this style when the #define is only used > > > within the function that contains it, but that's not the case here. > > > > Done. > > > > > > > > + ereport(ERROR, > > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > > + errmsg("unrecognized parameter for 'skip': %s", skip), > > > + errhint("please choose from 'all visible', 'all frozen', " > > > + "or NULL"))); > > > > > > I think it would be better if we had three string values selecting the > > > different behaviors, and made the parameter NOT NULL but with a > > > default. It seems like that would be easier to understand. Right now, > > > I can tell that my options for what to skip are "all visible", "all > > > frozen", and, uh, some other thing that I don't know what it is. I'm > > > gonna guess the third option is to skip nothing, but it seems best to > > > make that explicit. Also, should we maybe consider spelling this > > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces? > > > Spaces in an option value seems a little icky to me somehow. > > > > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults > > to 'none'. I did not mark the function as strict, as I think NULL is a > > reasonable value (and the default) for startblock and endblock. > > > > > + int64 startblock = -1; > > > + int64 endblock = -1; > > > ... > > > + if (!PG_ARGISNULL(3)) > > > + startblock = PG_GETARG_INT64(3); > > > + if (!PG_ARGISNULL(4)) > > > + endblock = PG_GETARG_INT64(4); > > > ... > > > + if (startblock < 0) > > > + startblock = 0; > > > + if (endblock < 0 || endblock > ctx.nblocks) > > > + endblock = ctx.nblocks; > > > + > > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > > > > > > So, the user can specify a negative value explicitly and it will be > > > treated as the default, and an endblock value that's larger than the > > > relation size will be treated as the relation size. The way pg_prewarm > > > does the corresponding checks seems superior: null indicates the > > > default value, and any non-null value must be within range or you get > > > an error. Also, you seem to be treating endblock as the first block > > > that should not be checked, whereas pg_prewarm takes what seems to me > > > to be the more natural interpretation: the end block is the last block > > > that IS checked. If you do it this way, then someone who specifies the > > > same start and end block will check no blocks -- silently, I think. > > > > Under that regime, for relations with one block of data, (startblock=0, > > endblock=0) means "check the zero'th block", and for relations with no > > blocks of data, specifying any non-null (startblock,endblock) pair raises > > an exception. I don't like that too much, but I'm happy to defer to > > precedent. Since you say pg_prewarm works this way (I did not check), I > > have changed verify_heapam to do likewise. > > > > > + if (skip_all_frozen || skip_all_visible) > > > > > > Since you can't skip all frozen without skipping all visible, this > > > test could be simplified. Or you could introduce a three-valued enum > > > and test that skip_pages != SKIP_PAGES_NONE, which might be even > > > better. > > > > It works now with a three-valued enum. > > > > > + /* We must unlock the page from the prior iteration, if any */ > > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); > > > > > > I don't understand this assertion, and I don't understand the comment, > > > either. I think ctx.blkno can never be equal to InvalidBlockNumber > > > because we never set it to anything outside the range of 0..(endblock > > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer > > > because we just initialized it by calling ReadBufferExtended(). So I > > > think this assertion would still pass if we wrote && rather than ||. > > > But even then, I don't know what that has to do with the comment or > > > why it even makes sense to have an assertion for that in the first > > > place. > > > > Yes, it is vestigial. Removed. > > > > > + /* > > > + * Open the relation. We use ShareUpdateExclusive to prevent > > > concurrent > > > + * vacuums from changing the relfrozenxid, relminmxid, or > > > advancing the > > > + * global oldestXid to be newer than those. This protection > > > saves us from > > > + * having to reacquire the locks and recheck those minimums for > > > every > > > + * tuple, which would be expensive. > > > + */ > > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > > > > > I don't think we'd need to recheck for every tuple, would we? Just for > > > cases where there's an apparent violation of the rules. > > > > It's a bit fuzzy what an "apparent violation" might be if both ends of the > > range of valid xids may be moving, and arbitrarily much. It's also not > > clear how often to recheck, since you'd be dealing with a race condition no > > matter how often you check. Perhaps the comments shouldn't mention how > > often you'd have to recheck, since there is no really defensible choice for > > that. I removed the offending sentence. > > > > > I guess that > > > could still be expensive if there's a lot of them, but needing > > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little > > > unfortunate. > > > > I welcome strategies that would allow for taking a lesser lock. > > > > > It's also unclear to me why this concerns itself with relfrozenxid and > > > the cluster-wide oldestXid value but not with datfrozenxid. It seems > > > like if we're going to sanity-check the relfrozenxid against the > > > cluster-wide value, we ought to also check it against the > > > database-wide value. Checking neither would also seem like a plausible > > > choice. But it seems very strange to only check against the > > > cluster-wide value. > > > > If the relation has a normal relfrozenxid, then the oldest valid xid we can > > encounter in the table is relfrozenxid. Otherwise, each row needs to be > > compared against some other minimum xid value. > > > > Logically, that other minimum xid value should be the oldest valid xid for > > the database, which must logically be at least as old as any valid row in > > the table and no older than the oldest valid xid for the cluster. > > > > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be > > believed, and if I am reading them correctly, the stored value for the > > oldest valid xid in the database has been known to be corrupted by bugs in > > pg_upgrade. This is awful. If I compare the xid of a row in a table > > against the oldest xid value for the database, and the xid of the row is > > older, what can I do? I don't have a principled basis for determining > > which one of them is wrong. > > > > The logic in verify_heapam is conservative; it makes no guarantees about > > finding and reporting all corruption, but if it does report a row as > > corrupt, you can bank on that, bugs in verify_heapam itself not > > withstanding. I think this is a good choice; a tool with only false > > negatives is much more useful than one with both false positives and false > > negatives. > > > > I have added a comment about my reasoning to verify_heapam.c. I'm happy to > > be convinced of a better strategy for handling this situation. > > > > > > > > + StaticAssertStmt(InvalidOffsetNumber + 1 == > > > FirstOffsetNumber, > > > + "InvalidOffsetNumber > > > increments to FirstOffsetNumber"); > > > > > > If you are going to rely on this property, I agree that it is good to > > > check it. But it would be better to NOT rely on this property, and I > > > suspect the code can be written quite cleanly without relying on it. > > > And actually, that's what you did, because you first set ctx.offnum = > > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in > > > the loop initializer. So AFAICS the first initializer, and the static > > > assert, are pointless. > > > > Ah, right you are. Removed. > > > > > > > > + if (ItemIdIsRedirected(ctx.itemid)) > > > + { > > > + uint16 redirect = > > > ItemIdGetRedirect(ctx.itemid); > > > + if (redirect <= SizeOfPageHeaderData > > > || redirect >= ph->pd_lower) > > > ... > > > + if ((redirect - SizeOfPageHeaderData) > > > % sizeof(uint16)) > > > > > > I think that ItemIdGetRedirect() returns an offset, not a byte > > > position. So the expectation that I would have is that it would be any > > > integer >= 0 and <= maxoff. Am I confused? > > > > I think you are right about it returning an offset, which should be between > > FirstOffsetNumber and maxoff, inclusive. I have updated the checks. > > > > > BTW, it seems like it might > > > be good to complain if the item to which it points is LP_UNUSED... > > > AFAIK that shouldn't happen. > > > > Thanks for mentioning that. It now checks for that. > > > > > + errmsg("\"%s\" is not a heap AM", > > > > > > I think the correct wording would be just "is not a heap." The "heap > > > AM" is the thing in pg_am, not a specific table. > > > > Fixed. > > > > > +confess(HeapCheckContext * ctx, char *msg) > > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) > > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > > > > > > This is what happens when you pgindent without adding all the right > > > things to typedefs.list first ... or when you don't pgindent and have > > > odd ideas about how to indent things. > > > > Hmm. I don't see the three lines of code you are quoting. Which patch is > > that from? > > > > > > > > + /* > > > + * In principle, there is nothing to prevent a scan over a large, > > > highly > > > + * corrupted table from using workmem worth of memory building up > > > the > > > + * tuplestore. Don't leak the msg argument memory. > > > + */ > > > + pfree(msg); > > > > > > Maybe change the second sentence to something like: "That should be > > > OK, else the user can lower work_mem, but we'd better not leak any > > > additional memory." > > > > It may be a little wordy, but I went with > > > > /* > > * In principle, there is nothing to prevent a scan over a large, highly > > * corrupted table from using workmem worth of memory building up the > > * tuplestore. That's ok, but if we also leak the msg argument memory > > * until the end of the query, we could exceed workmem by more than a > > * trivial amount. Therefore, free the msg argument each time we are > > * called rather than waiting for our current memory context to be > > freed. > > */ > > > > > +/* > > > + * check_tuphdr_xids > > > + * > > > + * Determine whether tuples are visible for verification. Similar to > > > + * HeapTupleSatisfiesVacuum, but with critical differences. > > > + * > > > + * 1) Does not touch hint bits. It seems imprudent to write hint bits > > > + * to a table during a corruption check. > > > + * 2) Only makes a boolean determination of whether verification should > > > + * see the tuple, rather than doing extra work for vacuum-related > > > + * categorization. > > > + * > > > + * The caller should already have checked that xmin and xmax are not > > > out of > > > + * bounds for the relation. > > > + */ > > > > > > First, check_tuphdr_xids() doesn't seem like a very good name. If you > > > have a function with that name and, like this one, it returns Boolean, > > > what does true mean? What does false mean? Kinda hard to tell. And > > > also, check the tuple header XIDs *for what*? If you called it, say, > > > tuple_is_visible(), that would be self-evident. > > > > Changed. > > > > > Second, consider that we hold at least AccessShareLock on the relation > > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there > > > cannot be a concurrent modification to the tuple descriptor in > > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is > > > potentially using a non-current schema. If the tuple is > > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the > > > inserting transaction, or that transaction committed before we got our > > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or > > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed > > > before we got our lock. Or if it's both inserted and deleted in the > > > same transaction, say, then that transaction committed before we got > > > our lock or else contains no relevant DDL. IOW, I think you can check > > > everything but dead tuples here. > > > > Ok, I have changed tuple_is_visible to return true rather than false for > > those other cases. > > > > > Capitalization and punctuation for messages complaining about problems > > > need to be consistent. verify_heapam() has "Invalid redirect line > > > pointer offset %u out of bounds" which starts with a capital letter, > > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither > > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower > > > case, but in any event it should be the same. > > > > I standardized on all lowercase text, though I left embedded symbols and > > constants such as LOCKED_ONLY alone. > > > > > Also, > > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a > > > debugging leftover or a very unclear complaint. > > > > Right. That has been changed to "old-style VACUUM FULL transaction ID %u > > is invalid in this relation". > > > > > I think some real work > > > needs to be put into the phrasing of these messages so that it's more > > > clear exactly what is going on and why it's bad. For example the first > > > example in this paragraph is clearly a problem of some kind, but it's > > > not very clear exactly what is happening: is %u the offset of the > > > invalid line redirect or the value to which it points? I don't think > > > the phrasing is very grammatical, which makes it hard to tell which is > > > meant, and I actually think it would be a good idea to include both > > > things. > > > > Beware that every row returned from amcheck has more fields than just the > > error message. > > > > blkno OUT bigint, > > offnum OUT integer, > > lp_off OUT smallint, > > lp_flags OUT smallint, > > lp_len OUT smallint, > > attnum OUT integer, > > chunk OUT integer, > > msg OUT text > > > > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or > > chunk in the message, it would be better to remove these things from > > messages that include them. For the specific message under consideration, > > I've converted the text to "line pointer redirection to item at offset > > number %u is outside valid bounds %u .. %u". That avoids duplicating the > > offset information of the referring item, while reporting to offset of the > > referred item. > > > > > Project policy is generally against splitting a string across multiple > > > lines to fit within 80 characters. We like to fit within 80 > > > characters, but we like to be able to grep for strings more, and > > > breaking them up like this makes that harder. > > > > Thanks for clarifying the project policy. I joined these message strings > > back together.
In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to be joined. e.g: + /* check to see if caller supports us returning a tuplestore */ + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot " + "accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("materialize mode required, but it is not allowed " + "in this context"))); > > > > > + confess(ctx, > > > + pstrdup("corrupt toast chunk va_header")); > > > > > > This is another message that I don't think is very clear. There's two > > > elements to that. One is that the phrasing is not very good, and the > > > other is that there are no % escapes > > > > Changed to "corrupt extended toast chunk with sequence number %d has > > invalid varlena header %0x". I think all the other information about where > > the corruption was found is already present in the other returned columns. > > > > > What's somebody going to do when > > > they see this message? First, they're probably going to have to look > > > at the code to figure out in which circumstances it gets generated; > > > that's a sign that the message isn't phrased clearly enough. That will > > > tell them that an unexpected bit pattern has been found, but not what > > > that unexpected bit pattern actually was. So then, they're going to > > > have to try to find the relevant va_header by some other means and > > > fish out the relevant bit so that they can see what actually went > > > wrong. > > > > Right. > > > > > > > > + * Checks the current attribute as tracked in ctx for corruption. > > > Records > > > + * any corruption found in ctx->corruption. > > > + * > > > + * > > > > > > Extra blank line. > > > > Fixed. > > > > > + Form_pg_attribute thisatt = > > > TupleDescAttr(RelationGetDescr(ctx->rel), > > > + > > > ctx->attnum); > > > > > > Maybe you could avoid the line wrap by declaring this without > > > initializing it, and then initializing it as a separate statement. > > > > Yes, I like that better. I did not need to do the same with infomask, but > > it looks better to me to break the declaration and initialization for both, > > so I did that. > > > > > > > > + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > > > > %u)", > > > + > > > ctx->tuphdr->t_hoff, ctx->offset, > > > + ctx->lp_len)); > > > > > > Uggh! This isn't even remotely an English sentence. I don't think > > > formulas are the way to go here, but I like the idea of formulas in > > > some places and written-out messages in others even less. I guess the > > > complaint here in English is something like "tuple attribute %d should > > > start at offset %u, but tuple length is only %u" or something of that > > > sort. Also, it seems like this complaint really ought to have been > > > reported on the *preceding* loop iteration, either complaining that > > > (1) the fixed length attribute is more than the number of remaining > > > bytes in the tuple or (2) the varlena header for the tuple specifies > > > an excessively high length. It seems like you're blaming the wrong > > > attribute for the problem. > > > > Yeah, and it wouldn't complain if the final attribute of a tuple was > > overlong, as there wouldn't be a next attribute to blame it on. I've > > changed it to report as you suggest, although it also still complains if > > the first attribute starts outside the bounds of the tuple. The two error > > messages now read as "tuple attribute should start at offset %u, but tuple > > length is only %u" and "tuple attribute of length %u ends at offset %u, but > > tuple length is only %u". > > > > > BTW, the header comments for this function (check_tuple_attribute) > > > neglect to document the meaning of the return value. > > > > Fixed. > > > > > + confess(ctx, psprintf("tuple xmax = %u > > > precedes relation " > > > + > > > "relfrozenxid = %u", > > > > > > This is another example of these messages needing work. The > > > corresponding message from heap_prepare_freeze_tuple() is "found > > > update xid %u from before relfrozenxid %u". That's better, because we > > > don't normally include equals signs in our messages like this, and > > > also because "relation relfrozenxid" is redundant. I think this should > > > say something like "tuple xmax %u precedes relfrozenxid %u". > > > > > > + confess(ctx, psprintf("tuple xmax = %u is in > > > the future", > > > + xmax)); > > > > > > And then this could be something like "tuple xmax %u follows > > > last-assigned xid %u". That would be more symmetric and more > > > informative. > > > > Both of these have been changed. > > > > > + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > > > > ctx->tuphdr->t_hoff) > > > > > > I think we should be able to predict the exact value of t_hoff and > > > complain if it isn't precisely equal to the expected value. Or is that > > > not possible for some reason? > > > > That is possible, and I've updated the error message to match. There are > > cases where you can't know if the HEAP_HASNULL bit is wrong or if the > > t_hoff value is wrong, but I've changed the code to just compute the length > > based on the HEAP_HASNULL setting and use that as the expected value, and > > complain when the actual value does not match the expected. That sidesteps > > the problem of not knowing exactly which value to blame. > > > > > Is there some place that's checking that lp_len >= > > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into > > > the header? If not, there should be. > > > > Good catch. check_tuple() now does that before reading the header. > > > > > +$node->command_ok( > > > > > > + [ > > > + 'pg_amcheck', '-p', $port, 'postgres' > > > + ], > > > + 'pg_amcheck all schemas and tables implicitly'); > > > + > > > +$node->command_ok( > > > + [ > > > + 'pg_amcheck', '-i', '-p', $port, 'postgres' > > > + ], > > > + 'pg_amcheck all schemas, tables and indexes'); > > > > > > I haven't really looked through the btree-checking and pg_amcheck > > > parts of this much yet, but this caught my eye. Why would the default > > > be to check tables but not indexes? I think the default ought to be to > > > check everything we know how to check. > > > > I have changed the default to match your expectations. > > > > > > > > — > > Mark Dilger > > EnterpriseDB: http://www.enterprisedb.com > > The Enterprise PostgreSQL Company > > > > > >