On Tue, Mar 27, 2018 at 6:48 AM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
> + * When index-to-heap verification is requested, a Bloom filter is used to
> + * fingerprint all tuples in the target index, as the index is traversed to
> + * verify its structure. A heap scan later verifies the presence in the heap
> + * of all index tuples fingerprinted within the Bloom filter.
> + *
>
> Is that correct? Aren't we actually verifying the presence in the index of
> all heap tuples?
I think that you could describe it either way. We're verifying the presence
in the index of heap tuples that ought to have been indexed (that is, we
check that each such heap tuple probes as present among the index tuples
that were fingerprinted). As I say above the callback routine
bt_tuple_present_callback(), we blame any corruption on the heap, since
that's more likely. It could actually be the index which is corrupt, though,
in rare cases where it manages to pass the index structure tests despite
being corrupt. This general orientation comes through in the comment that
you asked about.

> Can we come up with a better name for heapallindexed? May be
> "check_heapindexed"?

I don't think that that name is any better, and you're the first to mention
it in almost a year. I'd rather leave it. The fact that it's a check is
pretty clear from context.

> What I am not sure about is whether we can really examine an index which is
> valid but whose indcheckxmin hasn't crossed our xmin horizon? Notice that
> this amcheck might be running in a transaction block, probably in a
> repeatable-read isolation level, and hence GetTransactionSnapshot() might
> actually return a snapshot which can't yet read the index consistently. In
> practice, this is quite unlikely, but I think we should check for that case
> if we agree that it could be a problem.

You're right -- there is a narrow window for REPEATABLE READ and
SERIALIZABLE transactions. This is a regression in v6, the version that
removed the TransactionXmin test.

I am tempted to fix this by calling GetLatestSnapshot() instead of
GetTransactionSnapshot(). However, that has a problem of its own -- it
won't work in parallel mode, and we're dealing with parallel restricted
functions, not parallel unsafe functions. I don't want to change that to
fix such a narrow issue. IMV, a better fix is to treat this as a
serialization failure. The attached patch, which applies on top of v7,
shows what I mean.
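Stepping back to the question at the top, the fingerprint-then-probe scheme
is easy to sketch in miniature. This is a toy Python model (not amcheck's
actual C code; tuple strings and sizes are made up) showing why it can be
described either way: index tuples are what gets fingerprinted, but heap
tuples are what gets probed, and a heap tuple with no index entry surfaces
as a missing probe during the heap pass:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1 << 16, k=3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, item):
        # Derive k independent probe positions from salted SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, item.encode())).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(item))

# Pass 1: traverse the index (while checking its structure), fingerprinting
# every index tuple into the filter.
index_tuples = ["(1,'foo')", "(2,'bar')", "(3,'baz')"]
filt = BloomFilter()
for t in index_tuples:
    filt.add(t)

# Pass 2: scan the heap; every heap tuple that ought to be indexed must
# probe true.  "(4,'quux')" stands in for a corrupt, unindexed heap tuple.
heap_tuples = ["(1,'foo')", "(2,'bar')", "(3,'baz')", "(4,'quux')"]
missing = [t for t in heap_tuples if not filt.might_contain(t)]
print(missing)
```

Note the asymmetry of Bloom filter errors here: false negatives are
impossible, so a reported tuple is genuinely unindexed, while a false
positive merely means a rare corrupt tuple could escape detection.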
I think that this bug is practically impossible to hit, because we use the
xmin from the pg_index tuple during "is index safe to use?"
indcheckxmin/TransactionXmin checks (the patch that I attached adds a
similar though distinct check), which raises another question for a
REPEATABLE READ xact: how is a REPEATABLE READ transaction supposed to see
the pg_index entry to get the index relation's oid to call a verification
function in the first place? My point is that there is no need for a more
complicated solution than what I propose.

Note that the attached patch doesn't update the existing comments on
TransactionXmin, since I see this as a serialization error, which is a
distinct thing -- I'm not actually doing anything with TransactionXmin.

> The case with readonly mode is also interesting. Since we're scanning heap
> with SnapshotAny, heapscan may return us tuples which are RECENTLY_DEAD.
> So the question is: can this happen?
>
> - some concurrent index scan sees a heaptuple as DEAD and marks the index
>   entry as LP_DEAD
> - our index fingerprinting sees index tuple as LP_DEAD
> - our heap scan still sees the heaptuple as RECENTLY_DEAD
>
> Now that looks improbable given that we compute OldestXmin after our index
> fingerprinting was done, i.e. between step 2 and 3, and hence if a tuple
> looked DEAD to some OldestXmin/RecentGlobalXmin computed before we computed
> our OldestXmin, then surely our OldestXmin should also see the tuple DEAD.
> Or is there a corner case that we're missing?

I don't think so. The way we compute OldestXmin for
IndexBuildHeapRangeScan() is rather like a snapshot acquisition --
GetOldestXmin() locks the proc array in shared mode. As you pointed out,
the fact that it comes after everything else (fingerprinting) means that it
must be equal to or later than what index scans saw, the horizon that
allowed them to do the kill_prior_tuple() stuff (set LP_DEAD bits).
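The monotonicity argument can be written down as a toy model (plain Python
with made-up xid and horizon numbers; the function name is mine, and none of
this is PostgreSQL code):

```python
# Horizons only advance (outside Hot Standby), and we compute ours *after*
# fingerprinting, so ours is >= any horizon an index scan used to set LP_DEAD.

def tuple_dead(deleting_xid, horizon):
    """A deleted tuple counts as DEAD once its deleter precedes the horizon."""
    return deleting_xid < horizon

# An index scan judged the tuple DEAD against its horizon and set LP_DEAD:
scan_horizon = 100
deleter = 90
lp_dead = tuple_dead(deleter, scan_horizon)

# We compute OldestXmin afterwards, so it can only be equal or later:
our_oldest_xmin = 105
assert our_oldest_xmin >= scan_horizon

# Hence anything the scan judged DEAD is DEAD to our horizon too -- an
# LP_DEAD entry skipped by fingerprinting can't correspond to a heap tuple
# that our SnapshotAny scan would still expect to find in the index.
assert (not lp_dead) or tuple_dead(deleter, our_oldest_xmin)
print("ordering holds")
```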
The one exception is Hot Standby mode, where it's possible for the result
of GetOldestXmin() (OldestXmin) to go backwards across successive calls.
However, that's not a problem for us, because readonly heapallindexed
verification does not work in Hot Standby mode.

> Are there any interesting cases around INSERT_IN_PROGRESS/DELETE_IN_PROGRESS
> tuples, especially if those tuples were inserted/deleted by our own
> transaction? It's probably worth thinking about.
>
> Apart from that, comments in IndexBuildHeapRangeScan() claim that the
> function is called only with ShareLock and above, which is no longer true.
> We should check if that has any side-effects. I can't think of any, but
> better to verify and update the comments to reflect the new reality.

Those comments only seem to apply in the SnapshotAny/ShareLock case, which
is what amcheck calls the readonly case. When amcheck does not have a
ShareLock, it has an AccessShareLock on the heap relation instead, and we
imitate CREATE INDEX CONCURRENTLY. IndexBuildHeapRangeScan() doesn't
mention anything about CIC's heap ShareUpdateExclusiveLock (it just says
ShareLock), because that lock strength doesn't have anything to do with
IndexBuildHeapRangeScan() when it operates with an MVCC snapshot. I think
that this means that this patch doesn't need to update comments within
IndexBuildHeapRangeScan(). Perhaps that's a good idea anyway, but it seems
independent.

> One thing that is worth documenting/noting is the fact that a !readonly
> check will run with a long-duration registered snapshot, thus holding
> OldestXmin back.

The simple fact that you have a long-running statement already implies that
that'll happen, since the statement must have a snapshot that is at least
as old as the first snapshot that the first call to bt_check_index()
acquires. It's not a special case; it's exactly as bad as any statement
that takes the same amount of time to execute.
> Is there anything we can do to lessen that burden, like telling other
> backends to ignore our xmin while computing OldestXmin (like vacuum does)?

I don't think so. The transaction involved is still an ordinary user
transaction.

--
Peter Geoghegan
From 080e3b512a0ad80147cd8d6aaba02e9df5b92daf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Tue, 27 Mar 2018 13:33:33 -0700
Subject: [PATCH 3/3] Defend against heapallindexed using transaction
 snapshot.

---
 contrib/amcheck/verify_nbtree.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index c3380895a9..105945ee3b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -23,8 +23,10 @@
  */
 #include "postgres.h"
 
+#include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "access/transam.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "commands/tablecmds.h"
@@ -360,7 +362,30 @@ bt_check_every_level(Relation rel, Relation heaprel, bool readonly,
 	 * IndexBuildHeapScan().
 	 */
 	if (!state->readonly)
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 }
 
 /* Create context for page */
-- 
2.14.1