On Wed, Mar 30, 2016 at 12:46 PM, Kevin Grittner <kgri...@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 2:29 PM, Peter Geoghegan <p...@heroku.com> wrote:
>
>> [Does the patch allow dangling page pointers?]
>
>> Again, I don't want to prejudice anyone against your patch, which I
>> haven't read.
>
> I don't believe that the way the patch does its business opens any
> new vulnerabilities of this type.  If you see such after looking at
> the patch, let me know.
Okay, let me be more concrete about this. The patch does this:

> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -92,12 +92,21 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>   * need to use the horizon that includes slots, otherwise the data-only
>   * horizon can be used. Note that the toast relation of user defined
>   * relations are *not* considered catalog relations.
> + *
> + * It is OK to apply the old snapshot limit before acquiring the cleanup
> + * lock because the worst that can happen is that we are not quite as
> + * aggressive about the cleanup (by however many transaction IDs are
> + * consumed between this point and acquiring the lock). This allows us to
> + * save significant overhead in the case where the page is found not to be
> + * prunable.
>   */
>  if (IsCatalogRelation(relation) ||
>      RelationIsAccessibleInLogicalDecoding(relation))
>      OldestXmin = RecentGlobalXmin;
>  else
> -    OldestXmin = RecentGlobalDataXmin;
> +    OldestXmin =
> +        TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
> +                                            relation);

This new intermediary function, TransactionIdLimitedForOldSnapshots(), is
called to decide what OldestXmin actually gets to be above, based in part
on the new GUC old_snapshot_threshold:

> +/*
> + * TransactionIdLimitedForOldSnapshots
> + *
> + * Apply old snapshot limit, if any.  This is intended to be called for page
> + * pruning and table vacuuming, to allow old_snapshot_threshold to override
> + * the normal global xmin value.  Actual testing for snapshot too old will be
> + * based on whether a snapshot timestamp is prior to the threshold timestamp
> + * set in this function.
> + */
> +TransactionId
> +TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
> +                                    Relation relation)

It might not be RecentGlobalDataXmin that is usually returned as OldestXmin
as it is today, which is exactly the point of this patch: VACUUM can be
more aggressive in cleaning up bloat, not unlike the non-catalog logical
decoding case, on the theory that we can reliably detect when that causes
failures for old snapshots, and just raise a "snapshot too old" error.
(RecentGlobalDataXmin is morally about the same as RecentGlobalXmin, as far
as this patch goes.)

So far, so good. It's okay that _bt_page_recyclable() never got the memo
about any of this...:

/*
 *  _bt_page_recyclable() -- Is an existing page recyclable?
 *
 * This exists to make sure _bt_getbuf and btvacuumscan have the same
 * policy about whether a page is safe to re-use.
 */
bool
_bt_page_recyclable(Page page)
{
    BTPageOpaque opaque;

    ...

    /*
     * Otherwise, recycle if deleted and too old to have any processes
     * interested in it.
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;
    return false;
}

...because this patch does nothing to advance RecentGlobalXmin (or
RecentGlobalDataXmin) itself more aggressively. It does make
vacuum_set_xid_limits() get a more aggressive cutoff point, but we don't
see that being passed back down by lazy vacuum here; within
_bt_page_recyclable(), we rely on the more conservative RecentGlobalXmin,
which is not subject to any clever optimization in the patch. Fortunately,
this seems correct, since index scans will always succeed in finding a
deleted page, per the nbtree README notes on RecentGlobalXmin.
Unfortunately, this does stop recycling from happening early for B-Tree
pages, even though that's probably safe in principle.
This is probably not so bad -- it just needs to be considered when
reviewing this patch (the same is true of logical decoding's
RecentGlobalDataXmin; it also doesn't appear in _bt_page_recyclable(), and
I guess that was never a problem). Index relations will not get smaller in
some important cases, but they will be made less bloated by VACUUM in a
sense that's still probably very useful. Maybe that explains some of what
Jeff talked about.

I think another part of the problems that Jeff mentioned (with pruning)
could be this existing code within heap_hot_search_buffer():

    /*
     * If we can't see it, maybe no one else can either.  At caller
     * request, check whether all chain members are dead to all
     * transactions.
     */
    if (all_dead && *all_dead &&
        !HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
        *all_dead = false;

This is used within routines like btgettuple(), to do the LP_DEAD thing to
kill index tuples (not HOT chain pruning).

Aside: Not sure offhand why it might be okay, performance-wise, that this
code doesn't care about RecentGlobalDataXmin; pruning was a big part of why
RecentGlobalDataXmin was added for logical decoding, I thought, although I
guess the _bt_killitems() stuff doesn't count as pruning.

--
Peter Geoghegan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers