On Wed, Mar 30, 2016 at 12:46 PM, Kevin Grittner <kgri...@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 2:29 PM, Peter Geoghegan <p...@heroku.com> wrote:
>
>> [Does the patch allow dangling page pointers?]
>
>> Again, I don't want to prejudice anyone against your patch, which I
>> haven't read.
>
> I don't believe that the way the patch does its business opens any
> new vulnerabilities of this type.  If you see such after looking at
> the patch, let me know.

Okay, let me be more concrete about this. The patch does this:

> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -92,12 +92,21 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>      * need to use the horizon that includes slots, otherwise the data-only
>      * horizon can be used. Note that the toast relation of user defined
>      * relations are *not* considered catalog relations.
> +    *
> +    * It is OK to apply the old snapshot limit before acquiring the cleanup
> +    * lock because the worst that can happen is that we are not quite as
> +    * aggressive about the cleanup (by however many transaction IDs are
> +    * consumed between this point and acquiring the lock).  This allows us to
> +    * save significant overhead in the case where the page is found not to be
> +    * prunable.
>      */
>     if (IsCatalogRelation(relation) ||
>         RelationIsAccessibleInLogicalDecoding(relation))
>         OldestXmin = RecentGlobalXmin;
>     else
> -       OldestXmin = RecentGlobalDataXmin;
> +       OldestXmin =
> +               TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
> +                                                   relation);

This new intermediate function, TransactionIdLimitedForOldSnapshots(),
is called above to decide what OldestXmin actually becomes, based in
part on the new GUC old_snapshot_threshold:

> +/*
> + * TransactionIdLimitedForOldSnapshots
> + *
> + * Apply old snapshot limit, if any.  This is intended to be called for page
> + * pruning and table vacuuming, to allow old_snapshot_threshold to override
> + * the normal global xmin value.  Actual testing for snapshot too old will be
> + * based on whether a snapshot timestamp is prior to the threshold timestamp
> + * set in this function.
> + */
> +TransactionId
> +TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
> +                                   Relation relation)

So the OldestXmin assigned above may no longer be the
RecentGlobalDataXmin that is usually returned today, which is exactly
the point of this patch: VACUUM can be more aggressive in cleaning up
bloat, not unlike the non-catalog logical decoding case, on the theory
that we can reliably detect when that causes failures for old
snapshots, and just raise a "snapshot too old" error.
(RecentGlobalDataXmin is morally about the same as RecentGlobalXmin,
as far as this patch goes.)

So far, so good. It's okay that _bt_page_recyclable() never got the
memo about any of this...:

/*
 *  _bt_page_recyclable() -- Is an existing page recyclable?
 *
 * This exists to make sure _bt_getbuf and btvacuumscan have the same
 * policy about whether a page is safe to re-use.
 */
bool
_bt_page_recyclable(Page page)
{
    BTPageOpaque opaque;

   ...

    /*
     * Otherwise, recycle if deleted and too old to have any processes
     * interested in it.
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;
    return false;
}

...because this patch does nothing to advance RecentGlobalXmin (or
RecentGlobalDataXmin) itself more aggressively. It does make
vacuum_set_xid_limits() get a more aggressive cutoff point, but we
don't see that being passed back down by lazy vacuum here; within
_bt_page_recyclable(), we rely on the more conservative
RecentGlobalXmin, which is not subject to any clever optimization in
the patch.

Fortunately, this seems correct, since index scans will always succeed
in finding a deleted page, per nbtree README notes on
RecentGlobalXmin. Unfortunately, this does stop recycling from
happening early for B-Tree pages, even though that's probably safe in
principle. This is probably not so bad -- it just needs to be
considered when reviewing this patch (the same is true of logical
decoding's RecentGlobalDataXmin; it also doesn't appear in
_bt_page_recyclable(), and I guess that that was never a problem).
Index relations will not get smaller in some important cases, but they
will be made less bloated by VACUUM in a sense that's still probably
very useful. Maybe that explains some of what Jeff talked about.

I think another part of the problems that Jeff mentioned (with
pruning) could be this existing code within heap_hot_search_buffer():

        /*
         * If we can't see it, maybe no one else can either.  At caller
         * request, check whether all chain members are dead to all
         * transactions.
         */
        if (all_dead && *all_dead &&
            !HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
            *all_dead = false;

This is used within routines like btgettuple(), to do the LP_DEAD
thing to kill index tuples (not HOT chain pruning).

Aside: I'm not sure offhand why it might be okay, performance-wise,
that this code doesn't care about RecentGlobalDataXmin. Pruning was a
big part of why RecentGlobalDataXmin was added for logical decoding, I
thought, although I guess the _bt_killitems() stuff doesn't count as
pruning.

-- 
Peter Geoghegan

