Re: SimpleLruTruncate() mutual exclusion

Noah Misch Fri, 28 Jun 2019 10:07:01 -0700

On Sun, Feb 17, 2019 at 11:31:03PM -0800, Noah Misch wrote:
> I'm forking this thread from
> https://postgr.es/m/flat/20190202083822.gc32...@gust.leadboat.com, which
> reported a race condition involving the "apparent wraparound" safety check in
> SimpleLruTruncate():
> 
> On Wed, Feb 13, 2019 at 11:26:23PM -0800, Noah Misch wrote:
> > 1. The result of the test is valid only until we release the SLRU 
> > ControlLock,
> >    which we do before SlruScanDirCbDeleteCutoff() uses the cutoff to 
> > evaluate
> >    segments for deletion.  Once we release that lock, latest_page_number can
> >    advance.  This creates a TOCTOU race condition, allowing excess deletion:
> > 
> >    [local] test=# table trunc_clog_concurrency ;
> >    ERROR:  could not access status of transaction 2149484247
> >    DETAIL:  Could not open file "pg_xact/0801": No such file or directory.
> 
> > Fixes are available:
> 
> > b. Arrange so only one backend runs vac_truncate_clog() at a time.  Other 
> > than
> >    AsyncCtl, every SLRU truncation appears in vac_truncate_clog(), in a
> >    checkpoint, or in the startup process.  Hence, also arrange for only one
> >    backend to call SimpleLruTruncate(AsyncCtl) at a time.
> 
> More specifically, restrict vac_update_datfrozenxid() to one backend per
> database, and restrict vac_truncate_clog() and asyncQueueAdvanceTail() to one
> backend per cluster.  This, attached, was rather straightforward.


Rebased.  The conflicts were limited to comments and documentation.

> I wonder about performance in a database with millions of small relations,
> particularly considering my intent to back-patch this.  In such databases,
> vac_update_datfrozenxid() can be a major part of the VACUUM's cost.  Two
> things work in our favor.  First, vac_update_datfrozenxid() runs once per
> VACUUM command, not once per relation.  Second, Autovacuum has this logic:
> 
>        * ... we skip
>        * this if (1) we found no work to do and (2) we skipped at least one
>        * table due to concurrent autovacuum activity.  In that case, the other
>        * worker has already done it, or will do so when it finishes.
>        */
>       if (did_vacuum || !found_concurrent_worker)
>               vac_update_datfrozenxid();
> 
> That makes me relatively unworried.  I did consider some alternatives:
> 
> - Run vac_update_datfrozenxid()'s pg_class scan before taking a lock.  If we
>   find the need for pg_database updates, take the lock and scan pg_class again
>   to get final numbers.  This doubles the work in cases that end up taking the
>   lock, so I'm not betting it being a net win.
> 
> - Use LockWaiterCount() to skip vac_update_datfrozenxid() if some other
>   backend is already waiting.  This is similar to Autovacuum's
>   found_concurrent_worker test.  It is tempting.  I'm not proposing it,
>   because it changes the states possible when manual VACUUM completes.  Today,
>   you can predict post-VACUUM datfrozenxid from post-VACUUM relfrozenxid
>   values.  If manual VACUUM could skip vac_update_datfrozenxid() this way,
>   datfrozenxid could lag until some concurrent VACUUM finishes.

commit 647b5e3 (HEAD)
Author:     Noah Misch <n...@leadboat.com>
AuthorDate: Fri Jun 28 10:00:53 2019 -0700
Commit:     Noah Misch <n...@leadboat.com>
CommitDate: Fri Jun 28 10:00:53 2019 -0700

    Prevent concurrent SimpleLruTruncate() for any given SLRU.
    
    The SimpleLruTruncate() header comment states the new coding rule.  This
    closes race conditions that presented rare opportunities for data loss.
    Symptoms and recommendations to users mirror the previous commit[1].
    Back-patch to 9.4 (all supported versions).
    
    Reviewed by TBD.
    
    Discussion: https://postgr.es/m/TBD
    
    [1] slru-truncate-modulo-v2.patch of 
https://postgr.es/m/20190217040913.ga1305...@rfd.leadboat.com

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c..dc82008 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -885,7 +885,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34 
  0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="66"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1080,6 +1080,16 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   
11:34   0:00 postgres: ser
          the oldest transaction id available to it.</entry>
         </row>
         <row>
+         <entry><literal>WrapLimitsVacuumLock</literal></entry>
+         <entry>Waiting to update limits on transaction id and multixact
+         consumption.</entry>
+        </row>
+        <row>
+         <entry><literal>AsyncQueueTailLock</literal></entry>
+         <entry>Waiting to update limit on notification message
+         storage.</entry>
+        </row>
+        <row>
          <entry><literal>clog</literal></entry>
          <entry>Waiting for I/O on a clog (transaction status) buffer.</entry>
         </row>
@@ -1169,7 +1179,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   
11:34   0:00 postgres: ser
          counters during Parallel Hash plan execution.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>Lock</literal></entry>
+         <entry morerows="10"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
          <entry>Waiting to acquire a lock on a relation.</entry>
         </row>
@@ -1178,6 +1188,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   
11:34   0:00 postgres: ser
          <entry>Waiting to extend a relation.</entry>
         </row>
         <row>
+         <entry><literal>frozen IDs</literal></entry>
+         <entry>Waiting to
+         update 
<structname>pg_database</structname>.<structfield>datfrozenxid</structfield>
+         and 
<structname>pg_database</structname>.<structfield>datminmxid</structfield>.</entry>
+        </row>
+        <row>
          <entry><literal>page</literal></entry>
          <entry>Waiting to acquire a lock on page of a relation.</entry>
         </row>
diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index 2dbc65b..a3094b4 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1169,6 +1169,14 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 
 /*
  * Remove all segments before the one holding the passed page number
+ *
+ * All SLRUs prevent concurrent calls to this function, either with an LWLock
+ * or by calling it only as part of a checkpoint.  Mutual exclusion must begin
+ * before computing cutoffPage.  Mutual exclusion must end after any limit
+ * update that would permit other backends to write fresh data into the
+ * segment immediately preceding the one containing cutoffPage.  Otherwise,
+ * when the SLRU is quite full, SimpleLruTruncate() might delete that segment
+ * after it has accrued freshly-written data.
  */
 void
 SimpleLruTruncate(SlruCtl ctl, int cutoffPage)
diff --git a/src/backend/access/transam/subtrans.c 
b/src/backend/access/transam/subtrans.c
index e667fd0..8159427 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -349,8 +349,8 @@ ExtendSUBTRANS(TransactionId newestXact)
 /*
  * Remove all SUBTRANS segments before the one holding the passed transaction 
ID
  *
- * This is normally called during checkpoint, with oldestXact being the
- * oldest TransactionXmin of any running transaction.
+ * oldestXact is the oldest TransactionXmin of any running transaction.  This
+ * is called only during checkpoint.
  */
 void
 TruncateSUBTRANS(TransactionId oldestXact)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 738e6ec..365cdd9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -223,19 +223,22 @@ typedef struct QueueBackendStatus
 /*
  * Shared memory state for LISTEN/NOTIFY (excluding its SLRU stuff)
  *
- * The AsyncQueueControl structure is protected by the AsyncQueueLock.
+ * The AsyncQueueControl structure is protected by the AsyncQueueLock and
+ * AsyncQueueTailLock.
  *
- * When holding the lock in SHARED mode, backends may only inspect their own
- * entries as well as the head and tail pointers. Consequently we can allow a
- * backend to update its own record while holding only SHARED lock (since no
- * other backend will inspect it).
+ * When holding AsyncQueueLock in SHARED mode, backends may only inspect their
+ * own entries as well as the head and tail pointers. Consequently we can
+ * allow a backend to update its own record while holding only SHARED lock
+ * (since no other backend will inspect it).
  *
- * When holding the lock in EXCLUSIVE mode, backends can inspect the entries
- * of other backends and also change the head and tail pointers.
+ * When holding AsyncQueueLock in EXCLUSIVE mode, backends can inspect the
+ * entries of other backends and also change the head pointer. When holding
+ * both AsyncQueueLock and AsyncQueueTailLock in EXCLUSIVE mode, backends can
+ * change the tail pointer.
  *
  * AsyncCtlLock is used as the control lock for the pg_notify SLRU buffers.
- * In order to avoid deadlocks, whenever we need both locks, we always first
- * get AsyncQueueLock and then AsyncCtlLock.
+ * In order to avoid deadlocks, whenever we need multiple locks, we first get
+ * AsyncQueueTailLock, then AsyncQueueLock, and lastly AsyncCtlLock.
  *
  * Each backend uses the backend[] array entry with index equal to its
  * BackendId (which can range from 1 to MaxBackends).  We rely on this to make
@@ -2010,6 +2013,10 @@ asyncQueueAdvanceTail(void)
        int                     newtailpage;
        int                     boundary;
 
+       /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+       LWLockAcquire(AsyncQueueTailLock, LW_EXCLUSIVE);
+
+       /* Compute the new tail. */
        LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
        min = QUEUE_HEAD;
        for (i = 1; i <= MaxBackends; i++)
@@ -2018,7 +2025,6 @@ asyncQueueAdvanceTail(void)
                        min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
        }
        oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
-       QUEUE_TAIL = min;
        LWLockRelease(AsyncQueueLock);
 
        /*
@@ -2038,6 +2044,17 @@ asyncQueueAdvanceTail(void)
                 */
                SimpleLruTruncate(AsyncCtl, newtailpage);
        }
+
+       /*
+        * Advertise the new tail.  This changes asyncQueueIsFull()'s verdict 
for
+        * the segment immediately prior to the new tail, allowing fresh data 
into
+        * that segment.
+        */
+       LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
+       QUEUE_TAIL = min;
+       LWLockRelease(AsyncQueueLock);
+
+       LWLockRelease(AsyncQueueTailLock);
 }
 
 /*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index e7b379d..c81d371 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1281,6 +1281,14 @@ vac_update_datfrozenxid(void)
        bool            dirty = false;
 
        /*
+        * Restrict this task to one backend per database.  This avoids race
+        * conditions that would move datfrozenxid or datminmxid backward or 
call
+        * vac_truncate_clog() with a datfrozenxid preceding a datfrozenxid 
passed
+        * to an earlier vac_truncate_clog() call.
+        */
+       LockDatabaseFrozenIds(ExclusiveLock);
+
+       /*
         * Initialize the "min" calculation with GetOldestXmin, which is a
         * reasonable approximation to the minimum relfrozenxid for not-yet-
         * committed pg_class entries for new tables; see AddNewRelationTuple().
@@ -1469,6 +1477,9 @@ vac_truncate_clog(TransactionId frozenXID,
        bool            bogus = false;
        bool            frozenAlreadyWrapped = false;
 
+       /* Restrict task to one backend per cluster; see SimpleLruTruncate(). */
+       LWLockAcquire(WrapLimitsVacuumLock, LW_EXCLUSIVE);
+
        /* init oldest datoids to sync with my frozenXID/minMulti values */
        oldestxid_datoid = MyDatabaseId;
        minmulti_datoid = MyDatabaseId;
@@ -1578,6 +1589,8 @@ vac_truncate_clog(TransactionId frozenXID,
         */
        SetTransactionIdLimit(frozenXID, oldestxid_datoid);
        SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
+
+       LWLockRelease(WrapLimitsVacuumLock);
 }
 
 
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index f838b0f..1b898f0 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -461,6 +461,21 @@ UnlockRelationForExtension(Relation relation, LOCKMODE 
lockmode)
 }
 
 /*
+ *             LockDatabaseFrozenIds
+ *
+ * This allows one backend per database to execute vac_update_datfrozenxid().
+ */
+void
+LockDatabaseFrozenIds(LOCKMODE lockmode)
+{
+       LOCKTAG         tag;
+
+       SET_LOCKTAG_DATABASE_FROZEN_IDS(tag, MyDatabaseId);
+
+       (void) LockAcquire(&tag, lockmode, false, false);
+}
+
+/*
  *             LockPage
  *
  * Obtain a page-level lock.  This is currently used by some index access
@@ -1100,6 +1115,11 @@ DescribeLockTag(StringInfo buf, const LOCKTAG *tag)
                                                         tag->locktag_field2,
                                                         tag->locktag_field1);
                        break;
+               case LOCKTAG_DATABASE_FROZEN_IDS:
+                       appendStringInfo(buf,
+                                                        
_("pg_database.datfrozenxid of database %u"),
+                                                        tag->locktag_field1);
+                       break;
                case LOCKTAG_PAGE:
                        appendStringInfo(buf,
                                                         _("page %u of relation 
%u of database %u"),
diff --git a/src/backend/storage/lmgr/lwlocknames.txt 
b/src/backend/storage/lmgr/lwlocknames.txt
index db47843..d517023 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,5 @@ MultiXactTruncationLock                               41
 OldSnapshotTimeMapLock                         42
 LogicalRepWorkerLock                           43
 CLogTruncationLock                                     44
+WrapLimitsVacuumLock                           45
+AsyncQueueTailLock                                     46
diff --git a/src/backend/utils/adt/lockfuncs.c 
b/src/backend/utils/adt/lockfuncs.c
index ffd1970..cbf0cab 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -26,6 +26,7 @@
 const char *const LockTagTypeNames[] = {
        "relation",
        "extend",
+       "frozen IDs",
        "page",
        "tuple",
        "transactionid",
@@ -245,6 +246,17 @@ pg_lock_status(PG_FUNCTION_ARGS)
                                nulls[8] = true;
                                nulls[9] = true;
                                break;
+                       case LOCKTAG_DATABASE_FROZEN_IDS:
+                               values[1] = 
ObjectIdGetDatum(instance->locktag.locktag_field1);
+                               nulls[2] = true;
+                               nulls[3] = true;
+                               nulls[4] = true;
+                               nulls[5] = true;
+                               nulls[6] = true;
+                               nulls[7] = true;
+                               nulls[8] = true;
+                               nulls[9] = true;
+                               break;
                        case LOCKTAG_PAGE:
                                values[1] = 
ObjectIdGetDatum(instance->locktag.locktag_field1);
                                values[2] = 
ObjectIdGetDatum(instance->locktag.locktag_field2);
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 099e18f..3f42000 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -59,6 +59,9 @@ extern bool ConditionalLockRelationForExtension(Relation 
relation,
                                                                                
                LOCKMODE lockmode);
 extern int     RelationExtensionLockWaiterCount(Relation relation);
 
+/* Lock to recompute pg_database.datfrozenxid in the current database */
+extern void LockDatabaseFrozenIds(LOCKMODE lockmode);
+
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
 extern bool ConditionalLockPage(Relation relation, BlockNumber blkno, LOCKMODE 
lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 986bb64..5dc7f87 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -139,6 +139,7 @@ typedef enum LockTagType
 {
        LOCKTAG_RELATION,                       /* whole relation */
        LOCKTAG_RELATION_EXTEND,        /* the right to extend a relation */
+       LOCKTAG_DATABASE_FROZEN_IDS,    /* pg_database.datfrozenxid */
        LOCKTAG_PAGE,                           /* one page of a relation */
        LOCKTAG_TUPLE,                          /* one physical tuple */
        LOCKTAG_TRANSACTION,            /* transaction (for waiting for xact 
done) */
@@ -195,6 +196,15 @@ typedef struct LOCKTAG
         (locktag).locktag_type = LOCKTAG_RELATION_EXTEND, \
         (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
 
+/* ID info for frozen IDs is DB OID */
+#define SET_LOCKTAG_DATABASE_FROZEN_IDS(locktag,dboid) \
+       ((locktag).locktag_field1 = (dboid), \
+        (locktag).locktag_field2 = 0, \
+        (locktag).locktag_field3 = 0, \
+        (locktag).locktag_field4 = 0, \
+        (locktag).locktag_type = LOCKTAG_DATABASE_FROZEN_IDS, \
+        (locktag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)
+
 /* ID info for a page is RELATION info + BlockNumber */
 #define SET_LOCKTAG_PAGE(locktag,dboid,reloid,blocknum) \
        ((locktag).locktag_field1 = (dboid), \

Re: SimpleLruTruncate() mutual exclusion

Reply via email to