On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
> > 	/*
> > 	 * Optional array of WAL flush LSNs associated with entries in the SLRU
> > 	 * pages.  If not zero/NULL, we must flush WAL before writing pages (true
> > 	 * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
> > 	 * has lsn_groups_per_page entries per buffer slot, each containing the
> > 	 * highest LSN known for a contiguous group of SLRU entries on that slot's
> > 	 * page.
> > 	 */
> > 	XLogRecPtr *group_lsn;
> > 	int			lsn_groups_per_page;
> >
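To make the quoted fields concrete, here is a minimal, self-contained sketch
of the rule they implement: when group_lsn[] is non-NULL, writing out a
buffer slot must first flush WAL through the highest LSN recorded for any
group of entries on that slot's page.  This is my own illustration, not the
slru.c code; xlog_flush() is a stand-in for XLogFlush().

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

static void
xlog_flush(XLogRecPtr upto)
{
	/* In the real server this step would be XLogFlush(upto). */
	printf("flush WAL through %llu before writing the page\n",
		   (unsigned long long) upto);
}

static void
flush_wal_before_slot_write(const XLogRecPtr *group_lsn,
							int lsn_groups_per_page, int slotno)
{
	/* pg_multixact, pg_subtrans, pg_notify: no interlock needed */
	if (group_lsn == NULL)
		return;

	/* group_lsn[] has lsn_groups_per_page entries per buffer slot */
	const XLogRecPtr *slot_lsns =
		group_lsn + (size_t) slotno * lsn_groups_per_page;
	XLogRecPtr	max_lsn = 0;

	for (int i = 0; i < lsn_groups_per_page; i++)
		if (slot_lsns[i] > max_lsn)
			max_lsn = slot_lsns[i];

	if (max_lsn != 0)			/* 0 stands in for InvalidXLogRecPtr */
		xlog_flush(max_lsn);
}

int
main(void)
{
	/* One buffer slot, four LSN groups; the page write must flush to 9000. */
	XLogRecPtr	group_lsn[] = {1234, 9000, 0, 42};

	flush_wal_before_slot_write(group_lsn, 4, 0);
	return 0;
}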
> Here's the multixact.c comment justifying it:
>
> * XLOG interactions: this module generates an XLOG record whenever a new
> * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
> * whenever a new MultiXactId is defined.  This allows us to completely
> * rebuild the data entered since the last checkpoint during XLOG replay.
> * Because this is possible, we need not follow the normal rule of
> * "write WAL before data"; the only correctness guarantee needed is that
> * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
> * checkpoint is considered complete.  If a page does make it to disk ahead
> * of corresponding WAL records, it will be forcibly zeroed before use anyway.
> * Therefore, we don't need to mark our pages with LSN information; we have
> * enough synchronization already.

The comment's justification is incomplete, though.  What of pages filled over
the course of multiple checkpoint cycles?

Having studied this further, I think it is safe for a reason given elsewhere:

    * Note: we need not flush this XLOG entry to disk before proceeding. The
    * only way for the MXID to be referenced from any data page is for
    * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
    * an XLOG record that must follow ours. The normal LSN interlock between
    * the data page and that XLOG record will ensure that our XLOG record
    * reaches disk first. If the SLRU members/offsets data reaches disk
    * sooner than the XLOG record, we do not care because we'll overwrite it
    * with zeroes unless the XLOG record is there too; see notes at top of
    * this file.

I find no flaw in the first three sentences.  In most cases, one of
multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
overwrite the widowed mxid data with zeroes before the system again writes
data in that vicinity.  We can fail to do that if a backend stalls just after
calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
other backends filled the balance of those pages.  That's still okay; if we
never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the
mxid will survive recovery.  I drafted the attached comment update to
consolidate and slightly correct this justification.  (If we were designing
the multixact subsystem afresh today, I'd vote for following the
write-WAL-before-data rule despite believing it is not needed.  The
simplicity would be worth it.)

While contemplating the "stalls ... just after calling GetNewMultiXactId()"
scenario, I noticed a race condition not involving any write-WAL-before-data
violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:

    GetNewMultiXactId
        LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
        ExtendMultiXactOffset()
        ExtendMultiXactMember()
        START_CRIT_SECTION()
        (MultiXactState->nextMXact)++
        MultiXactState->nextOffset += nmembers
        LWLockRelease(MultiXactGenLock)
    XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
    RecordNewMultiXact
        LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
        *offptr = offset;	/* update pg_multixact/offsets SLRU page */
        LWLockRelease(MultiXactOffsetControlLock)
        LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
        *memberptr = members[i].xid;	/* update pg_multixact/members SLRU page */
        *flagsptr = flagsval;	/* update pg_multixact/members SLRU page */
        LWLockRelease(MultiXactMemberControlLock)
    END_CRIT_SECTION

Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
create higher-numbered mxids.
They will skip over SLRU space the current backend reserved.  Future
GetMultiXactIdMembers() calls rely critically on the current backend
eventually finishing:

    * 2. The next multixact may still be in process of being filled in: that
    * is, another process may have done GetNewMultiXactId but not yet written
    * the offset entry for that ID.  In that scenario, it is guaranteed that
    * the offset entry for that multixact exists (because GetNewMultiXactId
    * won't release MultiXactGenLock until it does) but contains zero
    * (because we are careful to pre-zero offset pages).  Because
    * GetNewMultiXactId will never return zero as the starting offset for a
    * multixact, when we read zero as the next multixact's offset, we know we
    * have this case.  We sleep for a bit and try again.

A crash while paused this way makes the zero offset permanent.  After
recovery, though no xmax carries the offset-zero mxid,
GetMultiXactIdMembers() on the immediately previous mxid hangs indefinitely.
Also, pg_get_multixact_members(zero_offset_mxid) gets an assertion failure.
I have not thoroughly considered how best to fix these; a sketch of the wait
loop that hangs follows the test procedure below.

Test procedure:

-- backend 1
checkpoint;
create table victim0 (c) as select 1;
create table stall1 (c) as select 1;
create table last2 (c) as select 1;
begin;
select c from victim0 for key share;
select c from stall1 for key share;
select c from last2 for key share;

-- backend 2
begin;
update victim0 set c = c + 1;
rollback; -- burn one mxid
-- In a shell, attach GDB to backend 2.  GDB will stop the next SQL command at
-- XLogInsert() in MultiXactIdCreateFromMembers():
--   gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
select c from stall1 for key share; -- stall in gdb while making an mxid

-- backend 3
select c from last2 for key share; -- use one more mxid; flush WAL

-- in GDB session, issue command: kill

-- backend 1, after recovery
select c from victim0; -- hangs, uncancelable

-- backend 2, after recovery: assertion failure
select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;
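Here is the promised sketch of the corner-case-2 wait, as I read it; this is
my paraphrase rather than the multixact.c source.  To size multi N,
GetMultiXactIdMembers() reads the starting offset of multi N+1; zero is taken
to mean "creator still at work", so it sleeps and retries.  With the zero
made permanent by the crash, the retry loop never exits, which is the hang
seen on victim0 above.  read_offset_entry() is a hypothetical stand-in for
the pg_multixact/offsets SLRU lookup.

#include <stdint.h>
#include <unistd.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

/* Hypothetical SLRU lookup; after the crash scenario it returns 0 forever. */
static MultiXactOffset
read_offset_entry(MultiXactId multi)
{
	(void) multi;
	return 0;
}

static MultiXactOffset
wait_for_next_offset(MultiXactId multi)
{
	for (;;)
	{
		MultiXactOffset next = read_offset_entry(multi + 1);

		if (next != 0)
			return next;	/* creator finished; nmembers = next - our offset */

		/*
		 * Zero normally means another backend sits between
		 * GetNewMultiXactId() and RecordNewMultiXact(); sleep a bit and try
		 * again.  With a permanently zeroed entry, this retries forever.
		 */
		usleep(1000);
	}
}

int
main(void)
{
	/* Intentionally never returns, mirroring the observed hang. */
	(void) wait_for_next_offset(100);
	return 0;
}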
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66a2b6..c027f68 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -24,17 +24,21 @@
  * since it would get completely confused if someone inquired about a bogus
  * MultiXactId that pointed to an intermediate slot containing an XID.)
  *
- * XLOG interactions: this module generates an XLOG record whenever a new
- * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
- * whenever a new MultiXactId is defined.  This allows us to completely
- * rebuild the data entered since the last checkpoint during XLOG replay.
- * Because this is possible, we need not follow the normal rule of
- * "write WAL before data"; the only correctness guarantee needed is that
- * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
- * checkpoint is considered complete.  If a page does make it to disk ahead
- * of corresponding WAL records, it will be forcibly zeroed before use anyway.
- * Therefore, we don't need to mark our pages with LSN information; we have
- * enough synchronization already.
+ * XLOG interactions: this module generates a record whenever a new OFFSETs or
+ * MEMBERs page is initialized to zeroes, as well as an
+ * XLOG_MULTIXACT_CREATE_ID record whenever a new MultiXactId is defined.
+ * This module ignores the WAL rule "write xlog before data," because it
+ * suffices that actions recording a MultiXactId in a heap xmax do follow that
+ * rule.  The only way for the MXID to be referenced from any data page is for
+ * heap_lock_tuple() or heap_update() to have put it there, and each generates
+ * an XLOG record that must follow ours.  The normal LSN interlock between the
+ * data page and that XLOG record will ensure that our XLOG record reaches
+ * disk first.  If the SLRU members/offsets data reaches disk sooner than the
+ * XLOG records, we do not care; after recovery, no xmax will refer to it.  On
+ * the flip side, to ensure that all referenced entries _do_ reach disk, this
+ * module's XLOG records completely rebuild the data entered since the last
+ * checkpoint.  We flush and sync all dirty OFFSETs and MEMBERs pages to disk
+ * before each checkpoint is considered complete.
  *
  * Like clog.c, and unlike subtrans.c, we have to preserve state across
  * crashes and ensure that MXID and offset numbering increases monotonically
@@ -795,19 +799,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 */
 	multi = GetNewMultiXactId(nmembers, &offset);
 
-	/*
-	 * Make an XLOG entry describing the new MXID.
-	 *
-	 * Note: we need not flush this XLOG entry to disk before proceeding. The
-	 * only way for the MXID to be referenced from any data page is for
-	 * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
-	 * an XLOG record that must follow ours. The normal LSN interlock between
-	 * the data page and that XLOG record will ensure that our XLOG record
-	 * reaches disk first. If the SLRU members/offsets data reaches disk
-	 * sooner than the XLOG record, we do not care because we'll overwrite it
-	 * with zeroes unless the XLOG record is there too; see notes at top of
-	 * this file.
-	 */
+	/* Make an XLOG entry describing the new MXID. */
 	xlrec.mid = multi;
 	xlrec.moff = offset;
 	xlrec.nmembers = nmembers;
@@ -2037,7 +2029,11 @@ TrimMultiXact(void)
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
-	 * TrimCLOG() for motivation.
+	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
+	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
+	 * rule "write xlog before data," nextMXact successors may carry obsolete,
+	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
+	 * operates normally.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
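For readers without multixact.c open, the offsets-page arithmetic that the
TrimMultiXact() hunk relies on looks roughly like this.  This is my
restatement in the spirit of the macros near the top of multixact.c, with a
BLCKSZ of 8192 assumed; zero_offsets_page_tail() is a hypothetical helper
showing what "zero out the remainder of the current offsets page" means for
the entries following nextMXact.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

#define BLCKSZ 8192
#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))

#define MultiXactIdToOffsetPage(xid)  ((xid) / (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) ((xid) % (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)

/* Zero nextMXact's entry and everything after it on the same offsets page. */
static void
zero_offsets_page_tail(MultiXactOffset *page, MultiXactId nextMXact)
{
	int			entryno = MultiXactIdToOffsetEntry(nextMXact);

	memset(page + entryno, 0,
		   (MULTIXACT_OFFSETS_PER_PAGE - entryno) * sizeof(MultiXactOffset));
}

int
main(void)
{
	MultiXactOffset page[MULTIXACT_OFFSETS_PER_PAGE] = {0};
	MultiXactId	nextMXact = 4099;	/* arbitrary example value */

	page[MultiXactIdToOffsetEntry(nextMXact)] = 777;	/* pretend: obsolete data */
	zero_offsets_page_tail(page, nextMXact);
	printf("page %u entry %u now %u\n",
		   (unsigned) MultiXactIdToOffsetPage(nextMXact),
		   (unsigned) MultiXactIdToOffsetEntry(nextMXact),
		   (unsigned) page[MultiXactIdToOffsetEntry(nextMXact)]);
	return 0;
}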