On Sat, Nov 28, 2015 at 11:15 PM, Noah Misch <n...@leadboat.com> wrote:
> On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
>> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
>> >     /*
>> >      * Optional array of WAL flush LSNs associated with entries in the SLRU
>> >      * pages.  If not zero/NULL, we must flush WAL before writing pages (true
>> >      * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
>> >      * has lsn_groups_per_page entries per buffer slot, each containing the
>> >      * highest LSN known for a contiguous group of SLRU entries on that slot's
>> >      * page.
>> >      */
>> >     XLogRecPtr *group_lsn;
>> >     int         lsn_groups_per_page;
>>
>> Here's the multixact.c comment justifying it:
>>
>>  * XLOG interactions: this module generates an XLOG record whenever a new
>>  * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
>>  * whenever a new MultiXactId is defined.  This allows us to completely
>>  * rebuild the data entered since the last checkpoint during XLOG replay.
>>  * Because this is possible, we need not follow the normal rule of
>>  * "write WAL before data"; the only correctness guarantee needed is that
>>  * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
>>  * checkpoint is considered complete.  If a page does make it to disk ahead
>>  * of corresponding WAL records, it will be forcibly zeroed before use anyway.
>>  * Therefore, we don't need to mark our pages with LSN information; we have
>>  * enough synchronization already.
>>
>> The comment's justification is incomplete, though.  What of pages filled over
>> the course of multiple checkpoint cycles?
>
> Having studied this further, I think it is safe for a reason given elsewhere:
>
>    * Note: we need not flush this XLOG entry to disk before proceeding.  The
>    * only way for the MXID to be referenced from any data page is for
>    * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
>    * an XLOG record that must follow ours.  The normal LSN interlock between
>    * the data page and that XLOG record will ensure that our XLOG record
>    * reaches disk first.  If the SLRU members/offsets data reaches disk
>    * sooner than the XLOG record, we do not care because we'll overwrite it
>    * with zeroes unless the XLOG record is there too; see notes at top of
>    * this file.
>
> I find no flaw in the first three sentences.  In most cases, one of
> multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
> overwrite the widowed mxid data with zeroes before the system again writes
> data in that vicinity.  We can fail to do that if a backend stalls just after
> calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
> other backends filled the balance of those pages.  That's still okay; if we
> never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
> will survive recovery.  I drafted the attached comment update to consolidate
> and slightly correct this justification.  (If we were designing the multixact
> subsystem afresh today, I'd vote for following the write-WAL-before-data rule
> despite believing it is not needed.  The simplicity would be worth it.)
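(Aside, to make the quoted LSN interlock concrete: here is a toy,
self-contained model of the argument.  This is plain C, not PostgreSQL code;
xlog_insert() and xlog_flush() are invented stand-ins operating on simulated
LSNs.  The only point is the transitivity: the data page carries the
heap_lock_tuple() record's LSN, and flushing WAL through that LSN necessarily
makes the earlier XLOG_MULTIXACT_CREATE_ID record durable as well.)

#include <assert.h>
#include <stdio.h>

typedef unsigned long Lsn;

static Lsn wal_end = 0;         /* end of generated (simulated) WAL */
static Lsn flush_lsn = 0;       /* WAL durably flushed through here */

static Lsn
xlog_insert(const char *rec)
{
    printf("insert %-28s at LSN %lu\n", rec, wal_end + 1);
    return ++wal_end;
}

static void
xlog_flush(Lsn upto)
{
    if (flush_lsn < upto)
        flush_lsn = upto;
}

int
main(void)
{
    /* The multixact record is written first ... */
    Lsn create_lsn = xlog_insert("XLOG_MULTIXACT_CREATE_ID");

    /* ... and heap_lock_tuple()'s record must follow it. */
    Lsn lock_lsn = xlog_insert("heap_lock_tuple() record");

    /* The data page is stamped with the later record's LSN. */
    Lsn page_lsn = lock_lsn;

    /* Buffer-manager rule: flush WAL through the page LSN before writing. */
    xlog_flush(page_lsn);

    /* Transitivity: the create record is therefore durable too. */
    assert(flush_lsn >= create_lsn);
    printf("writing page at LSN %lu implies record at LSN %lu is on disk\n",
           page_lsn, create_lsn);
    return 0;
}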
> While contemplating the "stalls ... just after calling GetNewMultiXactId()"
> scenario, I noticed a race condition not involving any write-WAL-before-data
> violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:
>
> GetNewMultiXactId
>     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
>     ExtendMultiXactOffset()
>     ExtendMultiXactMember()
>     START_CRIT_SECTION()
>     (MultiXactState->nextMXact)++
>     MultiXactState->nextOffset += nmembers
>     LWLockRelease(MultiXactGenLock)
> XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
> RecordNewMultiXact
>     LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
>     *offptr = offset;               /* update pg_multixact/offsets SLRU page */
>     LWLockRelease(MultiXactOffsetControlLock)
>     LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
>     *memberptr = members[i].xid;    /* update pg_multixact/members SLRU page */
>     *flagsptr = flagsval;           /* update pg_multixact/members SLRU page */
>     LWLockRelease(MultiXactMemberControlLock)
> END_CRIT_SECTION
>
> Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
> create higher-numbered mxids.  They will skip over SLRU space the current
> backend reserved.  Future GetMultiXactIdMembers() calls rely critically on the
> current backend eventually finishing:
>
>    * 2. The next multixact may still be in process of being filled in: that
>    * is, another process may have done GetNewMultiXactId but not yet written
>    * the offset entry for that ID.  In that scenario, it is guaranteed that
>    * the offset entry for that multixact exists (because GetNewMultiXactId
>    * won't release MultiXactGenLock until it does) but contains zero
>    * (because we are careful to pre-zero offset pages).  Because
>    * GetNewMultiXactId will never return zero as the starting offset for a
>    * multixact, when we read zero as the next multixact's offset, we know we
>    * have this case.  We sleep for a bit and try again.
>
> A crash while paused this way makes the zero offset permanent.  After
> recovery, though no xmax carries the offset-zero mxid, GetMultiXactIdMembers()
> on the immediately previous mxid hangs indefinitely.  Also,
> pg_get_multixact_members(zero_offset_mxid) gets an assertion failure.  I have
> not thoroughly considered how best to fix these.  Test procedure:
>
> -- backend 1
> checkpoint;
> create table victim0 (c) as select 1;
> create table stall1 (c) as select 1;
> create table last2 (c) as select 1;
> begin;
> select c from victim0 for key share;
> select c from stall1 for key share;
> select c from last2 for key share;
>
> -- backend 2
> begin; update victim0 set c = c + 1; rollback;  -- burn one mxid
> -- In a shell, attach GDB to backend 2.  GDB will stop the next SQL command at
> -- XLogInsert() in MultiXactIdCreateFromMembers():
> --   gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
> select c from stall1 for key share;  -- stall in gdb while making an mxid
>
> -- backend 3
> select c from last2 for key share;  -- use one more mxid; flush WAL
>
> -- in GDB session, issue command: kill
>
> -- backend 1, after recovery
> select c from victim0;  -- hangs, uncancelable
>
> -- backend 2, after recovery: assertion failure
> select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;
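To see mechanically why backend 1 hangs, here is a stripped-down sketch of
the sleep-and-retry loop the quoted comment describes.  It is not the real
GetMultiXactIdMembers(); read_next_offset() is a hypothetical stand-in for
reading the next mxid's pg_multixact/offsets entry, which after the crash
above is permanently zero:

#include <stdio.h>
#include <unistd.h>

typedef unsigned int MultiXactOffset;

/*
 * Hypothetical stand-in for reading the next mxid's entry from the
 * pg_multixact/offsets SLRU.  After the crash described above, the entry
 * stays zero forever: no backend remains to run RecordNewMultiXact().
 */
static MultiXactOffset
read_next_offset(void)
{
    return 0;
}

int
main(void)
{
    MultiXactOffset nextOffset;

    /*
     * A zero offset is taken to mean "another backend is between
     * GetNewMultiXactId() and RecordNewMultiXact(); wait for it to finish".
     * With the filling backend dead, this loop never exits, which is the
     * indefinite hang seen on victim0 above.
     */
    while ((nextOffset = read_next_offset()) == 0)
    {
        printf("next mxid's offset still 0; sleeping and retrying\n");
        sleep(1);
    }
    printf("current mxid's members end before offset %u\n", nextOffset);
    return 0;
}

The sketch only shows that a plain zero cannot distinguish "still being
filled in" from "the filling backend died before RecordNewMultiXact()",
which is the crux of the reported hang.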
For tracking purposes, I updated https://wiki.postgresql.org/wiki/MultiXact_Bugs
and added this.  We're probably past due to spend some time cleaning up the
older issues that Andres's patch didn't fix (although it fixed a lot).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company