On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
> > 	/*
> > 	 * Optional array of WAL flush LSNs associated with entries in the SLRU
> > 	 * pages.  If not zero/NULL, we must flush WAL before writing pages (true
> > 	 * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
> > 	 * has lsn_groups_per_page entries per buffer slot, each containing the
> > 	 * highest LSN known for a contiguous group of SLRU entries on that slot's
> > 	 * page.
> > 	 */
> > 	XLogRecPtr *group_lsn;
> > 	int			lsn_groups_per_page;
> >
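To make the quoted fields concrete, here is a minimal, self-contained sketch
of the rule they implement: when group_lsn[] is non-NULL, writing out a
buffer slot must first flush WAL through the highest LSN recorded for any
group of entries on that slot's page.  This is my own illustration, not the
slru.c code; xlog_flush() is a stand-in for XLogFlush().

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

static void
xlog_flush(XLogRecPtr upto)
{
	/* In the real server this step would be XLogFlush(upto). */
	printf("flush WAL through %llu before writing the page\n",
		   (unsigned long long) upto);
}

static void
flush_wal_before_slot_write(const XLogRecPtr *group_lsn,
							int lsn_groups_per_page, int slotno)
{
	/* pg_multixact, pg_subtrans, pg_notify: no interlock needed */
	if (group_lsn == NULL)
		return;

	/* group_lsn[] has lsn_groups_per_page entries per buffer slot */
	const XLogRecPtr *slot_lsns =
		group_lsn + (size_t) slotno * lsn_groups_per_page;
	XLogRecPtr	max_lsn = 0;

	for (int i = 0; i < lsn_groups_per_page; i++)
		if (slot_lsns[i] > max_lsn)
			max_lsn = slot_lsns[i];

	if (max_lsn != 0)			/* 0 stands in for InvalidXLogRecPtr */
		xlog_flush(max_lsn);
}

int
main(void)
{
	/* One buffer slot, four LSN groups; the page write must flush to 9000. */
	XLogRecPtr	group_lsn[] = {1234, 9000, 0, 42};

	flush_wal_before_slot_write(group_lsn, 4, 0);
	return 0;
}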
> Here's the multixact.c comment justifying it:
>
> * XLOG interactions: this module generates an XLOG record whenever a new
> * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
> * whenever a new MultiXactId is defined.  This allows us to completely
> * rebuild the data entered since the last checkpoint during XLOG replay.
> * Because this is possible, we need not follow the normal rule of
> * "write WAL before data"; the only correctness guarantee needed is that
> * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
> * checkpoint is considered complete.  If a page does make it to disk ahead
> * of corresponding WAL records, it will be forcibly zeroed before use anyway.
> * Therefore, we don't need to mark our pages with LSN information; we have
> * enough synchronization already.

The comment's justification is incomplete, though.  What of pages filled over
the course of multiple checkpoint cycles?

Having studied this further, I think it is safe for a reason given elsewhere:

    * Note: we need not flush this XLOG entry to disk before proceeding. The
    * only way for the MXID to be referenced from any data page is for
    * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
    * an XLOG record that must follow ours. The normal LSN interlock between
    * the data page and that XLOG record will ensure that our XLOG record
    * reaches disk first. If the SLRU members/offsets data reaches disk
    * sooner than the XLOG record, we do not care because we'll overwrite it
    * with zeroes unless the XLOG record is there too; see notes at top of
    * this file.

I find no flaw in the first three sentences.  In most cases, one of
multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
overwrite the widowed mxid data with zeroes before the system again writes
data in that vicinity.  We can fail to do that if a backend stalls just after
calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
other backends filled the balance of those pages.  That's still okay; if we
never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the
mxid will survive recovery.  I drafted the attached comment update to
consolidate and slightly correct this justification.  (If we were designing
the multixact subsystem afresh today, I'd vote for following the
write-WAL-before-data rule despite believing it is not needed.  The
simplicity would be worth it.)

While contemplating the "stalls ... just after calling GetNewMultiXactId()"
scenario, I noticed a race condition not involving any write-WAL-before-data
violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:

    GetNewMultiXactId
        LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
        ExtendMultiXactOffset()
        ExtendMultiXactMember()
        START_CRIT_SECTION()
        (MultiXactState->nextMXact)++
        MultiXactState->nextOffset += nmembers
        LWLockRelease(MultiXactGenLock)
    XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
    RecordNewMultiXact
        LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
        *offptr = offset;	/* update pg_multixact/offsets SLRU page */
        LWLockRelease(MultiXactOffsetControlLock)
        LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
        *memberptr = members[i].xid;	/* update pg_multixact/members SLRU page */
        *flagsptr = flagsval;	/* update pg_multixact/members SLRU page */
        LWLockRelease(MultiXactMemberControlLock)
    END_CRIT_SECTION

Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
create higher-numbered mxids.
They will skip over SLRU space the current backend reserved.  Future
GetMultiXactIdMembers() calls rely critically on the current backend
eventually finishing:

    * 2. The next multixact may still be in process of being filled in: that
    * is, another process may have done GetNewMultiXactId but not yet written
    * the offset entry for that ID.  In that scenario, it is guaranteed that
    * the offset entry for that multixact exists (because GetNewMultiXactId
    * won't release MultiXactGenLock until it does) but contains zero
    * (because we are careful to pre-zero offset pages).  Because
    * GetNewMultiXactId will never return zero as the starting offset for a
    * multixact, when we read zero as the next multixact's offset, we know we
    * have this case.  We sleep for a bit and try again.

A crash while paused this way makes the zero offset permanent.  After
recovery, though no xmax carries the offset-zero mxid,
GetMultiXactIdMembers() on the immediately previous mxid hangs indefinitely.
Also, pg_get_multixact_members(zero_offset_mxid) gets an assertion failure.
I have not thoroughly considered how best to fix these; a sketch of the wait
loop that hangs follows the test procedure below.

Test procedure:

-- backend 1
checkpoint;
create table victim0 (c) as select 1;
create table stall1 (c) as select 1;
create table last2 (c) as select 1;
begin;
select c from victim0 for key share;
select c from stall1 for key share;
select c from last2 for key share;

-- backend 2
begin;
update victim0 set c = c + 1;
rollback; -- burn one mxid
-- In a shell, attach GDB to backend 2.  GDB will stop the next SQL command at
-- XLogInsert() in MultiXactIdCreateFromMembers():
--   gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
select c from stall1 for key share; -- stall in gdb while making an mxid

-- backend 3
select c from last2 for key share; -- use one more mxid; flush WAL

-- in GDB session, issue command: kill

-- backend 1, after recovery
select c from victim0; -- hangs, uncancelable

-- backend 2, after recovery: assertion failure
select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;
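Here is the promised sketch of the corner-case-2 wait, as I read it; this is
my paraphrase rather than the multixact.c source.  To size multi N,
GetMultiXactIdMembers() reads the starting offset of multi N+1; zero is taken
to mean "creator still at work", so it sleeps and retries.  With the zero
made permanent by the crash, the retry loop never exits, which is the hang
seen on victim0 above.  read_offset_entry() is a hypothetical stand-in for
the pg_multixact/offsets SLRU lookup.

#include <stdint.h>
#include <unistd.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

/* Hypothetical SLRU lookup; after the crash scenario it returns 0 forever. */
static MultiXactOffset
read_offset_entry(MultiXactId multi)
{
	(void) multi;
	return 0;
}

static MultiXactOffset
wait_for_next_offset(MultiXactId multi)
{
	for (;;)
	{
		MultiXactOffset next = read_offset_entry(multi + 1);

		if (next != 0)
			return next;	/* creator finished; nmembers = next - our offset */

		/*
		 * Zero normally means another backend sits between
		 * GetNewMultiXactId() and RecordNewMultiXact(); sleep a bit and try
		 * again.  With a permanently zeroed entry, this retries forever.
		 */
		usleep(1000);
	}
}

int
main(void)
{
	/* Intentionally never returns, mirroring the observed hang. */
	(void) wait_for_next_offset(100);
	return 0;
}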
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66a2b6..c027f68 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -24,17 +24,21 @@
  * since it would get completely confused if someone inquired about a bogus
  * MultiXactId that pointed to an intermediate slot containing an XID.)
  *
- * XLOG interactions: this module generates an XLOG record whenever a new
- * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
- * whenever a new MultiXactId is defined.  This allows us to completely
- * rebuild the data entered since the last checkpoint during XLOG replay.
- * Because this is possible, we need not follow the normal rule of
- * "write WAL before data"; the only correctness guarantee needed is that
- * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
- * checkpoint is considered complete.  If a page does make it to disk ahead
- * of corresponding WAL records, it will be forcibly zeroed before use anyway.
- * Therefore, we don't need to mark our pages with LSN information; we have
- * enough synchronization already.
+ * XLOG interactions: this module generates a record whenever a new OFFSETs or
+ * MEMBERs page is initialized to zeroes, as well as an
+ * XLOG_MULTIXACT_CREATE_ID record whenever a new MultiXactId is defined.
+ * This module ignores the WAL rule "write xlog before data," because it
+ * suffices that actions recording a MultiXactId in a heap xmax do follow that
+ * rule.  The only way for the MXID to be referenced from any data page is for
+ * heap_lock_tuple() or heap_update() to have put it there, and each generates
+ * an XLOG record that must follow ours.  The normal LSN interlock between the
+ * data page and that XLOG record will ensure that our XLOG record reaches
+ * disk first.  If the SLRU members/offsets data reaches disk sooner than the
+ * XLOG records, we do not care; after recovery, no xmax will refer to it.  On
+ * the flip side, to ensure that all referenced entries _do_ reach disk, this
+ * module's XLOG records completely rebuild the data entered since the last
+ * checkpoint.  We flush and sync all dirty OFFSETs and MEMBERs pages to disk
+ * before each checkpoint is considered complete.
  *
  * Like clog.c, and unlike subtrans.c, we have to preserve state across
  * crashes and ensure that MXID and offset numbering increases monotonically
@@ -795,19 +799,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 */
 	multi = GetNewMultiXactId(nmembers, &offset);
 
-	/*
-	 * Make an XLOG entry describing the new MXID.
-	 *
-	 * Note: we need not flush this XLOG entry to disk before proceeding. The
-	 * only way for the MXID to be referenced from any data page is for
-	 * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
-	 * an XLOG record that must follow ours. The normal LSN interlock between
-	 * the data page and that XLOG record will ensure that our XLOG record
-	 * reaches disk first. If the SLRU members/offsets data reaches disk
-	 * sooner than the XLOG record, we do not care because we'll overwrite it
-	 * with zeroes unless the XLOG record is there too; see notes at top of
-	 * this file.
-	 */
+	/* Make an XLOG entry describing the new MXID. */
 	xlrec.mid = multi;
 	xlrec.moff = offset;
 	xlrec.nmembers = nmembers;
@@ -2037,7 +2029,11 @@ TrimMultiXact(void)
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
-	 * TrimCLOG() for motivation.
+	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
+	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
+	 * rule "write xlog before data," nextMXact successors may carry obsolete,
+	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
+	 * operates normally.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
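For readers without multixact.c open, the offsets-page arithmetic that the
TrimMultiXact() hunk relies on looks roughly like this.  This is my
restatement in the spirit of the macros near the top of multixact.c, with a
BLCKSZ of 8192 assumed; zero_offsets_page_tail() is a hypothetical helper
showing what "zero out the remainder of the current offsets page" means for
the entries following nextMXact.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

#define BLCKSZ 8192
#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))

#define MultiXactIdToOffsetPage(xid)  ((xid) / (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) ((xid) % (MultiXactId) MULTIXACT_OFFSETS_PER_PAGE)

/* Zero nextMXact's entry and everything after it on the same offsets page. */
static void
zero_offsets_page_tail(MultiXactOffset *page, MultiXactId nextMXact)
{
	int			entryno = MultiXactIdToOffsetEntry(nextMXact);

	memset(page + entryno, 0,
		   (MULTIXACT_OFFSETS_PER_PAGE - entryno) * sizeof(MultiXactOffset));
}

int
main(void)
{
	MultiXactOffset page[MULTIXACT_OFFSETS_PER_PAGE] = {0};
	MultiXactId	nextMXact = 4099;	/* arbitrary example value */

	page[MultiXactIdToOffsetEntry(nextMXact)] = 777;	/* pretend: obsolete data */
	zero_offsets_page_tail(page, nextMXact);
	printf("page %u entry %u now %u\n",
		   (unsigned) MultiXactIdToOffsetPage(nextMXact),
		   (unsigned) MultiXactIdToOffsetEntry(nextMXact),
		   (unsigned) page[MultiXactIdToOffsetEntry(nextMXact)]);
	return 0;
}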