On Sat, Nov 28, 2015 at 11:15 PM, Noah Misch <n...@leadboat.com> wrote:
> On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
>> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
>> >     /*
>> >      * Optional array of WAL flush LSNs associated with entries in the SLRU
>> >      * pages.  If not zero/NULL, we must flush WAL before writing pages (true
>> >      * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
>> >      * has lsn_groups_per_page entries per buffer slot, each containing the
>> >      * highest LSN known for a contiguous group of SLRU entries on that slot's
>> >      * page.
>> >      */
>> >     XLogRecPtr *group_lsn;
>> >     int         lsn_groups_per_page;
>>
>> Here's the multixact.c comment justifying it:
>>
>>  * XLOG interactions: this module generates an XLOG record whenever a new
>>  * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
>>  * whenever a new MultiXactId is defined.  This allows us to completely
>>  * rebuild the data entered since the last checkpoint during XLOG replay.
>>  * Because this is possible, we need not follow the normal rule of
>>  * "write WAL before data"; the only correctness guarantee needed is that
>>  * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
>>  * checkpoint is considered complete.  If a page does make it to disk ahead
>>  * of corresponding WAL records, it will be forcibly zeroed before use anyway.
>>  * Therefore, we don't need to mark our pages with LSN information; we have
>>  * enough synchronization already.
>>
>> The comment's justification is incomplete, though.  What of pages filled over
>> the course of multiple checkpoint cycles?
>
> Having studied this further, I think it is safe for a reason given elsewhere:
>
>    * Note: we need not flush this XLOG entry to disk before proceeding.  The
>    * only way for the MXID to be referenced from any data page is for
>    * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
>    * an XLOG record that must follow ours.  The normal LSN interlock between
>    * the data page and that XLOG record will ensure that our XLOG record
>    * reaches disk first.  If the SLRU members/offsets data reaches disk
>    * sooner than the XLOG record, we do not care because we'll overwrite it
>    * with zeroes unless the XLOG record is there too; see notes at top of
>    * this file.
>
> I find no flaw in the first three sentences.  In most cases, one of
> multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
> overwrite the widowed mxid data with zeroes before the system again writes
> data in that vicinity.  We can fail to do that if a backend stalls just after
> calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
> other backends filled the balance of those pages.  That's still okay; if we
> never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
> will survive recovery.  I drafted the attached comment update to consolidate
> and slightly correct this justification.  (If we were designing the multixact
> subsystem afresh today, I'd vote for following the write-WAL-before-data rule
> despite believing it is not needed.  The simplicity would be worth it.)
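(Aside, to make the quoted LSN interlock concrete: here is a toy,
self-contained model of the argument.  This is plain C, not PostgreSQL code;
xlog_insert() and xlog_flush() are invented stand-ins operating on simulated
LSNs.  The only point is the transitivity: the data page carries the
heap_lock_tuple() record's LSN, and flushing WAL through that LSN necessarily
makes the earlier XLOG_MULTIXACT_CREATE_ID record durable as well.)

#include <assert.h>
#include <stdio.h>

typedef unsigned long Lsn;

static Lsn wal_end = 0;         /* end of generated (simulated) WAL */
static Lsn flush_lsn = 0;       /* WAL durably flushed through here */

static Lsn
xlog_insert(const char *rec)
{
    printf("insert %-28s at LSN %lu\n", rec, wal_end + 1);
    return ++wal_end;
}

static void
xlog_flush(Lsn upto)
{
    if (flush_lsn < upto)
        flush_lsn = upto;
}

int
main(void)
{
    /* The multixact record is written first ... */
    Lsn create_lsn = xlog_insert("XLOG_MULTIXACT_CREATE_ID");

    /* ... and heap_lock_tuple()'s record must follow it. */
    Lsn lock_lsn = xlog_insert("heap_lock_tuple() record");

    /* The data page is stamped with the later record's LSN. */
    Lsn page_lsn = lock_lsn;

    /* Buffer-manager rule: flush WAL through the page LSN before writing. */
    xlog_flush(page_lsn);

    /* Transitivity: the create record is therefore durable too. */
    assert(flush_lsn >= create_lsn);
    printf("writing page at LSN %lu implies record at LSN %lu is on disk\n",
           page_lsn, create_lsn);
    return 0;
}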
> While contemplating the "stalls ... just after calling GetNewMultiXactId()"
> scenario, I noticed a race condition not involving any write-WAL-before-data
> violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:
>
> GetNewMultiXactId
>     LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
>     ExtendMultiXactOffset()
>     ExtendMultiXactMember()
>     START_CRIT_SECTION()
>     (MultiXactState->nextMXact)++
>     MultiXactState->nextOffset += nmembers
>     LWLockRelease(MultiXactGenLock)
> XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
> RecordNewMultiXact
>     LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
>     *offptr = offset;               /* update pg_multixact/offsets SLRU page */
>     LWLockRelease(MultiXactOffsetControlLock)
>     LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
>     *memberptr = members[i].xid;    /* update pg_multixact/members SLRU page */
>     *flagsptr = flagsval;           /* update pg_multixact/members SLRU page */
>     LWLockRelease(MultiXactMemberControlLock)
> END_CRIT_SECTION
>
> Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
> create higher-numbered mxids.  They will skip over SLRU space the current
> backend reserved.  Future GetMultiXactIdMembers() calls rely critically on the
> current backend eventually finishing:
>
>    * 2. The next multixact may still be in process of being filled in: that
>    * is, another process may have done GetNewMultiXactId but not yet written
>    * the offset entry for that ID.  In that scenario, it is guaranteed that
>    * the offset entry for that multixact exists (because GetNewMultiXactId
>    * won't release MultiXactGenLock until it does) but contains zero
>    * (because we are careful to pre-zero offset pages).  Because
>    * GetNewMultiXactId will never return zero as the starting offset for a
>    * multixact, when we read zero as the next multixact's offset, we know we
>    * have this case.  We sleep for a bit and try again.
>
> A crash while paused this way makes the zero offset permanent.  After
> recovery, though no xmax carries the offset-zero mxid, GetMultiXactIdMembers()
> on the immediately previous mxid hangs indefinitely.  Also,
> pg_get_multixact_members(zero_offset_mxid) gets an assertion failure.  I have
> not thoroughly considered how best to fix these.  Test procedure:
>
> -- backend 1
> checkpoint;
> create table victim0 (c) as select 1;
> create table stall1 (c) as select 1;
> create table last2 (c) as select 1;
> begin;
> select c from victim0 for key share;
> select c from stall1 for key share;
> select c from last2 for key share;
>
> -- backend 2
> begin; update victim0 set c = c + 1; rollback;  -- burn one mxid
> -- In a shell, attach GDB to backend 2.  GDB will stop the next SQL command at
> -- XLogInsert() in MultiXactIdCreateFromMembers():
> --   gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
> select c from stall1 for key share;  -- stall in gdb while making an mxid
>
> -- backend 3
> select c from last2 for key share;  -- use one more mxid; flush WAL
>
> -- in GDB session, issue command: kill
>
> -- backend 1, after recovery
> select c from victim0;  -- hangs, uncancelable
>
> -- backend 2, after recovery: assertion failure
> select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;
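To see mechanically why backend 1 hangs, here is a stripped-down sketch of
the sleep-and-retry loop the quoted comment describes.  It is not the real
GetMultiXactIdMembers(); read_next_offset() is a hypothetical stand-in for
reading the next mxid's pg_multixact/offsets entry, which after the crash
above is permanently zero:

#include <stdio.h>
#include <unistd.h>

typedef unsigned int MultiXactOffset;

/*
 * Hypothetical stand-in for reading the next mxid's entry from the
 * pg_multixact/offsets SLRU.  After the crash described above, the entry
 * stays zero forever: no backend remains to run RecordNewMultiXact().
 */
static MultiXactOffset
read_next_offset(void)
{
    return 0;
}

int
main(void)
{
    MultiXactOffset nextOffset;

    /*
     * A zero offset is taken to mean "another backend is between
     * GetNewMultiXactId() and RecordNewMultiXact(); wait for it to finish".
     * With the filling backend dead, this loop never exits, which is the
     * indefinite hang seen on victim0 above.
     */
    while ((nextOffset = read_next_offset()) == 0)
    {
        printf("next mxid's offset still 0; sleeping and retrying\n");
        sleep(1);
    }
    printf("current mxid's members end before offset %u\n", nextOffset);
    return 0;
}

The sketch only shows that a plain zero cannot distinguish "still being
filled in" from "the filling backend died before RecordNewMultiXact()",
which is the crux of the reported hang.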
For tracking purposes, I updated https://wiki.postgresql.org/wiki/MultiXact_Bugs
and added this.  We're probably past due to spend some time cleaning up the
older issues that Andres's patch didn't fix (although it fixed a lot).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company