On 2015-01-29 11:01:51 -0500, Robert Haas wrote: > On Wed, Jan 28, 2015 at 2:41 AM, Michael Paquier > <michael.paqu...@gmail.com> wrote: > > Andres Freund wrote: > >> I think this isn't particularly pretty, but it seems to be working well > >> enough, and changing it would be pretty invasive. So keeping in line > >> with all that code seems to be easier. > > OK, I'm convinced with this part to remove the call of > > LockSharedObjectForSession that uses dontWait and replace it by a loop > > in ResolveRecoveryConflictWithDatabase. > > That seems right to me, too.
It's slightly more complicated than that. The lock conflict should actually be resolved using ResolveRecoveryConflictWithLock()... That, combined with the race of connecting a actually already deleted database (see the XXXs I removed) seem to make the approach in here. Attached are two patches: 1) Infrastructure for attaching more kinds of locks on the startup process. 2) Use that infrastructure for database locks during replay. I'm not sure 2) alone would be sufficient justification for 1), but the nearby thread about basebackups also require similar infrastructure... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
>From 8531ca4d200d24ee45265774f7ead613563adca4 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 30 Jan 2015 09:28:17 +0100 Subject: [PATCH 1/4] Allow recovery lock infrastructure to not only hold relation locks. This is a preparatory commit to fix several bugs, separated out for easier review. The startup process could so far only properly acquire access exlusive locks on transactions. As it does not setup enough state to do lock queuing - primarily to be able to issue recovery conflict interrupts, and to release them when the primary releases locks - it has it's own infrastructure to manage locks. That infrastructure so far assumed that all locks are access exlusive locks on relations. Unfortunately some code in the startup process has to acquire other locks than what's supported by the aforementioned infrastructure in standby.c. Namely dbase_redo() has to acquire locks on the database objects. Also further such locks will be added soon, to fix a another bug. So this patch shanges the infrastructure to be able to acquire locks of different modes and locktags. Additionally allow acquiring more heavyweight relation logs on the standby than RowExclusive when acquired in session mode. Discussion: 20150120152819.gc24...@alap3.anarazel.de Backpatch all the way. --- src/backend/storage/ipc/standby.c | 120 ++++++++++++++++++-------------------- src/backend/storage/lmgr/lock.c | 7 +-- src/include/storage/standby.h | 2 +- 3 files changed, 61 insertions(+), 68 deletions(-) diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c index 292bed5..0502aab 100644 --- a/src/backend/storage/ipc/standby.c +++ b/src/backend/storage/ipc/standby.c @@ -38,10 +38,16 @@ int max_standby_archive_delay = 30 * 1000; int max_standby_streaming_delay = 30 * 1000; static List *RecoveryLockList; +typedef struct RecoveryLockListEntry +{ + TransactionId xid; + LOCKMODE lockmode; + LOCKTAG locktag; +} RecoveryLockListEntry; static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, ProcSignalReason reason); -static void ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid); +static void ResolveRecoveryConflictWithLock(LOCKTAG *tag, LOCKMODE mode); static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason); static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts); static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks); @@ -320,10 +326,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid) * us. This is rare enough that we do this as simply as possible: no wait, * just force them off immediately. * - * No locking is required here because we already acquired - * AccessExclusiveLock. Anybody trying to connect while we do this will - * block during InitPostgres() and then disconnect when they see the - * database has been removed. + * No locking is required here because we already acquired a + * AccessExclusiveLock on the database in dbase_redo(). Anybody trying to + * connect while we do this will block during InitPostgres() and then + * disconnect when they see the database has been removed. */ while (CountDBBackends(dbid) > 0) { @@ -338,14 +344,11 @@ ResolveRecoveryConflictWithDatabase(Oid dbid) } static void -ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid) +ResolveRecoveryConflictWithLock(LOCKTAG *locktag, LOCKMODE mode) { VirtualTransactionId *backends; bool lock_acquired = false; int num_attempts = 0; - LOCKTAG locktag; - - SET_LOCKTAG_RELATION(locktag, dbOid, relOid); /* * If blowing away everybody with conflicting locks doesn't work, after @@ -358,7 +361,7 @@ ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid) while (!lock_acquired) { if (++num_attempts < 3) - backends = GetLockConflicts(&locktag, AccessExclusiveLock); + backends = GetLockConflicts(locktag, mode); else backends = GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid); @@ -366,7 +369,7 @@ ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid) ResolveRecoveryConflictWithVirtualXIDs(backends, PROCSIG_RECOVERY_CONFLICT_LOCK); - if (LockAcquireExtended(&locktag, AccessExclusiveLock, true, true, false) + if (LockAcquireExtended(locktag, mode, true, true, false) != LOCKACQUIRE_NOT_AVAIL) lock_acquired = true; } @@ -544,19 +547,18 @@ StandbyTimeoutHandler(void) * this lock, so query access is not allowed at this time". So the Startup * process is the proxy by which the original locks are implemented. * - * We only keep track of AccessExclusiveLocks, which are only ever held by - * one transaction on one relation, and don't worry about lock queuing. + * We only keep track of the primary's AccessExclusiveLocks, which are only + * ever held by one transaction on one relation, and don't worry about lock + * queuing. The startup process however does acquire other locks occasionally + * (c.f. dbase_redo()) - but even there no queuing is possible. * * We keep a single dynamically expandible list of locks in local memory, * RelationLockList, so we can keep track of the various entries made by * the Startup process's virtual xid in the shared lock table. * * We record the lock against the top-level xid, rather than individual - * subtransaction xids. This means AccessExclusiveLocks held by aborted - * subtransactions are not released as early as possible on standbys. - * - * List elements use type xl_rel_lock, since the WAL record type exactly - * matches the information that we need to keep track of. + * subtransaction xids. This means locks held by aborted subtransactions are + * not released as early as possible on standbys. * * We use session locks rather than normal locks so we don't need * ResourceOwners. @@ -564,10 +566,11 @@ StandbyTimeoutHandler(void) void -StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid) +StandbyAcquireLock(TransactionId xid, LOCKTAG *locktag, LOCKMODE mode) { - xl_standby_lock *newlock; - LOCKTAG locktag; + RecoveryLockListEntry *newlock; + + Assert(locktag->locktag_lockmethodid == DEFAULT_LOCKMETHOD); /* Already processed? */ if (!TransactionIdIsValid(xid) || @@ -575,26 +578,18 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid) TransactionIdDidAbort(xid)) return; - elog(trace_recovery(DEBUG4), - "adding recovery lock: db %u rel %u", dbOid, relOid); - - /* dbOid is InvalidOid when we are locking a shared relation. */ - Assert(OidIsValid(relOid)); - - newlock = palloc(sizeof(xl_standby_lock)); + newlock = palloc(sizeof(RecoveryLockListEntry)); newlock->xid = xid; - newlock->dbOid = dbOid; - newlock->relOid = relOid; + newlock->lockmode = mode; + memcpy(&newlock->locktag, locktag, sizeof(LOCKTAG)); RecoveryLockList = lappend(RecoveryLockList, newlock); /* * Attempt to acquire the lock as requested, if not resolve conflict */ - SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid); - - if (LockAcquireExtended(&locktag, AccessExclusiveLock, true, true, false) + if (LockAcquireExtended(locktag, mode, true, true, false) == LOCKACQUIRE_NOT_AVAIL) - ResolveRecoveryConflictWithLock(newlock->dbOid, newlock->relOid); + ResolveRecoveryConflictWithLock(locktag, mode); } static void @@ -610,22 +605,16 @@ StandbyReleaseLocks(TransactionId xid) prev = NULL; for (cell = list_head(RecoveryLockList); cell; cell = next) { - xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell); + RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell); next = lnext(cell); if (!TransactionIdIsValid(xid) || lock->xid == xid) { - LOCKTAG locktag; - - elog(trace_recovery(DEBUG4), - "releasing recovery lock: xid %u db %u rel %u", - lock->xid, lock->dbOid, lock->relOid); - SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid); - if (!LockRelease(&locktag, AccessExclusiveLock, true)) + if (!LockRelease(&lock->locktag, lock->lockmode, true)) elog(LOG, - "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u", - lock->xid, lock->dbOid, lock->relOid); + "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u", + lock->xid); RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev); pfree(lock); @@ -662,25 +651,24 @@ StandbyReleaseAllLocks(void) ListCell *cell, *prev, *next; - LOCKTAG locktag; elog(trace_recovery(DEBUG2), "release all standby locks"); prev = NULL; for (cell = list_head(RecoveryLockList); cell; cell = next) { - xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell); + RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell); next = lnext(cell); elog(trace_recovery(DEBUG4), - "releasing recovery lock: xid %u db %u rel %u", - lock->xid, lock->dbOid, lock->relOid); - SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid); - if (!LockRelease(&locktag, AccessExclusiveLock, true)) + "releasing recovery lock for xid %u", + lock->xid); + + if (!LockRelease(&lock->locktag, lock->lockmode, true)) elog(LOG, - "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u", - lock->xid, lock->dbOid, lock->relOid); + "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u", + lock->xid); RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev); pfree(lock); } @@ -697,12 +685,11 @@ StandbyReleaseOldLocks(int nxids, TransactionId *xids) ListCell *cell, *prev, *next; - LOCKTAG locktag; prev = NULL; for (cell = list_head(RecoveryLockList); cell; cell = next) { - xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell); + RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell); bool remove = false; next = lnext(cell); @@ -735,13 +722,13 @@ StandbyReleaseOldLocks(int nxids, TransactionId *xids) if (remove) { elog(trace_recovery(DEBUG4), - "releasing recovery lock: xid %u db %u rel %u", - lock->xid, lock->dbOid, lock->relOid); - SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid); - if (!LockRelease(&locktag, AccessExclusiveLock, true)) + "releasing recovery lock: xid %u", + lock->xid); + + if (!LockRelease(&lock->locktag, lock->lockmode, true)) elog(LOG, - "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u", - lock->xid, lock->dbOid, lock->relOid); + "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u", + lock->xid); RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev); pfree(lock); } @@ -776,9 +763,16 @@ standby_redo(XLogReaderState *record) int i; for (i = 0; i < xlrec->nlocks; i++) - StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid, - xlrec->locks[i].dbOid, - xlrec->locks[i].relOid); + { + LOCKTAG locktag; + + SET_LOCKTAG_RELATION(locktag, + xlrec->locks[i].dbOid, + xlrec->locks[i].relOid); + StandbyAcquireLock(xlrec->locks[i].xid, + &locktag, + AccessExclusiveLock); + } } else if (info == XLOG_RUNNING_XACTS) { diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 1eb2d4b..02ecf3d 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -710,7 +710,8 @@ LockAcquireExtended(const LOCKTAG *locktag, if (RecoveryInProgress() && !InRecovery && (locktag->locktag_type == LOCKTAG_OBJECT || locktag->locktag_type == LOCKTAG_RELATION) && - lockmode > RowExclusiveLock) + lockmode > RowExclusiveLock && + !sessionLock) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot acquire lock mode %s on database objects while recovery is in progress", @@ -3861,9 +3862,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info, if (lockmode == AccessExclusiveLock && locktag->locktag_type == LOCKTAG_RELATION) { - StandbyAcquireAccessExclusiveLock(xid, - locktag->locktag_field1 /* dboid */ , - locktag->locktag_field2 /* reloid */ ); + StandbyAcquireLock(xid, locktag, AccessExclusiveLock); } } diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h index c32c963..f711281 100644 --- a/src/include/storage/standby.h +++ b/src/include/storage/standby.h @@ -45,7 +45,7 @@ extern void StandbyTimeoutHandler(void); * to make hot standby work. That includes logging AccessExclusiveLocks taken * by transactions and running-xacts snapshots. */ -extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid); +extern void StandbyAcquireLock(TransactionId xid, LOCKTAG *tag, LOCKMODE mode); extern void StandbyReleaseLockTree(TransactionId xid, int nsubxids, TransactionId *subxids); extern void StandbyReleaseAllLocks(void); -- 2.2.1.212.gc5b9256
>From 4b36f4c0e1efecbd50918a2c854bfe6f26105201 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 30 Jan 2015 09:46:15 +0100 Subject: [PATCH 2/4] Fix startup process crash and race during CREATE/DROP/ALTER DATABASE. When replaying database related WAL records the startup process acquires lock on the database object to prevent new connections to the affected database. That's necessary because we currently don't replicate object locks from the primary. Unfortunately that locking couldn't use the standby infrastructure for managing locks and instead just used a normal LockSharedObjectForSession(). While that works in the (common!) uncontended situation, it doesn't if the lock is currently already acquired. The standby process intentionally doesn't set up the infrastructure to wait for locks in the normal, blocking, way. The goal is to cancel other lock holder after max_standby_*_delay, which is only possible if the sleeping is controlled. If we have to wait anyway, we either crash with a assertion or a FATAL (promoted to PANIC) error. There's also another race, which was documented in comments. Namely that not using the standby lock infrastructure requires explicitly releasing the lock. That means there's a short phase where the catalog and physical state of the database are inconsistent. After the previous commit we can now use the standby infrastructure to manage the lock. That requires assigning a transactionid to the create/drop database transactions. To handle WAL generated by a older point release we thus need to make the locking conditional on the relevant records having a transactionid. On the primary WAL format is bumped instead. Additionally add previously missing locking for CREATE DATABASE that could lead to checksum failures due to hint bit writes. Now that we don't do acquire any locks in potentially blocking mode in the startup process anymore, add a assertion ensuring that that stays so in the primary. Backpatch all the way. Discussion: 20150120152819.gc24...@alap3.anarazel.de --- src/backend/commands/dbcommands.c | 71 +++++++++++++++++++++++++++++++-------- src/backend/storage/lmgr/lock.c | 2 ++ 2 files changed, 59 insertions(+), 14 deletions(-) diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 5e66961..3845eb7 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1204,6 +1204,12 @@ movedb(const char *dbname, const char *tblspcname) } /* + * Force xid assignment, for the benefit of dbase_redo()'s internal + * locking. + */ + GetTopTransactionId(); + + /* * Use an ENSURE block to make sure we remove the debris if the copy fails * (eg, due to out-of-disk-space). This is not a 100% solution, because * of the possibility of failure during transaction commit, but it should @@ -1314,6 +1320,12 @@ movedb(const char *dbname, const char *tblspcname) StartTransactionCommand(); /* + * Force xid assignment, for the benefit of dbase_redo()'s internal + * locking. + */ + GetTopTransactionId(); + + /* * Remove files from the old tablespace */ if (!rmtree(src_dbpath, true)) @@ -2053,6 +2065,34 @@ dbase_redo(XLogReaderState *record) dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id); /* + * Can only do the locking if the server generating the record was new + * enough to know it has to assign a xid. + */ + if (InHotStandby && TransactionIdIsValid(XLogRecGetXid(record))) + { + LOCKTAG locktag; + + SET_LOCKTAG_OBJECT(locktag, + InvalidOid, + DatabaseRelationId, + xlrec->src_db_id, + 0); + + /* Lock source database, do avoid concurrent hint bit writes et al. */ + StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock); + ResolveRecoveryConflictWithDatabase(xlrec->src_db_id); + + /* Lock target database, it'll be overwritten in a second */ + SET_LOCKTAG_OBJECT(locktag, + InvalidOid, + DatabaseRelationId, + xlrec->db_id, + 0); + StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock); + ResolveRecoveryConflictWithDatabase(xlrec->src_db_id); + } + + /* * Our theory for replaying a CREATE is to forcibly drop the target * subdirectory if present, then re-copy the source data. This may be * more work than needed, but it is simple to implement. @@ -2086,16 +2126,31 @@ dbase_redo(XLogReaderState *record) dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id); - if (InHotStandby) + /* + * Can only do the locking if the server generating the record was new + * enough to know it has to assign a xid. + */ + if (InHotStandby && TransactionIdIsValid(XLogRecGetXid(record))) { + LOCKTAG locktag; + /* * Lock database while we resolve conflicts to ensure that * InitPostgres() cannot fully re-execute concurrently. This * avoids backends re-connecting automatically to same database, * which can happen in some cases. */ - LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock); + SET_LOCKTAG_OBJECT(locktag, + InvalidOid, + DatabaseRelationId, + xlrec->db_id, + 0); + StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock); ResolveRecoveryConflictWithDatabase(xlrec->db_id); + + /* Lock pg_database, to conflict with base backups */ + SET_LOCKTAG_RELATION(locktag, 0, DatabaseRelationId); + StandbyAcquireLock(XLogRecGetXid(record), &locktag, RowExclusiveLock); } /* Drop pages for this database that are in the shared buffer cache */ @@ -2112,18 +2167,6 @@ dbase_redo(XLogReaderState *record) ereport(WARNING, (errmsg("some useless files may be left behind in old database directory \"%s\"", dst_path))); - - if (InHotStandby) - { - /* - * Release locks prior to commit. XXX There is a race condition - * here that may allow backends to reconnect, but the window for - * this is small because the gap between here and commit is mostly - * fairly small and it is unlikely that people will be dropping - * databases that we are trying to connect to anyway. - */ - UnlockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock); - } } else elog(PANIC, "dbase_redo: unknown op code %u", info); diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 02ecf3d..f051ad6 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -718,6 +718,8 @@ LockAcquireExtended(const LOCKTAG *locktag, lockMethodTable->lockModeNames[lockmode]), errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery."))); + Assert(!InRecovery || (sessionLock && dontWait)); + #ifdef LOCK_DEBUG if (LOCK_DEBUG_ENABLED(locktag)) elog(LOG, "LockAcquire: lock [%u,%u] %s", -- 2.2.1.212.gc5b9256
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers