On Wed, 2010-05-05 at 09:12 +0300, Heikki Linnakangas wrote:

> I concur that the idea is that we deal at replay with the fact that the
> snapshot lags behind. At replay, any locks/XIDs in the snapshot that
> have already been committed/aborted are ignored. For any locks/XIDs
> taken just after the snapshot was taken, the replay will see the other
> WAL records with that information.
>
> We need to add comments explaining all that.

The attached comments are proposed.

Reviewing this information again to propose a fix for the two other
minor bugs pointed out by Tom shows that they are related and need one
combined fix, which would work like this:

Currently we handle the state STANDBY_INITIALIZED incorrectly. We need
to run RecordKnownAssignedXids() during this mode, so that we both
extend the clog and record known xids. That means that two other callers
of RecordKnownAssignedXids() also need to call it at that time.

In ProcArrayApplyRecoveryInfo() we run KnownAssignedXidsAdd(), though
this will fail if there are existing xids in there, now that it is
sorted. So we need to run
KnownAssignedXidsRemovePreceding(latestObservedXid) to remove extraneous
xids, then extract any xids that remain and add them to the ones
arriving with the running xacts record. We then need to sort the
combined array and re-insert it into KnownAssignedXids.

Previously, I had imagined that the gap between the logical checkpoint
and the physical checkpoint was small. With spread checkpoints this is
no longer the case. So I propose adding a special WAL record that is
inserted during LogStandbySnapshot() immediately before
GetRunningTransactionLocks(), so that we minimise the time window
between deriving snapshot data and recording it in WAL.

Those changes are not especially invasive.

-- 
 Simon Riggs           www.2ndQuadrant.com
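
To make the proposed merge step for ProcArrayApplyRecoveryInfo() concrete, here is a minimal standalone sketch. It is not PostgreSQL code: merge_known_xids() and its parameters are hypothetical names, the xids live in plain arrays rather than the real KnownAssignedXids structure, `cutoff` stands in for whatever value is passed to KnownAssignedXidsRemovePreceding(), and the comparison deliberately ignores xid wraparound. Dropping duplicates is an assumption taken from the patch comment about ignoring duplicates when the snapshot is applied.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Simplified ordering: plain integer compare, no xid wraparound handling. */
static int
xid_cmp(const void *a, const void *b)
{
	TransactionId xa = *(const TransactionId *) a;
	TransactionId xb = *(const TransactionId *) b;

	return (xa > xb) - (xa < xb);
}

/*
 * Hypothetical helper: combine the xids already tracked on the standby with
 * those arriving in the running xacts record.
 *
 * 1. keep only tracked xids that do not precede 'cutoff'
 * 2. append the xids from the running xacts record
 * 3. sort the combined array
 * 4. drop duplicates (an xid may appear in both inputs)
 *
 * 'out' must have room for nknown + nrunning entries.
 * Returns the number of xids written to 'out'.
 */
static size_t
merge_known_xids(const TransactionId *known, size_t nknown,
				 const TransactionId *running, size_t nrunning,
				 TransactionId cutoff, TransactionId *out)
{
	size_t		n = 0;
	size_t		i;
	size_t		last;

	assert(out != NULL);

	for (i = 0; i < nknown; i++)
	{
		if (known[i] >= cutoff)
			out[n++] = known[i];
	}
	for (i = 0; i < nrunning; i++)
		out[n++] = running[i];

	if (n == 0)
		return 0;

	qsort(out, n, sizeof(TransactionId), xid_cmp);

	/* Compact in place, keeping one copy of each xid. */
	last = 0;
	for (i = 1; i < n; i++)
	{
		if (out[i] != out[last])
			out[++last] = out[i];
	}
	return last + 1;
}
```

The result is a sorted, duplicate-free array that could then be re-inserted in one pass, which is the property the sorted KnownAssignedXids needs.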

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index ab4ef62..434fffb 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -86,6 +86,58 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
+	/*
+	 * We can only move directly to STANDBY_SNAPSHOT_READY at startup if we
+	 * start from a shutdown checkpoint. In the case of starting from an
+	 * online checkpoint the situation is more complex and requires a two
+	 * or sometimes a three stage process.
+	 *
+	 * standbyState starts here at STANDBY_INITIALIZED and changes state to
+	 * either STANDBY_SNAPSHOT_PENDING or STANDBY_SNAPSHOT_READY. If we are
+	 * at STANDBY_SNAPSHOT_PENDING state we can only change to
+	 * STANDBY_SNAPSHOT_READY at which we stay until shutdown.
+	 *
+	 * The initial snapshot must contain all running xids and all current
+	 * AccessExclusiveLocks at a point in time on the standby. Assembling
+	 * that information requires many and various LWLocks, so we choose to
+	 * derive that information piece by piece and then re-assemble that info
+	 * on the standby. When that information is fully assembled we move to
+	 * STANDBY_SNAPSHOT_READY.
+	 *
+	 * Since locking on the primary when we derive the information is not
+	 * strict, we note that there is a time window between the derivation and
+	 * writing to WAL of the derived information. That allows race conditions
+	 * that we must resolve, since xids and locks may enter or leave the
+	 * snapshot during that window. This creates the issue that an xid or
+	 * lock may start *after* the snapshot has been derived yet *before* the
+	 * snapshot is logged in the running xacts WAL record. We resolve this by
+	 * starting to accumulate changes at a point immediately before we derive
+	 * the snapshot on the primary and ignore duplicates when we later apply
+	 * the snapshot from the running xacts record. This is implemented during
+	 * CreateCheckpoint() where we use the logical checkpoint location as
+	 * our starting point and then write the running xacts record immediately
+	 * before writing the main checkpoint WAL record. Since we always start
+	 * up from a checkpoint and we are immediately at our starting point,
+	 * we unconditionally move to STANDBY_INITIALIZED. After this point we
+	 * must do 4 things:
+	 *	* move shared nextXid forwards as we see new xids
+	 *	* extend the clog and subtrans with the new xid
+	 *	* keep track of uncommitted known assigned xids
+	 *	* keep track of uncommitted AccessExclusiveLocks
+	 *
+	 * When we see a commit/abort we must remove known assigned xids and locks
+	 * from the completing transaction. Attempted removals that cannot locate
+	 * an entry are expected and must not cause an error when we are in state
+	 * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
+	 * KnownAssignedXidsRemove().
+	 *
+	 * Later, when we apply the running xact data we must be careful to ignore
+	 * transactions already committed, since those commits raced ahead when
+	 * making WAL entries.
+	 *
+	 * XXX We can further optimize LWlocking by keeping track of whether any
+	 * AccessExclusiveLocks exist.
+	 */
 	standbyState = STANDBY_INITIALIZED;
 }
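
The rule in the patch comment about removals that find no entry can be illustrated with a toy model. This is only a sketch, not the actual KnownAssignedXidsRemove() or StandbyReleaseLocks(): toy_remove_xid() and RemoveResult are hypothetical names, and the xids are held in a plain unsorted array. The comment only says a missing entry must be tolerated in STANDBY_INITIALIZED; treating it as an error in every other state is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* State names as used in the patch comment; the enum itself is a toy copy. */
typedef enum
{
	STANDBY_DISABLED,
	STANDBY_INITIALIZED,
	STANDBY_SNAPSHOT_PENDING,
	STANDBY_SNAPSHOT_READY
} HotStandbyState;

/* Hypothetical result codes for this sketch only. */
typedef enum
{
	XID_REMOVED,
	XID_MISSING_IGNORED,
	XID_MISSING_ERROR
} RemoveResult;

/*
 * Remove 'xid' from an unsorted array of tracked xids when a commit/abort
 * record is replayed.
 *
 * In STANDBY_INITIALIZED a missing entry is expected, because we may never
 * have seen the xid being assigned, so it is silently ignored. For all other
 * states this sketch reports an error (an assumption, see the lead-in).
 */
static RemoveResult
toy_remove_xid(HotStandbyState state, TransactionId *xids, size_t *nxids,
			   TransactionId xid)
{
	size_t		i;

	assert(xids != NULL && nxids != NULL);

	for (i = 0; i < *nxids; i++)
	{
		if (xids[i] == xid)
		{
			/* Fill the hole with the last entry; order is not preserved. */
			xids[i] = xids[*nxids - 1];
			(*nxids)--;
			return XID_REMOVED;
		}
	}

	return (state == STANDBY_INITIALIZED) ? XID_MISSING_IGNORED
										  : XID_MISSING_ERROR;
}
```

A caller would treat XID_MISSING_IGNORED as a normal outcome while accumulating changes before the running xacts record has been applied.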