Re: [HACKERS] LogStandbySnapshot (was another thread)

Simon Riggs Thu, 06 May 2010 01:04:47 -0700

On Wed, 2010-05-05 at 09:12 +0300, Heikki Linnakangas wrote:

> I concur that the idea is that we deal at replay with the fact that the
> snapshot lags behind. At replay, any locks/XIDs in the snapshot that
> have already been committed/aborted are ignored. For any locks/XIDs
> taken just after the snapshot was taken, the replay will see the other
> WAL records with that information.
> 
> We need to add comments explaining all that.


The attached comments are proposed.

Reviewing this information again to propose a fix for the two other
minor bugs pointed out by Tom shows that they are related and need one
combined fix, which would work like this:

Currently we handle the state STANDBY_INITIALIZED incorrectly. We need
to run RecordKnownAssignedXids() while in this state, so that we both
extend the clog and record known xids. That means that the two other
callers of RecordKnownAssignedXids() also need to call it in that state.

In ProcArrayApplyRecoveryInfo() we run KnownAssignedXidsAdd(), though
this will fail if there are already xids in there, now that the array is
kept sorted. So we need to run
KnownAssignedXidsRemovePreceding(latestObservedXid) to remove extraneous
xids, then extract any xids that remain and add them to the ones
arriving with the running xacts record. We then need to sort the
combined array and re-insert it into KnownAssignedXids.
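
As a rough illustration of that merge step only, here is a standalone
sketch. It is not the proposed patch: merge_known_xids(), xid_precedes()
and the cutoff parameter are hypothetical names, and it assumes all xids
involved lie within 2^31 of each other so that a wraparound-aware
comparison gives a consistent ordering. It also drops duplicates, which
is an assumption about how overlapping entries should be handled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;     /* 32-bit xid, as in PostgreSQL */

/* Wraparound-aware "a precedes b" (hypothetical helper). */
static int
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* qsort comparator built on the wraparound-aware ordering. */
static int
xid_cmp(const void *pa, const void *pb)
{
    TransactionId a = *(const TransactionId *) pa;
    TransactionId b = *(const TransactionId *) pb;

    if (a == b)
        return 0;
    return xid_precedes(a, b) ? -1 : 1;
}

/*
 * Hypothetical sketch: drop already-tracked xids that precede cutoff,
 * append the xids arriving with the running xacts record, sort the
 * combined array and remove duplicates. out must have room for
 * ntracked + nrunning entries. Returns the number of xids left in out.
 */
size_t
merge_known_xids(const TransactionId *tracked, size_t ntracked,
                 const TransactionId *running, size_t nrunning,
                 TransactionId cutoff, TransactionId *out)
{
    size_t      n = 0;
    size_t      uniq;
    size_t      i;

    assert(out != NULL);

    /* keep only the tracked xids that are not older than the cutoff */
    for (i = 0; i < ntracked; i++)
        if (!xid_precedes(tracked[i], cutoff))
            out[n++] = tracked[i];

    /* add everything that arrived with the running xacts record */
    for (i = 0; i < nrunning; i++)
        out[n++] = running[i];

    if (n == 0)
        return 0;

    qsort(out, n, sizeof(TransactionId), xid_cmp);

    /* collapse adjacent duplicates so the result can be inserted in order */
    uniq = 1;
    for (i = 1; i < n; i++)
        if (out[i] != out[uniq - 1])
            out[uniq++] = out[i];

    return uniq;
}
```

The result is a sorted, duplicate-free array, which is the form an
ordered re-insert into KnownAssignedXids would need.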

Previously, I had imagined that the gap between the logical checkpoint
and the physical checkpoint was small. With spread checkpoints this
isn't the case any longer. So I propose adding a special WAL record that
is inserted during LogStandbySnapshot() immediately before
GetRunningTransactionLocks(), so that we minimise the time window
between deriving snapshot data and recording it in WAL.

Those changes are not especially invasive.

-- 
 Simon Riggs           www.2ndQuadrant.com

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index ab4ef62..434fffb 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -86,6 +86,58 @@ InitRecoveryTransactionEnvironment(void)
 	vxid.localTransactionId = GetNextLocalTransactionId();
 	VirtualXactLockTableInsert(vxid);
 
+	/*
+	 * We can only move directly to STANDBY_SNAPSHOT_READY at startup if we
+	 * start from a shutdown checkpoint. In the case of starting from an
+	 * online checkpoint the situation is more complex and requires a two
+	 * or sometimes a three stage process.
+	 *
+	 * standbyState starts here at STANDBY_INITIALIZED and changes state to
+	 * either STANDBY_SNAPSHOT_PENDING or STANDBY_SNAPSHOT_READY. If we are
+	 * at STANDBY_SNAPSHOT_PENDING state we can only change to
+	 * STANDBY_SNAPSHOT_READY at which we stay until shutdown.
+	 * 
+	 * The initial snapshot must contain all running xids and all current
+	 * AccessExclusiveLocks at a point in time on the standby. Assembling
+	 * that information requires many and various LWLocks, so we choose to
+	 * derive that information piece by piece and then re-assemble that info
+	 * on the standby. When that information is fully assembled we move to
+	 * STANDBY_SNAPSHOT_READY.
+	 *
+	 * Since locking on the primary when we derive the information is not
+	 * strict, we note that there is a time window between the derivation and
+	 * writing to WAL of the derived information. That allows race conditions
+	 * that we must resolve, since xids and locks may enter or leave the
+	 * snapshot during that window. This creates the issue that an xid or
+	 * lock may start *after* the snapshot has been derived yet *before* the
+	 * snapshot is logged in the running xacts WAL record. We resolve this by
+	 * starting to accumulate changes at a point immediately before we derive
+	 * the snapshot on the primary and ignore duplicates when we later apply
+	 * the snapshot from the running xacts record. This is implemented during
+	 * CreateCheckpoint() where we use the logical checkpoint location as
+	 * our starting point and then write the running xacts record immediately
+	 * before writing the main checkpoint WAL record. Since we always start
+	 * up from a checkpoint, we are immediately at our starting point, so
+	 * we unconditionally move to STANDBY_INITIALIZED. After this point we
+	 * must do 4 things:
+	 *  * move shared nextXid forwards as we see new xids
+	 *  * extend the clog and subtrans with the new xid
+	 *  * keep track of uncommitted known assigned xids
+	 *  * keep track of uncommitted AccessExclusiveLocks
+	 *
+	 * When we see a commit/abort we must remove known assigned xids and locks
+	 * from the completing transaction. Attempted removals that cannot locate
+	 * an entry are expected and must not cause an error when we are in state
+	 * STANDBY_INITIALIZED. This is implemented in StandbyReleaseLocks() and
+	 * KnownAssignedXidsRemove().
+	 * 
+	 * Later, when we apply the running xact data we must be careful to ignore
+	 * transactions already committed, since those commits raced ahead when
+	 * making WAL entries.
+	 *
+	 * XXX We can further optimize LWlocking by keeping track of whether any
+	 * AccessExclusiveLocks exist.
+	 */
 	standbyState = STANDBY_INITIALIZED;
 }
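
The state transitions described in the comment above can be summarised
in a small standalone sketch. This is illustrative only and not part of
the patch: standby_transition_allowed() is a hypothetical name, and the
enum here lists only the three states the comment mentions.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the states named in the comment; the real enum may have more. */
typedef enum
{
    STANDBY_INITIALIZED,
    STANDBY_SNAPSHOT_PENDING,
    STANDBY_SNAPSHOT_READY
} HotStandbyState;

/*
 * Hypothetical check of the transitions the comment describes:
 * INITIALIZED may move to PENDING or READY, PENDING may only move
 * to READY, and READY is kept until shutdown.
 */
bool
standby_transition_allowed(HotStandbyState from, HotStandbyState to)
{
    switch (from)
    {
        case STANDBY_INITIALIZED:
            return to == STANDBY_SNAPSHOT_PENDING ||
                   to == STANDBY_SNAPSHOT_READY;
        case STANDBY_SNAPSHOT_PENDING:
            return to == STANDBY_SNAPSHOT_READY;
        case STANDBY_SNAPSHOT_READY:
            return false;       /* terminal until shutdown */
    }
    return false;
}
```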
