Simon Riggs wrote:
> On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote:

>> .) Added a new GUC operational_mode, which can be set to either
>>    readwrite or readonly. If it is set to readwrite (the default),
>>    postgres behaves as usual. All the following changes are only
>>    in effect if operational_mode is set to readonly.

> Do we need this? We are already InArchiveRecovery.
If I understand you correctly, you suggest that readonly queries
are always allowed during archive recovery - so upon startup
postgres would step through these states:
  .) Initial recovery (Until we reach a consistent state)
  .) Allow readonly queries
  .) Finish recovery in the background (might mean recovering "forever"
     on a PITR slave)
  .) Allow readwrite queries

My plan was to have a global switch, which lets you choose between
  .) All queries are readonly (Until the next postmaster restart at least),
     but you get background replay
  .) No background replay, but once replay is done, readwrite queries
     can be executed (Just what PG does now).

The main reason why I invented that global switch operational_mode was
to remove the need to switch between readonly mode and readwrite mode
on the fly.
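
To make this concrete, the DBA would simply set the switch in postgresql.conf
before starting the postmaster, along these lines (assuming the GUC ends up
being settable only at postmaster start, which is my current plan):

  # operational_mode = 'readwrite'    # the default, normal operation
  operational_mode = 'readonly'       # readonly queries, background replay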

.) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR)
    if postgre is not in readwrite mode. This macro protects the
    following functions to make sure that no writes occur in
    readonly mode.
      SimpleLruWritePage, SLruPhysicalWritePage
      EndPrepare, FinishPreparedTransaction
      XLogInsert, XLogWrite, ShutdownXLog
      CreateCheckpoint
      MarkBufferDirty.

> These are Asserts?
The macro ASSUME_OPMODE_READWRITE just does
if (!OperationalModeReadWrite)
  elog(ERROR, ...)
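
Spelled out as a statement macro it would look roughly like this (the error
message here is just a placeholder):

  #define ASSUME_OPMODE_READWRITE \
      do { \
          if (!OperationalModeReadWrite) \
              elog(ERROR, "operation not permitted in readonly mode"); \
      } while (0)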

>>  .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay
>>     is started, and it takes over the role that bgwriter plays in the
>>     shutdown process.

> Autovacuum -> understood.

> What does bgreplay do? Why not just start bgwriter earlier and disable
> some of its other functionality while InRecovery?
See above - it seemed simpler to clearly separate the two cases.

>>  .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
>>     is considered to be "later" than any other xid.

> So you are bumping FirstNormalTransactionId up by one for this?
In fact I changed MaxTransactionId to 0xFFFFFFFE, and set
ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed
IsNormalTransactionId to test not only for >= FirstNormalTransactionId,
but also for <= MaxTransactionId.
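
In terms of the definitions in transam.h that amounts to roughly the
following (a sketch, using the names as above):

  #define MaxTransactionId      ((TransactionId) 0xFFFFFFFE)
  #define ReadOnlyTransactionId ((TransactionId) 0xFFFFFFFF)

  /* a normal xid now has to lie between the two limits */
  #define IsNormalTransactionId(xid) \
      ((xid) >= FirstNormalTransactionId && (xid) <= MaxTransactionId)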

> You're assuming then that we will "freeze" replay while we run a query?
No. My plan is to first get to a point where replay freezes while
queries are running, and to then figure out a more intelligent way to do
this. I already have a few ideas how to do that, but I want to complete
the "simple version" before I start with that work.

> Otherwise doing this will mean the snapshot changes as a query executes.
Why? It's only the xid of the transaction, not its xmin and xmax, that is
set to ReadOnlyTransactionId.
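
The snapshot's own fields instead come from the shared ReadOnlySnapshot quoted
below, roughly like this (names and locking here are only placeholders):

  Snapshot
  GetReadonlySnapshotData(Snapshot snapshot)
  {
      /* copy the replay-maintained shared snapshot into backend-local
       * memory; ReadOnlySnapshotLock and sharedReadOnlySnapshot are
       * placeholder names */
      LWLockAcquire(ReadOnlySnapshotLock, LW_SHARED);
      snapshot->xmin = sharedReadOnlySnapshot->xmin;
      snapshot->xmax = sharedReadOnlySnapshot->xmax;
      snapshot->xcnt = sharedReadOnlySnapshot->xcnt;
      memcpy(snapshot->xip, sharedReadOnlySnapshot->xip,
             snapshot->xcnt * sizeof(TransactionId));
      LWLockRelease(ReadOnlySnapshotLock);
      return snapshot;
  }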

>>  .) A global ReadOnlySnapshot is maintained in shared memory. This is
>>     copied into backend local memory by GetReadonlySnapshotData (which
>>     replaces GetSnapshotData in readonly mode).
>>  .) Crash recovery is not performed in readonly mode - instead, postgres
>>     PANICs, and tells the DBA to restart in readwrite mode. Archive
>>     recovery of course *will* be allowed, but I'm not that far yet.

> This is the very heart of the matter. This isn't just a technical issue,
> it goes to the heart of the use case for this feature. Can we recover
> while running queries?
Yes. My comment applies only to crash recovery - i.e., recovery that happens
*without* a recovery.conf present, after a crash.
It only really matters if you do the following
  .) Start pg in readwrite mode.
  .) Kill it / It crashes
  .) Restart in readonly mode.

The main difference between crash recovery and recovery from a filesystem-level
backup is the additional information that the backup label gives us in the
second case - more specifically, the minRecoveryLoc that we read from the
backup label. Only with that knowledge is "recovering until we reach
a consistent state" a well-defined operation. And readonly queries
can only be executed *after* we did this "minimal recovery". So if there is
crash recovery to be done, the best we could do is to recover, and then start
in readonly mode. If this is *really* what the DBA wants, he can just start
in readwrite mode first, then cleanly shut PG down, and restart in readonly
mode.
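
The check itself would be little more than something like the following near
the start of recovery (a sketch - exact placement and wording are not final):

  /* in StartupXLOG(), once we know this is crash, not archive, recovery */
  if (InRecovery && !InArchiveRecovery && !OperationalModeReadWrite)
      ereport(PANIC,
              (errmsg("crash recovery is not possible in readonly mode"),
               errhint("Restart the server in readwrite mode, shut it down "
                       "cleanly, and then restart it in readonly mode.")));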

> If not, how much time will we spend in replay
> mode v query mode? Will we be able to run long running queries *and*
> maintain a reasonable time to recover? Is this a mechanism for providing
> HA and additional query capacity, or is it just a mechanism for
> additional query capacity only? Those are open questions to which I
> don't have any answers yet myself.
My goal is to allow replay and queries to run concurrently, at least as
long as only inserts, updates and deletes to non-system tables are replayed
(Which I assume is 99% of the wal-traffic on a typical production pg
installation).

> Will we switch back and forth between replay and query mode? Do we
> connect to the master, or to the slave?
> If we connect to the slave will we accept new queries when in replay
> mode and pause them before we switch back to query mode?

*If* the replay process has to pause query execution, it will
acquire some global lock in write mode. That same lock will be acquired
in read mode if a client starts a transaction. So waiting for the recovery
process will "feel" just the same as waiting for another query to release
a lock.
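
As a sketch, the interaction would look like this (whether this ends up being
an lwlock or a regular lmgr lock is still open; the lock name is a placeholder):

  /* replay process, before applying records that require exclusive access */
  LWLockAcquire(ReplayPauseLock, LW_EXCLUSIVE);
  /* ... apply the problematic WAL records ... */
  LWLockRelease(ReplayPauseLock);

  /* each backend, for the duration of a readonly transaction */
  LWLockAcquire(ReplayPauseLock, LW_SHARED);
  /* ... execute the transaction ... */
  LWLockRelease(ReplayPauseLock);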

>> Open Problems:
>> --------------
>>  .) Protecting MarkBufferDirty with ASSUME_OPMODE_READWRITE is troublesome,
>>     because callers usually call MarkBufferDirty from within a critical
>>     section, and thus elog(ERROR) is turned into elog(PANIC). This e.g.
>>     happens with my patch if you call nextval() in readonly mode.
>>     Does anyone see a better solution than adding checks into
>>     all callers that are not otherwise protected from being called
>>     in readonly mode?

> Do we need to do this at all?
We need a waterproof protection against writes in readonly mode. The current
"set transaction readonly" is far away from being that - it e.g. allows
nextval(). So protecting MarkBufferDirty seems like a good last-resort
protection - even a PANIC is *much* better than just silently changing data
on the slave, I think.
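
To illustrate where the PANIC comes from: the guard sits roughly like below at
the top of MarkBufferDirty, and since callers usually have already entered a
critical section (START_CRIT_SECTION()), elog escalates the ERROR to a PANIC:

  void
  MarkBufferDirty(Buffer buffer)
  {
      /* last-resort write protection; inside a critical section
       * elog() turns this ERROR into a PANIC */
      ASSUME_OPMODE_READWRITE;

      /* ... unchanged body of MarkBufferDirty ... */
  }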

>>  .) Since the slave needs to track a Snapshot in shared memory, it cannot
>>     resize that snapshot to accommodate however many concurrent transactions
>>     might have been running on the master. My current plan is to detect if
>>     that global snapshot overflows, and to lock out readonly queries on the
>>     slave (and therefore remove the need of keeping the snapshot current)
>>     until the number of active xids on the master has dropped below
>>     max_connections on the slave. A warning will be written to the postgres
>>     log that suggests that the DBA increase the max_connections value on
>>     the slave.

> Sized according to max_connections on the master?
Well, but how do I get that value? And the master might be restarted with
a different max_connections value while the slave keeps running...
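
The overflow handling I have in mind looks roughly like this (all names here
are placeholders; MaxBackends on the slave is derived from its
max_connections):

  /* replay process, while maintaining the shared snapshot */
  if (nxids_active_on_master > MaxBackends)
  {
      /* can't track them all - lock out readonly queries for now */
      sharedReadOnlySnapshot->overflowed = true;
      ereport(WARNING,
              (errmsg("shared readonly snapshot overflowed"),
               errhint("Consider raising max_connections on the slave.")));
  }
  else
      sharedReadOnlySnapshot->overflowed = false;  /* queries allowed again */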

greetings, Florian Pflug
