>Bruce Momjian wrote
> Tom Lane wrote:
> > "Simon Riggs" <[EMAIL PROTECTED]> writes:
> > > [ expecting to finish PITR by early June ]
> >
> > > Is this all still OK for 7.5? (My attempts at cataloguing changes has
> > > fallen by the wayside in concentrating on the more important task of
> > > PITR.) Do we have a planned freeze month yet?
> >
> > There's not really a plan at the moment, but I had June in the back of
> > my head as a good time; it looks to me like the Windows port will be
> > stable enough for beta in another month or two, and it'd be good if
> > PITR were ready to go by then.
> >
> > Is your timeline based on the assumption of doing all the work yourself?
> > If so, how about farming out some of it? I'd be willing to contribute
> > some effort to PITR. (It's been made clear to me that Red Hat really
> > wants PITR in 7.5 ;-))
>
> Agreed! Lets see if we can assemble a team to start coding PITR.
Thank you all for your offers of help. Yes is the short answer; we should be able to cut enough code on independent work streams to get this system testable by the end of April. As I say, I started coding some time back and am well into what I've called Phase 1, so it's probably best for me to complete that. You guys will be contributing by looking at my code anyhow, so your goodwill is certainly going to be called in, don't worry. There isn't anything too hairy code-wise anyhow, if I'm honest. For clarity, Phase 1 will give us the facility to archive xlogs beyond the short lifetime they currently have under the recycling method.

Phase 2 is definitely another matter... There doesn't seem to be any dependency on Phase 1 that I can see. I called it Phase 2 because, yes, I did make the assumption that I was doing it all myself, but I did set off on this journey as a team effort and I welcome that still. I described this piece of work earlier as:

Phase 2: add code to control recovery (to a point in time) - will allow
roll-forward along an extended archive history to a point in time,
disk space permitting

In my earlier summary of all the design contributions there was the idea of a postmaster command-line switch which would make roll-forward recovery stop at the appropriate place. Two switches were discussed:

i) Roll forward to a point in time. This sounds relatively easy... recovery is already there, all you have to do is stop it at the right place... but I haven't looked into the exact mechanism of reading the xlog headers etc. [There's also a little work to do there in putting in hooks for the command mechanism.] Something like postmaster -R "2004/12/10 19:37:04", as a loose example.

ii) Roll forward on all available logs, but shut down at the end, leaving pg in a recovery-pending state (still).
This then gives the DBA a chance to either a) switch in a new batch of xlogs, allowing an infinite sequence of xlogs to be applied one batch at a time, or b) keep a "hot standby" system continually primed and ready to start up should the primary go down.

Neither of those looks too hard to me, so they should be doable by about mid-April, when I'm expecting to have finished the XLogArchive API. As I say, there's no particular dependency on the XLogArchive API stuff all working, since the two pieces can be tested independently, though we must put them together for system testing.

Further tasks (what I had thought of as "Phase 3", but again these can be started now...):

- What to do should a cancel (CTRL-C) be issued during recovery? What state is the database left in?
- How do you say "this is taking too long, I really need my database up now, whatever state it's in"? (That is, while recovery is grinding away - not before you start it, or after it has failed, which is where you would use pg_resetxlog.)
- Can you change your mind on that once it's up and you see what a mess it's in, i.e. put it back into recovery? What would that take - saving clogs? An optional "trial recovery" mode?
- How would we monitor a lengthy recovery? Watch for "starting recovery of log XXX" messages and do some math to work out the finish time, or is there a better way?
- Is it possible to have parallel recovery processes active simultaneously for faster recovery? Can we work anything into the design now that would allow that to be added later?

What I think is really important is a very well coordinated test plan. Perhaps even more importantly, a test plan not written by me, since I might make some dangerous assumptions in writing it. Having a written test plan would allow us to cover all the edge cases that PITR is designed to recover from. It will be pretty hard for most production users of PostgreSQL to fully test PITR, though of course many will "kick the tyres", shall we say, to confirm a full implementation.
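To make the "stop it at the right place" idea from switch i) above concrete, here is a rough sketch in Python. This is not PostgreSQL code - the real redo machinery is C inside the server, and the record structure, function names, and the idea that every record carries a usable commit timestamp are all assumptions for illustration only. It just shows the control flow of replaying until a target time is passed:

```python
from datetime import datetime

def apply_record(record):
    """Placeholder: real redo would modify data pages here."""
    pass

def recover_to_point_in_time(xlog_records, target):
    """Apply records in order, stopping at the first commit
    past the requested recovery target.  Returns the number
    of records actually applied."""
    applied = 0
    for record in xlog_records:
        if record["commit_time"] > target:
            break              # reached the requested point in time
        apply_record(record)   # stand-in for the real redo routine
        applied += 1
    return applied

# Example: recover up to the timestamp given on the command line,
# e.g. postmaster -R "2004/12/10 19:37:04"
target = datetime(2004, 12, 10, 19, 37, 4)
records = [
    {"commit_time": datetime(2004, 12, 10, 19, 36, 0)},
    {"commit_time": datetime(2004, 12, 10, 19, 37, 0)},
    {"commit_time": datetime(2004, 12, 10, 19, 38, 0)},
]
print(recover_to_point_in_time(records, target))  # prints 2
```

The open question noted above - how to read the xlog headers to get at those timestamps - is exactly the part this sketch waves away.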
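On the monitoring question above, the "do some math" approach could be as simple as counting per-segment progress messages and extrapolating linearly. Again a sketch only: the message text and the assumption that segments replay at a roughly constant rate are mine, not actual server behaviour:

```python
import re

# Assumed message format - not actual PostgreSQL log output.
MSG = re.compile(r"starting recovery of log (\d+)")

def estimate_finish(log_lines, total_segments, started_at, now):
    """Linear extrapolation: segments replayed so far vs. elapsed
    time gives a per-segment cost, hence a projected finish time.
    Times are plain numbers (e.g. Unix seconds)."""
    done = sum(1 for line in log_lines if MSG.search(line))
    if done == 0:
        return None            # no progress yet, cannot estimate
    per_segment = (now - started_at) / done
    remaining = total_segments - done
    return now + per_segment * remaining

# Example: 2 of 10 segments replayed in the first 60 seconds,
# so ~30s per segment and 8 segments to go.
lines = ["starting recovery of log 0001",
         "starting recovery of log 0002"]
eta = estimate_finish(lines, total_segments=10,
                      started_at=1000.0, now=1060.0)
print(eta)  # prints 1300.0
```

A real monitor would also need to know the total segment count up front, which the archive listing could supply.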
Many of the tests are not easily automatable, so we can't just dream up some more regression tests. A written plan would also allow coordinated testing to occur across platforms, so a QNX user might spot something that also affects Solaris, etc. Is there any large open source outfit (commercial or non-commercial) that would be able to contribute the time of a test analyst for 10 days to complete this? ;) ...I didn't get a huge response from the community the last time I requested assistance in that area, assuming my post got through.

Overall, I'm confident that we're getting close to this becoming fully real. The main issue historically was that the xlogs didn't contain all that was required for full recovery, but those challenges have already been solved by J.R. and Patrick (subject to full system testing).

Best Regards, Simon Riggs
2nd Quadrant

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if
your joining column's datatypes do not match