On Oct 25, 2011, at 13:39, Florian Pflug wrote:
> On Oct 25, 2011, at 11:13, Simon Riggs wrote:
>> On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>>> We are starting recovery at the right place but we are initialising
>>> the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
>>> being derived later than it should be, which can cause problems if
>>> this then means that whole pages are uninitialised in subtrans. The bug
>>> only shows up if you do enough transactions (2048 is always enough) to
>>> move to the next subtrans page between the redo pointer and the
>>> checkpoint record while at the same time we do not have a long running
>>> transaction that spans those two points. That's just enough to happen
>>> reasonably frequently on busy systems and yet just enough to have
>>> slipped through testing.
>>>
>>> We must either
>>>
>>> 1. During CreateCheckpoint() we should derive oldestActiveXid before
>>> we derive the redo location
>
>> (1) looks the best way forwards in all cases.
>
> Let me see if I understand this
>
> The problem seems to be that we currently derive oldestActiveXid at the end of
> the checkpoint, just before writing the checkpoint record. Since we use
> oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before
> that checkpoint record (but after the REDO location, of course) may very well
> contain XIDs earlier than that wrongly derived oldestActiveXID, and if we
> attempt to touch these XIDs' SUBTRANS state, we error out.
>
> Your patch seems sensible, because the checkpoint "logically" occurs at the
> REDO location, not the checkpoint's location, so we ought to log an
> oldestActiveXID corresponding to that location.

Thinking about this some more (and tracing through the code), I realized that
things are a bit more complicated. What we actually need to ensure, I think, is
that the XID we pass to StartupSUBTRANS() is earlier than any top-level XID in
XLOG_XACT_ASSIGNMENT records. Which, at first glance, implies that we ought to
use the nextId at the *beginning* of the checkpoint for SUBTRANS initialization.

At second glance, however, that'd be wrong, because backends emit
XLOG_XACT_ASSIGNMENT only once every PGPROC_MAX_CACHED_SUBXIDS sub-XID
assignments. Thus, an XLOG_XACT_ASSIGNMENT written *after* the checkpoint has
started may contain sub-XIDs which were assigned *before* the checkpoint
started.

Using oldestActiveXID works around that because we guarantee that sub-XIDs are
always larger than their parent XIDs and because only active transactions can
produce XLOG_XACT_ASSIGNMENT records.

So your patch is fine, but I think the reasoning about why oldestActiveXID is
the correct value for StartupSUBTRANS() deserves an explanation somewhere.

best regards,
Florian Pflug
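
To make the argument concrete, here is a minimal standalone sketch, not
PostgreSQL source. It assumes 8 kB pages with one 4-byte parent XID per entry
(which is where the figure of 2048 XIDs per subtrans page comes from), ignores
XID wraparound, and uses a hypothetical helper assignment_is_replayable() that
asks whether every sub-XID in an assignment record falls on a subtrans page at
or after the page of the XID that SUBTRANS was started from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumed layout: 8 kB pages, one 4-byte parent XID per subtrans entry. */
#define BLCKSZ 8192
#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId))

/* Subtrans page that holds the entry for xid (wraparound ignored). */
static uint32_t
xid_to_page(TransactionId xid)
{
    return (uint32_t) (xid / SUBTRANS_XACTS_PER_PAGE);
}

/*
 * Hypothetical helper: returns true if replaying an assignment record for
 * top_xid with the given sub-XIDs only touches subtrans pages at or after
 * the page of start_xid, the XID that SUBTRANS was initialized from.
 */
static bool
assignment_is_replayable(TransactionId start_xid, TransactionId top_xid,
                         const TransactionId *subxids, size_t nsubxids)
{
    for (size_t i = 0; i < nsubxids; i++)
    {
        /* The invariant the argument relies on: sub-XIDs follow their parent. */
        assert(subxids[i] > top_xid);

        if (xid_to_page(subxids[i]) < xid_to_page(start_xid))
            return false;   /* would touch a page that was never initialized */
    }
    return true;
}
```

With a still-active top-level XID 2000 and sub-XIDs 2001, 2040 and 2064,
starting from oldestActiveXID = 2000 covers page 0, so the record can be
replayed. Starting from a next-XID value of 2100 taken at the beginning of
the checkpoint begins at page 1 and misses the entries for 2001 and 2040.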