This is an attempt to flesh out the ideas of my earlier proposal (http://fts.postgresql.org/db/mw/msg.html?mid=67786) into a complete description of how things should work. It's still largely the same proposal, but I have adopted a couple of the ideas from the followup discussion --- notably Vadim's suggestion that pg_log should be divided into segments.

I still think that expanding transaction IDs (XIDs) to 8 bytes is no help. Aside from portability and performance issues, allowing pg_log to grow without bound just isn't going to do. So, the name of the game is to recycle XIDs in an appropriate fashion.

The intent of this design is to allow XID recycling in a true 24x7 environment (i.e., postmaster uptime must be able to exceed 4G transactions --- no "restart" events are required). This plan still depends on periodic VACUUMs, which was a point some people didn't like the last time around. However, given that we now have a lightweight VACUUM that's meant to be run frequently, I don't think this is a significant objection anymore.

Here's the plan:

1. XIDs will be divided into two kinds, "permanent" and "normal". There will be just two permanent XIDs: BootstrapXID (= 1) and FrozenXID (= 2). (Actually we could get along with only one permanent XID, but it seems useful for debugging to distinguish the bootstrap XID from transactions frozen later.) All XIDs >= 3 are "normal". The XID generator starts at 3, and wraps around to 3 when it overflows its 32-bit limit.

2. Permanent XIDs are always considered committed and are older than all normal XIDs. Two normal XIDs are compared using modulo-2^31 arithmetic, i.e., x < y if ((int32) (y - x)) > 0. This will work as long as no normal XID survives in the database for more than 2^31 (2 billion) transactions; if one did, it would suddenly appear to be "in the future" and thus be considered uncommitted.
To allow a tuple to live for more than 2 billion transactions, we must replace its xmin with a permanent XID sometime before its initial normal XID expires. FrozenXID is used for this purpose.

3. VACUUM will have the responsibility of replacing old normal XIDs with FrozenXID. It will do this whenever a committed-good tuple has xmin less than a cutoff XID. (There is no need to replace xmax, since if xmax is committed good then the tuple itself will be removed.) The cutoff XID could be anything less than XmaxRecent (the oldest XID that might be considered still running by any current transaction). I believe that by default it ought to be pretty old, say 1 billion transactions in the past. This avoids expending I/O to update tuples that are unlikely to live long; furthermore, keeping real XIDs around for some period of time is useful for debugging.

4. To make this work, VACUUM must be run on every table at least once every billion transactions. To help keep track of this maintenance requirement, we'll add two columns to pg_database. Upon successful completion of a database-wide (all tables) VACUUM, VACUUM will update the current database's row in pg_database with the cutoff XID and the XmaxRecent XID that it used. Inspection of pg_database will then show which databases are in need of re-vacuuming. The use of the XmaxRecent entry will be explained below.

5. There should be a VACUUM option ("VACUUM FREEZE", unless someone can come up with a better name) that causes the cutoff XID to be exactly XmaxRecent, not far in the past. Running VACUUM FREEZE in an otherwise idle database guarantees that every surviving tuple is frozen. I foresee two uses for this:

   A. Doing VACUUM FREEZE at completion of initdb ensures that template1 and template0 will have no unfrozen tuples. This is particularly critical for template0, since ordinarily one couldn't connect to it to vacuum it.

   B. VACUUM FREEZE would be useful for pg_upgrade, since pg_log is no longer critical data after a FREEZE.

6.
The oldest XmaxRecent value shown in pg_database tells us how far back pg_log is interesting: we know that all tuples with XIDs older than that are marked committed in the database, so we won't be probing pg_log to verify their commit status anymore. Therefore, we can discard pg_log entries older than that. To make this feasible, we should change pg_log to a segmented representation, with segments of say 256KB (1 million transaction entries). A segment can be discarded once oldest-XmaxRecent advances past it. At completion of a database-wide VACUUM, in addition to updating our own pg_database row, we'll scan the other rows to determine the oldest XmaxRecent, and then remove no-longer-needed pg_log segments.

7. A small difficulty with the above is that if template0 is never vacuumed after initdb, its XmaxRecent entry would always be the oldest and would keep us from discarding any of pg_log. A brute-force answer is to ignore template0 while calculating oldest-XmaxRecent, but perhaps someone can see a cleaner way.

8. Currently, pg_log is accessed through the buffer manager as if it were an ordinary relation. It seems difficult to continue this approach if we are going to allow removal of segments before the current segment (md.c will not be happy with that). Instead, I plan to build a special access mechanism for pg_log that will buffer a few pages of pg_log in shared memory. I think much of this can be lifted verbatim from the WAL access code.

9. WAL redo for pg_log updates will be handled like this: (A) Whenever a transaction is assigned the first XID in a new pg_log page's worth of XIDs, we will allocate and zero that page of pg_log, and enter a record into the WAL that reports having done so. (We must do this while holding the XidGenLock lock, which is annoying, but it only happens once every 32K transactions. Note that no actual I/O need happen; we are just zeroing a buffer in shared memory and emitting a WAL record.)
Now, before any transaction commit can modify that page of pg_log, we are guaranteed that the zero-the-page WAL entry will be flushed to disk. On crash and restart, we re-zero the page when we see the zeroing WAL entry, and then reapply the transaction commit and abort operations shown later in the WAL. AFAICS we do not need to maintain page LSN or SUI information for pg_log pages if we do it this way (Vadim, do you agree?). NOTE: unless I'm missing something, 7.1's WAL code fails to guard against loss of pg_log pages at all, so this should provide an improvement in reliability.

10. It'd be practical to keep a checksum on pg_log pages with this implementation (the checksum would be updated just before writing out a pg_log page, and checked on read). Not sure if this is worth doing, but given the critical nature of pg_log it might be a good idea.

Things to do later
------------------

It's looking more and more like an automatic VACUUM scheduler would be a good idea --- aside from the normal use of VACUUM for space reclamation, the scheduler could be relied on to dispatch VACUUM to databases whose cutoff XID was getting too far back. I don't think I have time to work on this for 7.2, but it should be a project for 7.3.

			regards, tom lane