This is an attempt to flesh out the ideas of my earlier proposal (http://fts.postgresql.org/db/mw/msg.html?mid=67786) into a complete description of how things should work. It's still largely the same proposal, but I have adopted a couple of the ideas from the followup discussion --- notably Vadim's suggestion that pg_log should be divided into segments.

I still think that expanding transaction IDs (XIDs) to 8 bytes is no help. Aside from portability and performance issues, allowing pg_log to grow without bound just isn't going to do. So, the name of the game is to recycle XIDs in an appropriate fashion.

The intent of this design is to allow XID recycling in a true 24x7 environment (i.e., postmaster uptime must be able to exceed 4G transactions --- no "restart" events are required). This plan still depends on periodic VACUUMs, which was a point some people didn't like the last time around. However, given that we now have a lightweight VACUUM that's meant to be run frequently, I don't think this is a significant objection anymore.

Here's the plan:

1. XIDs will be divided into two kinds, "permanent" and "normal". There will be just two permanent XIDs: BootstrapXID (= 1) and FrozenXID (= 2). (Actually we could get along with only one permanent XID, but it seems useful for debugging to distinguish the bootstrap XID from transactions frozen later.) All XIDs >= 3 are "normal". The XID generator starts at 3, and wraps around to 3 when it overflows its 32-bit limit.

2. Permanent XIDs are always considered committed and are older than all normal XIDs. Two normal XIDs are compared using modulo-2^31 arithmetic, i.e., x < y if ((int32) (y - x)) > 0. This will work as long as no normal XID survives in the database for more than 2^31 (2 billion) transactions; if one did, it would suddenly appear to be "in the future" and thus be considered uncommitted.
To allow a tuple to live for more than 2 billion transactions, we must replace its xmin with a permanent XID sometime before its initial normal XID expires. FrozenXID is used for this purpose.

3. VACUUM will have the responsibility of replacing old normal XIDs with FrozenXID. It will do this whenever a committed-good tuple has xmin less than a cutoff XID. (There is no need to replace xmax, since if xmax is committed good then the tuple itself will be removed.) The cutoff XID could be anything less than XmaxRecent (the oldest XID that might be considered still running by any current transaction). I believe that by default it ought to be pretty old, say 1 billion transactions in the past. This avoids expending I/O to update tuples that are unlikely to live long; furthermore, keeping real XIDs around for some period of time is useful for debugging.

4. To make this work, VACUUM must be run on every table at least once every billion transactions. To help keep track of this maintenance requirement, we'll add two columns to pg_database. Upon successful completion of a database-wide (all tables) VACUUM, VACUUM will update the current database's row in pg_database with the cutoff XID and the XmaxRecent XID that it used. Inspection of pg_database will then show which databases are in need of re-vacuuming. The use of the XmaxRecent entry will be explained below.

5. There should be a VACUUM option ("VACUUM FREEZE", unless someone can come up with a better name) that causes the cutoff XID to be exactly XmaxRecent, not far in the past. Running VACUUM FREEZE in an otherwise idle database guarantees that every surviving tuple is frozen. I foresee two uses for this:

   A. Doing VACUUM FREEZE at completion of initdb ensures that template1 and template0 will have no unfrozen tuples. This is particularly critical for template0, since ordinarily one couldn't connect to it to vacuum it.

   B. VACUUM FREEZE would be useful for pg_upgrade, since pg_log is no longer critical data after a FREEZE.

6.
The oldest XmaxRecent value shown in pg_database tells us how far back pg_log is interesting: we know that all tuples with XIDs older than that are marked committed in the database, so we won't be probing pg_log to verify their commit status anymore. Therefore, we can discard pg_log entries older than that. To make this feasible, we should change pg_log to a segmented representation, with segments of say 256KB (1 million transaction entries). A segment can be discarded once oldest-XmaxRecent advances past it. At completion of a database-wide VACUUM, in addition to updating our own pg_database row, we'll scan the other rows to determine the oldest XmaxRecent, and then remove no-longer-needed pg_log segments.

7. A small difficulty with the above is that if template0 is never vacuumed after initdb, its XmaxRecent entry would always be the oldest and would keep us from discarding any of pg_log. A brute-force answer is to ignore template0 while calculating oldest-XmaxRecent, but perhaps someone can see a cleaner way.

8. Currently, pg_log is accessed through the buffer manager as if it were an ordinary relation. It seems difficult to continue this approach if we are going to allow removal of segments before the current segment (md.c will not be happy with that). Instead, I plan to build a special access mechanism for pg_log that will buffer a few pages of pg_log in shared memory. I think much of this can be lifted verbatim from the WAL access code.

9. WAL redo for pg_log updates will be handled like this: (A) Whenever a transaction is assigned the first XID in a new pg_log page's worth of XIDs, we will allocate and zero that page of pg_log, and enter a record into the WAL that reports having done so. (We must do this while holding the XidGenLock lock, which is annoying, but it only happens once every 32K transactions. Note that no actual I/O need happen; we are just zeroing a buffer in shared memory and emitting a WAL record.)
Now, before any transaction commit can modify that page of pg_log, we are guaranteed that the zero-the-page WAL entry will be flushed to disk. On crash and restart, we re-zero the page when we see the zeroing WAL entry, and then reapply the transaction commit and abort operations shown later in the WAL. AFAICS we do not need to maintain page LSN or SUI information for pg_log pages if we do it this way (Vadim, do you agree?). NOTE: unless I'm missing something, 7.1's WAL code fails to guard against loss of pg_log pages at all, so this should provide an improvement in reliability.

10. It'd be practical to keep a checksum on pg_log pages with this implementation (the checksum would be updated just before writing out a pg_log page, and checked on read). Not sure if this is worth doing, but given the critical nature of pg_log it might be a good idea.

Things to do later
------------------

It's looking more and more like an automatic VACUUM scheduler would be a good idea --- aside from the normal use of VACUUM for space reclamation, the scheduler could be relied on to dispatch VACUUM to databases whose cutoff XID was getting too far back. I don't think I have time to work on this for 7.2, but it should be a project for 7.3.

			regards, tom lane