On Tue, Nov 9, 2010 at 7:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus <j...@agliodbs.com> wrote:
>> Robert,
>>
>>> Uh, no it doesn't. It only requires you to be more aggressive about
>>> vacuuming the transactions that are in the aborted-XIDs array. It
>>> doesn't affect transaction wraparound vacuuming at all, either
>>> positively or negatively. You still have to freeze xmins before they
>>> flip from being in the past to being in the future, but that's it.
>>
>> Sorry, I was trying to say that it's similar to the freeze issue, not
>> that it affects freeze. Sorry for the lack of clarity.
>>
>> What I was getting at is that this could cause us to vacuum
>> relations/pages which would otherwise never be vacuumed (or at least,
>> not until freeze). Imagine a very large DW table which is normally
>> insert-only and seldom queried, but once a month or so the insert aborts
>> and rolls back.
>
> Oh, I see. In that case, under the proposed scheme, you'd get an
> immediate vacuum of everything inserted into the table since the last
> failed insert. Everything prior to the last failed insert would be
> OK, since the visibility map bits would already be set for those
> pages. Yeah, that would be annoying.
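The scenario above can be sketched as a toy model (plain Python, not PostgreSQL code; the Heap class and its fields are purely illustrative): vacuum skips any heap page whose all-visible bit is set in the visibility map, so an aborted insert only forces a scan of the pages dirtied since the VM bits were last set.

```python
# Toy model of visibility-map-driven vacuum skipping.  Pages whose
# all-visible bit is set are never revisited; freshly inserted pages
# start with the bit clear.

class Heap:
    def __init__(self):
        self.pages = []            # each page: {"all_visible": bool}

    def insert_pages(self, n):
        """Freshly inserted pages start with their VM bit clear."""
        self.pages.extend({"all_visible": False} for _ in range(n))

    def vacuum(self):
        """Scan only pages with the bit clear; set it once they're clean."""
        scanned = [i for i, p in enumerate(self.pages)
                   if not p["all_visible"]]
        for i in scanned:
            self.pages[i]["all_visible"] = True
        return scanned

heap = Heap()
heap.insert_pages(100)      # a month of clean inserts
heap.vacuum()               # VM bits now set for pages 0..99
heap.insert_pages(10)       # new batch, containing the aborted insert
rescanned = heap.vacuum()   # only the 10 new pages are visited
```

Under this model, one failed insert triggers a scan of everything inserted since the previous vacuum pass, exactly the annoyance described above.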
Ah, but it might be fixable. You wouldn't really need to do a full-fledged vacuum. It would be sufficient to scan the heap pages that might contain the XID we're trying to clean up after, without touching the indexes. Instead of actually removing tuples with an aborted XMIN, you could just mark the line pointers LP_DEAD. Tuples with an aborted XMAX don't require touching the indexes anyway. So as long as you have some idea which segment of the relation was potentially dirtied by that transaction, you could just scan those blocks and update the item pointers and/or XMAX values for the offending tuples without doing anything else (although you'd probably want to opportunistically grab the buffer cleanup lock and defragment if possible).

Unfortunately, I'm now realizing another problem. During recovery, you have to assume that any XIDs that didn't commit are aborted; under the scheme I proposed upthread, if a transaction that was in flight at crash time had begun prior to the last checkpoint, you wouldn't know which relations it had potentially dirtied. Ouch.

But I think this is fixable, too. Let's invent a new on-disk structure called the modified-content log. Transactions that want to insert, update, or delete tuples allocate pages from this structure. The header of each page stores the XID of the transaction that owns that page and the ID of the database to which that transaction is bound. Following the header, there is a series of records of the form: tablespace OID, table OID, starting page number, ending page number. Each such record indicates that the given XID may have put its XID on disk within the given page range of the specified relation. Each checkpoint flushes the dirty pages of the modified-content log to disk along with everything else. Thus, on redo, we can reconstruct the additional entries that need to be added to the log from the contents of WAL subsequent to the redo pointer.
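The bookkeeping side of that structure can be sketched in Python (a toy model only, not the proposed on-disk C format; the class and method names, and the OID values in the usage example, are invented for illustration):

```python
# Toy model of the "modified-content log": per-XID lists of
# (tablespace OID, table OID, starting block, ending block) records,
# each naming a block range the XID may have stamped itself into.
from collections import defaultdict

class ModifiedContentLog:
    def __init__(self):
        self.ranges_by_xid = defaultdict(list)

    def record(self, xid, tsoid, reloid, start_blk, end_blk):
        # xid may have put its XID on disk within this block range
        self.ranges_by_xid[xid].append((tsoid, reloid, start_blk, end_blk))

    def on_commit(self, xid):
        # committed transactions need no cleanup: discard their pages
        self.ranges_by_xid.pop(xid, None)

    def on_abort(self, xid):
        # an aborted transaction's records must stick around until every
        # copy of its XID is eradicated from the relation data files
        return self.ranges_by_xid.get(xid, [])

log = ModifiedContentLog()
log.record(xid=1000, tsoid=1663, reloid=16384, start_blk=0, end_blk=63)
log.record(xid=1000, tsoid=1663, reloid=16385, start_blk=10, end_blk=10)
log.on_commit(999)                # no-op: nothing was logged for 999
to_scan = log.on_abort(1000)      # block ranges cleanup must visit
```

The commit path here mirrors the point made below: a transaction that commits leaves nothing behind in the log, so only aborted transactions ever pin log pages.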
If a transaction commits, we can remove all of its pages from the modified-content log; in fact, if a transaction begins and commits without an intervening checkpoint, the pages never need to hit the disk at all. If a transaction aborts, its modified-content log pages must stick around until we've eradicated any copies of its XID in the relation data files.

We maintain a global value for the oldest aborted XID which is not yet fully cleaned up (let's call this the OldestNotQuiteDeadYetXID). When we see an XID which precedes OldestNotQuiteDeadYetXID, we know it's committed. Otherwise, we check whether the XID precedes the xmin of our snapshot. If it does, we have to check whether the XID is committed or aborted (it must be one or the other). If it does not, we use our snapshot, as now.

Checking XIDs between OldestNotQuiteDeadYetXID and our snapshot's xmin is potentially expensive, but (1) if there aren't many aborted transactions, this case shouldn't arise very often; (2) if the XID turns out to be aborted and we can get an exclusive buffer content lock, we can nuke that copy of the XID to save the next guy the trouble of examining it; and (3) we can maintain a size-limited per-backend cache of this information, which should help in the normal cases where there either aren't that many XIDs that fall into this category or our transaction doesn't see all that many of them.

This also addresses Tom's concern about needing to store all the information in memory, and the need to WAL-log not-yet-cleaned-up XIDs at each checkpoint. You still need to aggressively clean up after aborted transactions, either using our current vacuum mechanism or the "just zap the XIDs" shortcut described above.

(An additional interesting point about this design is that you could potentially also use it to drive vacuum activity for transactions that commit, especially if we were to also store a flag indicating whether each page range contained updates/deletes or only inserts.)
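The three-way lookup rule can be sketched as follows (a simplified toy in Python, not PostgreSQL's actual visibility code: it assumes plain integer XID ordering with no wraparound, ignores subtransactions, and models the clog as a dict; all names are hypothetical):

```python
# Toy sketch of the XID visibility rule described above:
#   xid < OldestNotQuiteDeadYetXID          -> known committed
#   OldestNotQuiteDeadYetXID <= xid < xmin  -> consult the clog
#   otherwise                               -> judge by the snapshot

def xid_is_visible(xid, oldest_nqdy_xid, snapshot_xmin,
                   clog, snapshot_running):
    if xid < oldest_nqdy_xid:
        # every aborted XID this old has been fully cleaned up,
        # so any surviving copy must belong to a committed transaction
        return True
    if xid < snapshot_xmin:
        # the potentially expensive case: committed or aborted,
        # it must be one or the other
        return clog[xid] == "committed"
    # fall back to the snapshot, as now
    return xid not in snapshot_running

clog = {90: "aborted", 95: "committed"}
old_xid_visible = xid_is_visible(50, 80, 100, clog, set())
aborted_visible = xid_is_visible(90, 80, 100, clog, set())
```

The caching and XID-nuking optimizations from points (2) and (3) would sit in the middle branch, shrinking how often the clog lookup is actually paid.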
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers