On Tue, Nov 9, 2010 at 7:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus <j...@agliodbs.com> wrote:
>> Robert,
>>
>>> Uh, no it doesn't. It only requires you to be more aggressive about
>>> vacuuming the transactions that are in the aborted-XIDs array. It
>>> doesn't affect transaction wraparound vacuuming at all, either
>>> positively or negatively. You still have to freeze xmins before they
>>> flip from being in the past to being in the future, but that's it.
>>
>> Sorry, I was trying to say that it's similar to the freeze issue, not
>> that it affects freeze. Sorry for the lack of clarity.
>>
>> What I was getting at is that this could cause us to vacuum
>> relations/pages which would otherwise never be vacuumed (or at least,
>> not until freeze). Imagine a very large DW table which is normally
>> insert-only and seldom queried, but once a month or so the insert aborts
>> and rolls back.
>
> Oh, I see. In that case, under the proposed scheme, you'd get an
> immediate vacuum of everything inserted into the table since the last
> failed insert. Everything prior to the last failed insert would be
> OK, since the visibility map bits would already be set for those
> pages. Yeah, that would be annoying.
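The scenario above can be sketched as a toy model (plain Python, not PostgreSQL code; the Heap class and its fields are purely illustrative): vacuum skips any heap page whose all-visible bit is set in the visibility map, so an aborted insert only forces a scan of the pages dirtied since the VM bits were last set.

```python
# Toy model of visibility-map-driven vacuum skipping.  Pages whose
# all-visible bit is set are never revisited; freshly inserted pages
# start with the bit clear.

class Heap:
    def __init__(self):
        self.pages = []            # each page: {"all_visible": bool}

    def insert_pages(self, n):
        """Freshly inserted pages start with their VM bit clear."""
        self.pages.extend({"all_visible": False} for _ in range(n))

    def vacuum(self):
        """Scan only pages with the bit clear; set it once they're clean."""
        scanned = [i for i, p in enumerate(self.pages)
                   if not p["all_visible"]]
        for i in scanned:
            self.pages[i]["all_visible"] = True
        return scanned

heap = Heap()
heap.insert_pages(100)      # a month of clean inserts
heap.vacuum()               # VM bits now set for pages 0..99
heap.insert_pages(10)       # new batch, containing the aborted insert
rescanned = heap.vacuum()   # only the 10 new pages are visited
```

Under this model, one failed insert triggers a scan of everything inserted since the previous vacuum pass, exactly the annoyance described above.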
Ah, but it might be fixable. You wouldn't really need to do a full-fledged vacuum. It would be sufficient to scan the heap pages that might contain the XID we're trying to clean up after, without touching the indexes. Instead of actually removing tuples with an aborted XMIN, you could just mark the line pointers LP_DEAD. Tuples with an aborted XMAX don't require touching the indexes anyway. So as long as you have some idea which segment of the relation was potentially dirtied by that transaction, you could just scan those blocks and update the item pointers and/or XMAX values for the offending tuples without doing anything else (although you'd probably want to opportunistically grab the buffer cleanup lock and defragment if possible).

Unfortunately, I'm now realizing another problem. During recovery, you have to assume that any XIDs that didn't commit are aborted; under the scheme I proposed upthread, if a transaction that was in flight at crash time had begun prior to the last checkpoint, you wouldn't know which relations it had potentially dirtied. Ouch.

But I think this is fixable, too. Let's invent a new on-disk structure called the modified-content log. Transactions that want to insert, update, or delete tuples allocate pages from this structure. The header of each page stores the XID of the transaction that owns that page and the ID of the database to which that transaction is bound. Following the header, there is a series of records of the form: tablespace OID, table OID, starting page number, ending page number. Each such record indicates that the given XID may have put its XID on disk within the given page range of the specified relation. Each checkpoint flushes the dirty pages of the modified-content log to disk along with everything else. Thus, on redo, we can reconstruct the additional entries that need to be added to the log from the contents of WAL subsequent to the redo pointer.
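The bookkeeping side of that structure can be sketched in Python (a toy model only, not the proposed on-disk C format; the class and method names, and the OID values in the usage example, are invented for illustration):

```python
# Toy model of the "modified-content log": per-XID lists of
# (tablespace OID, table OID, starting block, ending block) records,
# each naming a block range the XID may have stamped itself into.
from collections import defaultdict

class ModifiedContentLog:
    def __init__(self):
        self.ranges_by_xid = defaultdict(list)

    def record(self, xid, tsoid, reloid, start_blk, end_blk):
        # xid may have put its XID on disk within this block range
        self.ranges_by_xid[xid].append((tsoid, reloid, start_blk, end_blk))

    def on_commit(self, xid):
        # committed transactions need no cleanup: discard their pages
        self.ranges_by_xid.pop(xid, None)

    def on_abort(self, xid):
        # an aborted transaction's records must stick around until every
        # copy of its XID is eradicated from the relation data files
        return self.ranges_by_xid.get(xid, [])

log = ModifiedContentLog()
log.record(xid=1000, tsoid=1663, reloid=16384, start_blk=0, end_blk=63)
log.record(xid=1000, tsoid=1663, reloid=16385, start_blk=10, end_blk=10)
log.on_commit(999)                # no-op: nothing was logged for 999
to_scan = log.on_abort(1000)      # block ranges cleanup must visit
```

The commit path here mirrors the point made below: a transaction that commits leaves nothing behind in the log, so only aborted transactions ever pin log pages.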
If a transaction commits, we can remove all of its pages from the modified-content log; in fact, if a transaction begins and commits without an intervening checkpoint, the pages never need to hit the disk at all. If a transaction aborts, its modified-content log pages must stick around until we've eradicated any copies of its XID in the relation data files.

We maintain a global value for the oldest aborted XID which is not yet fully cleaned up (let's call this the OldestNotQuiteDeadYetXID). When we see an XID which precedes OldestNotQuiteDeadYetXID, we know it's committed. Otherwise, we check whether the XID precedes the xmin of our snapshot. If it does, we have to check whether the XID is committed or aborted (it must be one or the other). If it does not, we use our snapshot, as now.

Checking XIDs between OldestNotQuiteDeadYetXID and our snapshot's xmin is potentially expensive, but (1) if there aren't many aborted transactions, this case shouldn't arise very often; (2) if the XID turns out to be aborted and we can get an exclusive buffer content lock, we can nuke that copy of the XID to save the next guy the trouble of examining it; and (3) we can maintain a size-limited per-backend cache of this information, which should help in the normal cases where there either aren't that many XIDs that fall into this category or our transaction doesn't see all that many of them.

This also addresses Tom's concern about needing to store all the information in memory, and the need to WAL-log not-yet-cleaned-up XIDs at each checkpoint. You still need to aggressively clean up after aborted transactions, either using our current vacuum mechanism or the "just zap the XIDs" shortcut described above.

(An additional interesting point about this design is that you could potentially also use it to drive vacuum activity for transactions that commit, especially if we were to also store a flag indicating whether each page range contained updates/deletes or only inserts.)
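The three-way lookup rule can be sketched as follows (a simplified toy in Python, not PostgreSQL's actual visibility code: it assumes plain integer XID ordering with no wraparound, ignores subtransactions, and models the clog as a dict; all names are hypothetical):

```python
# Toy sketch of the XID visibility rule described above:
#   xid < OldestNotQuiteDeadYetXID          -> known committed
#   OldestNotQuiteDeadYetXID <= xid < xmin  -> consult the clog
#   otherwise                               -> judge by the snapshot

def xid_is_visible(xid, oldest_nqdy_xid, snapshot_xmin,
                   clog, snapshot_running):
    if xid < oldest_nqdy_xid:
        # every aborted XID this old has been fully cleaned up,
        # so any surviving copy must belong to a committed transaction
        return True
    if xid < snapshot_xmin:
        # the potentially expensive case: committed or aborted,
        # it must be one or the other
        return clog[xid] == "committed"
    # fall back to the snapshot, as now
    return xid not in snapshot_running

clog = {90: "aborted", 95: "committed"}
old_xid_visible = xid_is_visible(50, 80, 100, clog, set())
aborted_visible = xid_is_visible(90, 80, 100, clog, set())
```

The caching and XID-nuking optimizations from points (2) and (3) would sit in the middle branch, shrinking how often the clog lookup is actually paid.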
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers