On Tue, Nov 9, 2010 at 2:05 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark <gsst...@mit.edu> wrote:
>> On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk <ai...@highrise.ca> wrote:
>>> So, for getting checksums, we have to offer up a few things:
>>> 1) zero-copy writes: we need to buffer the write to get a consistent
>>> checksum (or lock the buffer tight)
>>> 2) saving hint bits on an otherwise unchanged page: we either need to
>>> just not write that page, and lose the work the hint bits did, or do
>>> a full-page WAL record of it, so the torn-page checksum is fixed
>>
>> Actually, the consensus the last go-around on this topic was to
>> segregate the hint bits into a single area of the page and skip them
>> in the checksum. That way we don't have to do any of the above. It's
>> just that that's a lot of work.
>
> And it still allows silent data corruption, because bogusly clearing a
> hint bit is, at the moment, harmless, but bogusly setting one is not.
> I really have to wonder how other products handle this. PostgreSQL
> isn't the only database product that uses MVCC - not by a long shot -
> and the problem of detecting whether an XID is visible to the current
> snapshot can't be ours alone. So what do other people do about this?
> They either don't cache the information about whether the XID is
> committed in-page (in which case, are they just slower, or do they
> have some other means of avoiding the performance hit?) or they cache
> it in the page (in which case, they either WAL-log it or they don't
> checksum it). I mean, there aren't any other options, are there?
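For illustration, here is a minimal sketch of the skip-the-hint-bits
scheme Greg describes above: the page checksum simply excludes a
dedicated hint-bit region, so flipping a hint bit never invalidates
the stored checksum. The region offset and size, and the toy checksum
function, are all invented for the example; this is not PostgreSQL's
actual page layout or checksum algorithm.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE        8192
#define HINT_AREA_OFFSET 64    /* hypothetical: hint bits segregated here */
#define HINT_AREA_SIZE   128   /* hypothetical size of the hint-bit area */

static uint32_t
checksum_bytes(uint32_t sum, const unsigned char *buf, size_t len)
{
    /* placeholder rolling checksum; a real one would use CRC32C or FNV */
    for (size_t i = 0; i < len; i++)
        sum = (sum << 5) + sum + buf[i];
    return sum;
}

static uint32_t
page_checksum_skipping_hints(const unsigned char *page)
{
    uint32_t sum = 5381;

    /* checksum the bytes before the hint-bit area ... */
    sum = checksum_bytes(sum, page, HINT_AREA_OFFSET);
    /* ... then skip the hint-bit area and checksum the rest of the page */
    sum = checksum_bytes(sum,
                         page + HINT_AREA_OFFSET + HINT_AREA_SIZE,
                         PAGE_SIZE - HINT_AREA_OFFSET - HINT_AREA_SIZE);
    return sum;
}

On read, the same computation is repeated and compared against the
stored value. A torn write that damaged only the hint-bit area would
still verify cleanly, which is exactly the trade-off being debated
here: no false alarms from hint-bit-only writes, but also no detection
of corruption confined to that area.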
An examination of the MySQL source code reveals their answer. In
row_vers_build_for_semi_consistent_read(), which I can't swear is the
right place but seems to be, there is this comment:

/* We assume that a rolled-back transaction stays in
TRX_ACTIVE state until all the changes have been rolled
back and the transaction is removed from the global list
of transactions. */

Which makes sense. If you never leave rows from aborted transactions
in the heap forever, then the list of aborted transactions that you
need to remember for MVCC purposes will remain relatively small, and
you can just include those XIDs in your MVCC snapshot. Our problem is
that we have no particular bound on the number of aborted transactions
whose XIDs may still be floating around, so we can't do it that way.

<dons asbestos underpants>

To impose a similar bound in PostgreSQL, you'd need to maintain the
set of aborted XIDs and the relations that still need to be vacuumed
for each one. As you vacuum, you prune any tuples with aborted xmins
(which is WAL-logged already anyway) and additionally WAL-log clearing
the xmax of each tuple with an aborted xmax. Thus, when you finish
vacuuming a relation, the aborted XID is no longer present anywhere in
it. When you vacuum the last relation for a particular XID, that XID
no longer exists anywhere in the relation files, and you can remove it
from the list of aborted XIDs. I think that WAL-logging the list of
XIDs, and the list of unvacuumed relations for each, at each
checkpoint would be sufficient for crash safety. If you did this, you
could then assume that any XID which precedes your snapshot's xmin is
committed (see the sketch after this list). The downsides I can see
are:

1. When a big abort happens, you may have to carry that XID around in
every snapshot - and avoid advancing RecentGlobalXmin - for quite a
long time.

2. You have to WAL-log marking the xmax of an aborted transaction
invalid.

3. You have to WAL-log the not-yet-cleaned-up XIDs, and the relations
each one still needs vacuumed, at each checkpoint.

4. There would presumably be some finite limit on the size of the
shared-memory structure for aborted transactions. I don't think
there'd be any reason to make it particularly small, but if you sat
there and aborted transactions at top speed, you might eventually run
out of room, at which point any transactions you started wouldn't be
able to abort until vacuum made enough progress to free up an entry.

5. It would be pretty much impossible to run with autovacuum turned
off, and in fact you would likely need to make it a good deal more
aggressive in the specific case of aborted transactions, to mitigate
problems #1, #3, and #4.

I'm not sure how bad those things would be, or whether there are more
that I'm missing (besides the obvious "it would be a lot of work").

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
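A minimal sketch of the visibility rule the proposal above implies,
assuming an invented snapshot layout: any XID older than the
snapshot's xmin is presumed committed unless it appears in the bounded
set of aborted-but-not-yet-vacuumed XIDs carried in the snapshot. The
struct and function names here are hypothetical, not real PostgreSQL
structures, and subtransactions, XID wraparound, and the current
transaction's own XID are all ignored for brevity.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t XID;

typedef struct Snapshot
{
    XID   xmin;        /* all XIDs < xmin have finished */
    XID   xmax;        /* all XIDs >= xmax are in the future */
    XID  *running;     /* in-progress XIDs at snapshot time, sorted */
    int   n_running;
    XID  *aborted;     /* aborted XIDs not yet vacuumed away, sorted */
    int   n_aborted;
} Snapshot;

static bool
xid_in_sorted_set(XID xid, const XID *set, int n)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (set[mid] == xid)
            return true;
        else if (set[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}

/* Does this snapshot see the given XID as committed? */
static bool
xid_visible_as_committed(XID xid, const Snapshot *snap)
{
    if (xid >= snap->xmax)
        return false;   /* started after the snapshot was taken */
    if (xid_in_sorted_set(xid, snap->aborted, snap->n_aborted))
        return false;   /* known aborted, not yet vacuumed away */
    if (xid >= snap->xmin &&
        xid_in_sorted_set(xid, snap->running, snap->n_running))
        return false;   /* was still in progress at snapshot time */
    /* older than xmin and not in the aborted set: presumed committed */
    return true;
}

Problem #1 above shows up directly in this sketch: a long-lived entry
in the aborted set makes every snapshot's aborted array larger and has
to be consulted on every visibility check until vacuum retires it.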