On 30.11.2010 06:57, Robert Haas wrote:
> I can't say I'm totally in love with any of these designs.  Anyone
> else have any ideas, or any opinions about which one is best?

Well, the design I've been pondering goes like this:

At vacuum:

1. Write an "intent" XLOG record listing a chunk of visibility map bits that are not currently set and that we are going to try to set. A chunk of, say, 100 bits would be about right.

2. Scan the 100 heap pages as we currently do, setting the visibility map bits as we go.

3. After the scan, lock the visibility map page, check which of the bits that we set in step 2 are still set (concurrent updates might've cleared some), and write a final XLOG record listing the set bits. This step isn't necessary for correctness, BTW, but without it you lose all the set bits if you crash before the next checkpoint.

At replay, when we see the intent XLOG record, clear all the bits listed in it. This ensures that if we crashed and some of the visibility map bits were flushed to disk but the corresponding changes to the heap pages were not, the bits are cleared. When we see the final XLOG record, we set the bits.
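The replay rules above can be sketched as a toy model. This is only an illustration, not PostgreSQL code: the record classes, the `replay` function, and the representation of the visibility map as a set of heap page numbers are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class IntentRecord:
    # Heap pages whose visibility map bits vacuum is going to try to set (step 1).
    pages: List[int]


@dataclass
class FinalRecord:
    # Heap pages whose bits were still set when vacuum finished the chunk (step 3).
    pages: List[int]


def replay(on_disk_bits: Set[int], wal: list) -> Set[int]:
    """Apply the records in order to the bits found on disk after a crash."""
    bits = set(on_disk_bits)
    for rec in wal:
        if isinstance(rec, IntentRecord):
            # Any listed bit may have reached disk without the matching
            # heap page change, so clear all of them.
            bits -= set(rec.pages)
        elif isinstance(rec, FinalRecord):
            # These bits were verified under the visibility map page lock.
            bits |= set(rec.pages)
    return bits


# Crash before the final record: bits 1 and 2 from the chunk had been
# flushed, bit 7 belongs to an unrelated page.
print(sorted(replay({1, 2, 7}, [IntentRecord([1, 2, 3])])))
# prints [7]

# Crash after the final record: the verified bits come back.
print(sorted(replay({1, 2, 7}, [IntentRecord([1, 2, 3]), FinalRecord([1, 3])])))
# prints [1, 3, 7]
```

The first case shows why step 3 is optional for correctness: without the final record the result is only pessimistic, never wrong.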

Some care is needed with checkpoints. Setting visibility map bits in step 2 is safe because crash recovery will replay the intent XLOG record and clear any incorrectly set bits. But if a checkpoint has happened after the intent XLOG record was written, that's not true. This can be avoided by checking RedoRecPtr in step 2, and writing a new intent XLOG record if it has changed since the last intent XLOG record was written.
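A rough sketch of the RedoRecPtr check, again as a toy model rather than real backend code. `vacuum_chunk`, `get_redo_rec_ptr`, and `all_visible` are hypothetical names. The message does not say which bits the new intent record should list; relisting the pages not yet processed is an assumption of this sketch.

```python
def vacuum_chunk(pages, vm_bits, wal, get_redo_rec_ptr, all_visible):
    """Steps 1 and 2 for one chunk, re-logging the intent after a checkpoint."""
    # Read the redo pointer before writing the record, so that a checkpoint
    # starting in between is noticed by the comparison below.
    redo_at_intent = get_redo_rec_ptr()
    wal.append(("intent", list(pages)))

    for i, page in enumerate(pages):
        if not all_visible(page):
            continue
        redo = get_redo_rec_ptr()
        if redo != redo_at_intent:
            # A checkpoint has begun since the last intent record, so crash
            # recovery would start after it and never replay it.
            # Assumption: the new record covers this page and the rest.
            wal.append(("intent", list(pages[i:])))
            redo_at_intent = redo
        vm_bits.add(page)


# Simulated redo pointer that moves while the chunk is being scanned.
redo_values = iter([100, 100, 200, 200])
vm_bits, wal = set(), []
vacuum_chunk([10, 11, 12], vm_bits, wal,
             get_redo_rec_ptr=lambda: next(redo_values),
             all_visible=lambda page: True)
print(wal)
# prints [('intent', [10, 11, 12]), ('intent', [11, 12])]
```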

There's a small race condition in the way a visibility map bit is currently cleared. When a heap page is updated, it is locked, the update is WAL-logged, and the lock is released. The visibility map page is updated only after that. If the final vacuum XLOG record is written just after updating the heap page, but before the visibility map bit is cleared, replaying the final XLOG record will set a bit that should not have been set.
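The interleaving can be played out in a small simulation. This is a hypothetical model: the record names are invented, and treating replay of a heap update as clearing the bit is a modeling assumption (the outcome is the same without it, because the final record comes later in the WAL).

```python
def replay(wal):
    """Replay from an empty visibility map using the proposed rules."""
    bits = set()
    for kind, pages in wal:
        if kind in ("intent", "heap_update"):
            bits -= set(pages)   # assumption: heap update replay clears the bit
        elif kind == "final":
            bits |= set(pages)
    return bits


PAGE = 42
vm_bits, wal = set(), []

wal.append(("intent", [PAGE]))        # vacuum, step 1
vm_bits.add(PAGE)                     # vacuum, step 2

wal.append(("heap_update", [PAGE]))   # backend: lock heap page, WAL-log, unlock
# Window: the heap page is modified, but the bit has not been cleared yet.

still_set = [p for p in [PAGE] if p in vm_bits]
wal.append(("final", still_set))      # vacuum, step 3 sees the stale bit

vm_bits.discard(PAGE)                 # backend finally clears the bit

print(sorted(vm_bits))      # prints []
print(sorted(replay(wal)))  # prints [42]
```

The running system ends up with the bit clear, but a crash followed by replay would leave it set for a page that is no longer all-visible.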

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
