On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch <n...@leadboat.com> wrote:
> We do, in numerous places, drop a shared buffer content lock and reacquire it
> in exclusive mode. However, the existing users of that pattern tend to trade
> the lock, complete subsequent work, and unlock the buffer all within the same
> function. So they must, because several of them recheck facts that can change
> during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers
> do not adhere to that pattern; they can be quite distant from the original
> lock acquisition and eventual lock release. Take the prototypical case of
> SetHintBits(). Our shared lock originated in a place like heapgettup(), which
> expects to hold it continually until finished. We would need to somehow push
> up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
> lock, then have heapgettup() reorient itself as needed. Every caller of code
> that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
> Perhaps there's a different way to apply the lock-trade technique that avoids
> this mess, but I'm not seeing it. Consequently, I find the idea of requiring
> a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
> to be a sensible one. Within that umbrella, some details need attention:
Fair analysis.

> - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note
>   gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the
>   spinlock for heap buffers, letting gistScanPage() off the hook.) We need a
>   public API, perhaps LockBufferForLSN().

That seems like a good idea. I am a little worried about Simon's plan to do the additional locking only for AMs that have no unlogged-type updates. It's a reasonable performance optimization, and skipping the locking when checksums are shut off also seems reasonable, but it seems a bit fragile: suppose that, in the future, someone fixes GiST to do something intelligent with kill_prior_tuple. Now suddenly it needs LSN locking that it didn't need before, but this fact might not be very obvious.

It might be a good idea to design LockBufferForLSN to take an AM OID as an argument, and use the AM OID to decide whether the locking is required. That way, we can call it from everywhere that reads an LSN, and it can simply skip the locking in places where it isn't presently needed. (There's a rough sketch of what I mean at the end of this mail.) Or maybe there's a better design, but I agree with you that some kind of public API is essential.

> - The use of some spinlock need not imply using the buffer header spinlock.
>   We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual
>   trade-off of splitting a lock: less contention at the cost of more
>   acquisitions. I have no intuition on which approach would perform better.

I think I like the idea of a separate lock, but I agree it could stand to be tested both ways.

Another thought is that we might add SpinLockConditionalAcquire(), which just tries to TAS the lock and returns false if the lock is already held. Then we could grab the spinlock while writing the page, in lieu of copying it, and anyone who wants to set a hint bit can conditionally acquire the spinlock just long enough to set the bits. If the spinlock isn't immediately free, then instead of spinning they simply don't set the hint bits after all. That way you can be sure that no hint bits are set during the write, but any attempt to set a hint bit costs only a single TAS - no loop, as there would be with a regular spinlock acquisition - and of course the hint itself. (This is also sketched at the end of this mail.)

> A possible compromise is to leave the page clean after setting a hint bit,
> much like the patch already has us do under hot standby. Then there's no new
> WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin
> benchmarked when he was looking at hint bits? Does anyone recall the result?

It slows things down substantially in the case of repeated scans of the same unchanging table. In fact, I tried a much more nuanced approach: set BM_UNTIDY on the page when SetBufferCommitInfoNeedsSave() is called. BM_UNTIDY pages are written out by the background writer and during checkpoints, and 5% of the time by backends. All of that has the salutary effect of making the first sequential scan less ungodly slow, but it also has the less-desirable effect of making the next 19 of them still kinda-sorta slow. It was unclear to me (and perhaps to others) whether it really worked out to a win.

However, we might need/want to revisit some of that logic in connection with this patch, because it seems to me that a large sequential scan of an unhinted table could be ferociously slow, with the backend writing out its own pages and WAL-logging them as it goes. It would be good to test that, to figure out whether some mitigation measures are needed.
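To make the API shape concrete, here is roughly what I have in mind. This is strictly a sketch, assuming the dedicated-lock variant from your second point: LockBufferForLSN(), AmNeedsLSNLocking(), and pd_lsn_tli_lock are all invented names, and none of this exists today.

    /*
     * Sketch only.  Every BUFFER_LOCK_SHARE caller that wants the page
     * LSN would go through here, and the AM OID would decide whether
     * the extra locking is actually required for that AM.
     */
    XLogRecPtr
    LockBufferForLSN(Buffer buffer, Oid amoid)
    {
        /* shared buffers only, for the sake of the sketch */
        BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];
        XLogRecPtr  lsn;

        /* AMs with no unlogged-type updates can skip the lock today */
        if (!AmNeedsLSNLocking(amoid))
            return PageGetLSN(BufferGetPage(buffer));

        /* hypothetical dedicated spinlock protecting pd_lsn/pd_tli */
        SpinLockAcquire(&bufHdr->pd_lsn_tli_lock);
        lsn = PageGetLSN(BufferGetPage(buffer));
        SpinLockRelease(&bufHdr->pd_lsn_tli_lock);

        return lsn;
    }

With something like that in place, if someone later teaches GiST to do something intelligent with kill_prior_tuple, making it safe is just a matter of changing what AmNeedsLSNLocking() returns for that AM, rather than hunting down every PageGetLSN() call site.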
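And here is the conditional-acquire idea in sketch form. Again, SpinLockConditionalAcquire() and hint_lock are made up; TAS() is the existing primitive from s_lock.h, which returns zero when it acquires the lock.

    /* Try-lock: a single TAS, no spinning.  True if we got the lock. */
    #define SpinLockConditionalAcquire(lock) \
        (TAS((volatile slock_t *) (lock)) == 0)

    /* Writer side: hold the (made-up) lock across the write, in lieu
     * of copying the buffer, so no hint bits can change mid-write. */
    SpinLockAcquire(&bufHdr->hint_lock);
    /* ... write the page out ... */
    SpinLockRelease(&bufHdr->hint_lock);

    /* Hint-setter side: one TAS and no spinning.  If the writer holds
     * the lock, just skip setting the hints this time around. */
    if (SpinLockConditionalAcquire(&bufHdr->hint_lock))
    {
        tuple->t_infomask |= infomask;      /* set the hint bits */
        SetBufferCommitInfoNeedsSave(buffer);
        SpinLockRelease(&bufHdr->hint_lock);
    }

One wrinkle: s_lock.h documents that on some platforms TAS() may occasionally report failure even when the lock is free. That's harmless here, since the worst case is a missed hint bit that somebody sets on a later visit.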
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company