On Wed, Jan 4, 2012 at 1:32 PM, Kevin Grittner
<kevin.gritt...@wicourts.gov> wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
>> we only fsync() at end-of-checkpoint.  So we'd have to think about
>> what to fsync, and how often, to keep the double-write buffer to a
>> manageable size.
>
> I think this is the big tuning challenge with this technology.
One of them, anyway.  I think it may also be tricky to make sure that
a backend that needs to write a dirty buffer doesn't end up having to
wait for a double-write to be fsync'd.

>> I can't help thinking that any extra fsyncs are pretty expensive,
>> though, especially if you have to fsync() every file that's been
>> double-written before clearing the buffer.  Possibly we could have
>> 2^N separate buffers based on an N-bit hash of the relfilenode and
>> segment number, so that we could just fsync 1/(2^N)-th of the open
>> files at a time.
>
> I'm not sure I'm following -- we would just be fsyncing those files
> we actually wrote pages into, right?  Not all segments for the table
> involved?

Yes.

>> But even that sounds expensive: writing back lots of dirty data
>> isn't cheap.  One of the systems I've been doing performance
>> testing on can sometimes take >15 seconds to write a shutdown
>> checkpoint,
>
> Consider the relation-file fsyncs for double-write as a form of
> checkpoint spreading, and maybe it won't seem so bad.  It should
> make that shutdown checkpoint less painful.  Now, I have been
> thinking that on a write-heavy system you had better have a BBU
> write-back cache, but that's my recommendation, anyway.

I think this point has possibly been beaten to death, but at the risk
of belaboring it I'll bring it up again: the frequency with which we
fsync() is basically a trade-off between latency and throughput.  If
you fsync a lot, then each fsync will be small, so you shouldn't
experience much latency, but throughput might suck.  If you don't
fsync very much, then you maximize the chances for write-combining
(because inserting an fsync between two writes to the same block
forces that block to be physically written twice rather than just
once), thus improving throughput; but when you do get around to
calling fsync() there may be a lot of data to write all at once, and
you may get a gigantic latency spike.

As far as I can tell, one fsync per file per checkpoint is the
theoretical minimum, and that's what we do now.  So our current
system is optimized for throughput.  The decision to put full-page
images (FPIs) into WAL rather than into a separate buffer is
essentially turning the dial in the same direction, because, in
effect, the double-write fsync piggybacks on the WAL fsync, which we
must do anyway.  So both the decision to use a double-write buffer AT
ALL and the decision to fsync more frequently to keep that buffer to
a manageable size are going to turn that dial in the opposite
direction.  It seems to me inevitable that, even with the best
possible implementation, throughput will get worse.  With a good
implementation (though not with a bad one), latency should improve.

Now, this is not necessarily a reason to reject the idea.  I believe
that several people have proposed that our current implementation is
*overly* optimized for throughput *at the expense of* latency, and
that we might want to provide some options that, in one way or
another, fsync more frequently, so that checkpoint spikes aren't as
bad.  But when it comes time to benchmark, we might need to think
somewhat carefully about what we're testing...

Another thought here is that double-writes may not be the best
solution, and are almost certainly not the easiest-to-implement
solution.  We could instead do something like this: when an unlogged
change is made to a buffer (e.g. a hint bit is set), we set a flag on
the buffer header.  When we evict such a buffer, we emit a WAL record
that just overwrites the whole buffer with a new FPI.  A toy sketch
of that, and of the 2^N bucketing idea quoted further up, follows.
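To make the flag-and-FPI idea concrete, here's a hand-wavy sketch of
the eviction-time logic I have in mind.  To be clear, none of this is
real bufmgr code: BM_UNLOGGED_CHANGE is a made-up flag bit, and
emit_fpi_record() / write_page_to_disk() are stubs standing in for
the real XLogInsert() and smgrwrite() paths.

#include <stdint.h>

#define TOY_PAGE_SIZE       8192

#define BM_DIRTY            (1u << 0)   /* page has WAL-logged changes */
#define BM_UNLOGGED_CHANGE  (1u << 1)   /* e.g. a hint bit was set */

typedef struct
{
    uint32_t    flags;
    char        page[TOY_PAGE_SIZE];    /* buffer contents */
} ToyBufferDesc;

/* Stub: would build and insert a full-page-image WAL record. */
static void
emit_fpi_record(const char *page)
{
    (void) page;
}

/* Stub: would write the page back to its data file. */
static void
write_page_to_disk(const char *page)
{
    (void) page;
}

/* Called wherever a hint bit (or similar unlogged change) is applied. */
static void
mark_unlogged_change(ToyBufferDesc *buf)
{
    buf->flags |= BM_UNLOGGED_CHANGE;
}

/*
 * Called at eviction time.  If the buffer carries unlogged changes,
 * WAL-log a fresh full-page image before writing, so that a torn
 * write of the data page can be repaired from WAL at recovery.
 */
static void
flush_buffer(ToyBufferDesc *buf)
{
    if (buf->flags & BM_UNLOGGED_CHANGE)
        emit_fpi_record(buf->page);

    write_page_to_disk(buf->page);
    buf->flags &= ~(BM_DIRTY | BM_UNLOGGED_CHANGE);
}

int
main(void)
{
    static ToyBufferDesc buf;

    mark_unlogged_change(&buf);         /* somebody set a hint bit */
    flush_buffer(&buf);                 /* eviction emits the FPI */
    return 0;
}

The point is just that the FPI, being WAL-logged, gives torn-page
protection for the hint-bit-only write without any separate
double-write machinery.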
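And while I'm at it, here's the same kind of toy sketch for the 2^N
bucketing idea quoted further up.  Again, none of these names are
real backend APIs -- ToyRelFileNode just stands in for the real
relfilenode, and the hash is an arbitrary mixer.  The point is only
that (relfilenode, segment) deterministically picks one of 2^N
double-write buffers, so clearing one buffer means fsyncing only the
files that hashed into that partition, not every file with a pending
double-write.

#include <stdint.h>
#include <stdio.h>

#define DW_PARTITION_BITS   3                         /* the "N" */
#define DW_NUM_PARTITIONS   (1 << DW_PARTITION_BITS)  /* 2^N buckets */

/* Stand-in for the backend's RelFileNode; not the real thing. */
typedef struct
{
    uint32_t    spcNode;        /* tablespace OID */
    uint32_t    dbNode;         /* database OID */
    uint32_t    relNode;        /* relation OID */
} ToyRelFileNode;

/*
 * Pick the double-write partition for a given file segment.  Any
 * reasonable mixing function would do; 0x9e3779b1 is just the usual
 * golden-ratio multiplier.
 */
static uint32_t
dw_partition(const ToyRelFileNode *rnode, uint32_t segno)
{
    uint32_t    h = rnode->spcNode;

    h = h * 0x9e3779b1u + rnode->dbNode;
    h = h * 0x9e3779b1u + rnode->relNode;
    h = h * 0x9e3779b1u + segno;
    return h & (DW_NUM_PARTITIONS - 1);     /* N-bit bucket number */
}

int
main(void)
{
    ToyRelFileNode  rnode = {1663, 16384, 16385};
    uint32_t        segno;

    /* Successive 1GB segments of one relation spread across buckets. */
    for (segno = 0; segno < 4; segno++)
        printf("segment %u -> partition %u\n",
               segno, dw_partition(&rnode, segno));
    return 0;
}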
There are some pretty obvious usage patterns where this flag-and-FPI
approach is likely to be painful (e.g. load a big table without
setting hint bits, and then seq-scan it).  But there are also many
use cases where the working set fits inside shared buffers and data
pages don't get written very often apart from checkpoint time, and
those cases might work just fine.  Also, the cases that are problems
for this implementation are likely to be problems for a double-write
based implementation too, for exactly the same reasons: if you
discover at buffer eviction time that you need to fsync something
(whether it's WAL or DW), it's going to hurt.

Checksums aren't free even when using double-writes: if you don't
have checksums, pages that have only hint-bit changes don't need to
be double-written at all.  If double writes aren't going to give us
anything "for free", maybe that's not the right place to be focusing
our efforts...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company