On Wed, Jan 4, 2012 at 8:31 AM, Kevin Grittner
<kevin.gritt...@wicourts.gov> wrote:
>> When we reach a restartpoint, we fsync everything down to disk and
>> then nuke the double-write buffer.
>
> I think we add to the double-write buffer as we write pages from the
> buffer to disk.  I don't think it makes sense to do potentially
> repeated writes of the same page with different contents to the
> double-write buffer as we go; nor is it a good idea to leave the page
> unsynced and let the double-write buffer grow for a long time.
You may be right.  Currently, though, we only fsync() at end-of-checkpoint, so we'd have to think about what to fsync, and how often, to keep the double-write buffer to a manageable size.  I can't help thinking that any extra fsyncs are pretty expensive, especially if you have to fsync() every file that's been double-written before clearing the buffer.  Possibly we could have 2^N separate buffers based on an N-bit hash of the relfilenode and segment number, so that we could just fsync 1/(2^N)-th of the open files at a time.  But even that sounds expensive: writing back lots of dirty data isn't cheap.  One of the systems I've been doing performance testing on can sometimes take >15 seconds to write a shutdown checkpoint, and I'm sure that other people have similar (and worse) problems.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
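
[Editorial note: to make the bucketing idea above concrete, here is a minimal sketch of how an N-bit hash of the relfilenode and segment number could route pages into 2^N separate double-write buffers.  All names, constants, and the hash mix below are illustrative assumptions, not code from the PostgreSQL tree.]

/*
 * Illustrative sketch only -- names, constants, and the hash mix are
 * assumptions for this example, not actual PostgreSQL code.
 *
 * Route each double-written page into one of 2^N buffers keyed by an
 * N-bit hash of its relfilenode and segment number, so that clearing
 * one bucket only requires fsync()ing the files recorded in that
 * bucket rather than every file that has been double-written.
 */
#include <stdint.h>
#include <stddef.h>

#define DW_HASH_BITS    4                        /* N */
#define DW_NUM_BUCKETS  (1 << DW_HASH_BITS)      /* 2^N buckets */

typedef struct DoubleWriteBuffer
{
    size_t      npages;     /* pending page images in this bucket */
    /* per-bucket list of page images and the files they touch */
} DoubleWriteBuffer;

static DoubleWriteBuffer dw_buffers[DW_NUM_BUCKETS];

/* Mix relfilenode and segment number down to an N-bit bucket index. */
static inline unsigned
dw_bucket_for(uint32_t relfilenode, uint32_t segno)
{
    uint32_t    h = relfilenode * 2654435761u ^ segno * 2246822519u;

    return h >> (32 - DW_HASH_BITS);
}

Clearing a bucket would then mean fsync()ing only the files whose pages landed in that bucket -- roughly 1/(2^N)-th of them, if the hash spreads writes evenly -- before discarding that bucket's double-write pages, while the other buckets keep accumulating undisturbed.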