On Mon, Jan 31, 2011 at 4:28 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> Back to the idea at hand - I proposed something a bit along these
>> lines upthread, but my idea was to proactively perform the fsyncs on
>> the relations that had gone the longest without a write, rather than
>> the ones with the most dirty data.
>
> Yeah.  What I meant to suggest, but evidently didn't explain well, was
> to use that or something much like it as the rule for deciding *what* to
> fsync next, but to use amount-of-unsynced-data-versus-threshold as the
> method for deciding *when* to do the next fsync.

Oh, I see.  Yeah, that could be a good algorithm.

I also think Bruce's idea of calling fsync() on each relation just
*before* we start writing the pages from that relation might have some
merit.  (I'm assuming here that we are sorting the writes.)  That
should tend to result in the end-of-checkpoint fsyncs being quite
fast, because we'll only have as much dirty data floating around as we
actually wrote during the checkpoint, which according to Greg Smith is
usually a small fraction of the total data in need of flushing.  Also,
if one of the pre-write fsyncs takes a long time, then that'll get
factored into our calculations of how fast we need to write the
remaining data to finish the checkpoint on schedule.  Of course
there's still the possibility that the I/O system literally can't
finish a checkpoint in X minutes, but even in that case, the I/O
saturation will hopefully be more spread out across the entire
checkpoint instead of falling like a hammer at the very end.

Back to your idea: One problem with trying to bound the unflushed data
is that it's not clear what the bound should be.  I've had this mental
model where we want the OS to write out pages to disk, but that's not
always true, per Greg Smith's recent posts about Linux kernel tuning
slowing down VACUUM.  A possible advantage of the Momjian algorithm
(as it's known in the literature) is that we don't actually start
forcing anything out to disk until we have a reason to do so - namely,
an impending checkpoint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
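
For illustration only, here is a rough sketch of the what/when split
discussed in the quoted text: the total amount of unsynced data versus a
threshold decides *when* to fsync, and the relation that has gone the
longest without a write decides *what* to fsync.  This is not PostgreSQL
code; the Relation fields, the function names, and the threshold value
are all hypothetical, and the "fsync" is simulated by zeroing a counter.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    # All fields are hypothetical bookkeeping, not PostgreSQL structures.
    name: str
    last_write: float    # time of the most recent write to this relation
    unsynced_bytes: int  # bytes written but not yet fsync'd


def pick_next_fsync(relations, threshold_bytes):
    """Return the relation to fsync next, or None if no fsync is due."""
    dirty = [r for r in relations if r.unsynced_bytes > 0]
    total = sum(r.unsynced_bytes for r in dirty)
    # *When*: act only once total unsynced data exceeds the threshold.
    if not dirty or total <= threshold_bytes:
        return None
    # *What*: the relation that has gone the longest without a write.
    return min(dirty, key=lambda r: r.last_write)


def drain(relations, threshold_bytes):
    """Simulate fsyncs until unsynced data is back under the threshold."""
    order = []
    while (rel := pick_next_fsync(relations, threshold_bytes)) is not None:
        order.append(rel.name)
        rel.unsynced_bytes = 0  # stand-in for a real fsync() call
    return order
```

Note that the sketch leaves open the question raised above: it takes
threshold_bytes as a given, and says nothing about what a sensible bound
would be.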