Greg Smith muttered a while ago about wanting to do something with sync_file_range to improve checkpoint behavior on Linux. I thought he was talking about trying to sync only the range of blocks known to be dirty, which didn't seem like a very exciting idea, but after looking at the man page for sync_file_range, I think I understand what he was really going for: sync_file_range allows you to hint to the Linux kernel that you'd like it to clean a certain set of pages. I further recall from Greg's previous comments that in the scenarios he's seen, checkpoint I/O spikes are caused not so much by the data written out by the checkpoint itself as by the other dirty data already in the kernel buffer cache. Based on that, I whipped up the attached patch, which, if sync_file_range is available, simply iterates through everything that will eventually be fsync'd, before beginning the write phase, and tells the Linux kernel to put all of those files under write-out.
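To make the idea concrete, here is a minimal standalone sketch of the kind of call involved. It is not the code from the attached patch: hint_writeback is a hypothetical helper name, and the array of file descriptors stands in for "everything that will eventually be fsync'd". It assumes Linux with glibc, where sync_file_range is declared in <fcntl.h> under _GNU_SOURCE; a length of 0 means "through the end of the file", and SYNC_FILE_RANGE_WRITE asks the kernel to start write-out of the dirty pages in the range.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

/*
 * Hypothetical helper (not taken from the patch): ask the kernel to start
 * write-out for every file in fds[].  Returns the number of files for which
 * the hint could not be issued.
 */
static int
hint_writeback(const int *fds, size_t nfds)
{
    int         failures = 0;
    size_t      i;

    for (i = 0; i < nfds; i++)
    {
#ifdef SYNC_FILE_RANGE_WRITE
        /* offset 0, nbytes 0: the whole file, from start to end */
        if (sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
            failures++;
#else
        /* sync_file_range is not available here, so this is a no-op */
        (void) fds[i];
#endif
    }

    return failures;
}
```

This is only a hint that starts I/O early. It does not wait for the writes to finish and gives no durability guarantee, so the later fsync calls are still required.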
I don't know that I have a suitable place to test this, and I'm not quite sure what a good test setup would look like either, so while I've tested that this appears to issue the right kernel calls, I am not sure whether it actually fixes the problem case. But here's the patch, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
writeback-v1.patch