Good point, thanks. From the ext3 source code, it looks like ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the block device, whereas simple direct IO does not. So, that would make this wal_sync_method option less useful, since, as you say, the user would have to know if the block device is doing write caching.
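To illustrate (an untested sketch, not from the patch): in a plain O_DIRECT write path like the one below, nothing ever asks the drive to flush its volatile cache, so the pwrite() can return once the data reaches the drive's cache.

    #define _GNU_SOURCE           /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define WAL_BLOCK 8192        /* 8K-aligned WAL buffer, as in the patch */

    int main(void)
    {
        void *buf;
        int   fd;

        /* O_DIRECT requires an aligned buffer. */
        if (posix_memalign(&buf, WAL_BLOCK, WAL_BLOCK) != 0)
            return 1;
        memset(buf, 'x', WAL_BLOCK);

        /* "wal_segment" stands in for a fully preallocated WAL file. */
        fd = open("wal_segment", O_WRONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        /* Bypasses the page cache, but issues no device cache flush:
           the data may still sit in the drive's volatile write cache. */
        if (pwrite(fd, buf, WAL_BLOCK, 0) != WAL_BLOCK)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }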
For the numbers I reported, I don't think the performance gain comes from skipping the block device flush. The system being measured is a Fibre Channel disk, which should be backed by a fully non-volatile disk array, and measurements using systemtap show that blkdev_issue_flush() always takes only microseconds. I think the overhead is instead from the fact that ext3_sync_file() waits for the current in-flight journal transaction if there is one (and does an explicit device flush only if there is no transaction to wait for). There are lots of metadata operations happening on the data files (especially for a growing database), so the WAL commit ends up waiting on unrelated metadata operations.

It would be nice if there were a simple file system operation that just flushed the cache of the block device containing the filesystem, i.e. one that does just the blkdev_issue_flush() and not the other things in ext3_sync_file() (see the sketch after my signature).

The ext4_sync_file() code looks fairly similar, so I think it may have the same problem, though I can't be positive. In that case, this wal_sync_method option might help ext4 as well.

With respect to sync_file_range(), the Linux code that I'm looking at doesn't seem to indicate that it issues a device flush (it never calls an f_op->fsync_file operation), so sync_file_range() may not be as useful as thought.

By the way, all the numbers were measured with the "data=writeback, barrier=1" options for ext3. I don't think I have seen a significant difference with the DBT2 workload under the ext3 option data=ordered. I will measure all these numbers again tonight with barrier=0, to try to confirm that the write flush itself isn't costing much in this configuration.

Dan
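P.S. Two untested sketches of what I mean, in case they are useful. The first assumes (I haven't verified this on every kernel) that an fsync() on the block device node itself goes through blkdev_fsync() and hence blkdev_issue_flush(), giving just the bare device flush; the second is the sync_file_range() call Andres suggests below, which, per the above, does not appear to flush the device cache. The device path and function names are just for illustration.

    #define _GNU_SOURCE           /* for sync_file_range() */
    #include <fcntl.h>
    #include <unistd.h>

    /* Flush only the block device's volatile write cache, without the
       journal-transaction waits in ext3_sync_file().  Assumes Linux
       routes fsync() on a device node through blkdev_issue_flush(). */
    int flush_device_cache(const char *devpath)   /* e.g. "/dev/sdb1" */
    {
        int fd = open(devpath, O_RDONLY);
        int rc;

        if (fd < 0)
            return -1;
        rc = fsync(fd);
        close(fd);
        return rc;
    }

    /* The sync_file_range() variant from the message quoted below; it
       writes back and waits on the dirty pages for the WAL range, but
       does not appear to issue a device cache flush. */
    int wal_sync_range(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }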
----- Original Message -----
From: "Andres Freund" <and...@anarazel.de>
To: pgsql-hackers@postgresql.org
Cc: "Dan Scales" <sca...@vmware.com>
Sent: Thursday, February 16, 2012 10:32:09 AM
Subject: Re: [HACKERS] possible new option for wal_sync_method

Hi,

On Thursday, February 16, 2012 06:18:23 PM Dan Scales wrote:
> When running Postgres on a single ext3 filesystem on Linux, we find that
> the attached simple patch gives a significant performance benefit (7-8% in
> the numbers below). The patch adds a new option for wal_sync_method, which
> is "open_direct". With this option, the WAL is always opened with
> O_DIRECT (but not O_SYNC or O_DSYNC). For Linux, the use of only
> O_DIRECT should be correct. All WAL logs are fully allocated before
> being used, and the WAL buffers are 8K-aligned, so all direct writes are
> guaranteed to complete before returning. (See
> http://lwn.net/Articles/348739/)

I don't think that behaviour is safe in the face of write caches in the IO
path. Linux takes care to issue flush/barrier instructions when necessary if
you issue an fsync/fdatasync, but to my knowledge it does not when O_DIRECT
is used (that would suck performance-wise). I think that behaviour is safe
if you have no externally visible write caching enabled, but that's not
exactly easy-to-get (or easy-to-document) knowledge.

Why should there otherwise be any performance difference between
O_DIRECT|O_SYNC and O_DIRECT in the WAL write case? There is no metadata
that needs to be written, and I have a hard time imagining that the check
for whether there is metadata is that expensive.

I guess a more interesting case would be comparing O_DIRECT|O_SYNC with
O_DIRECT + fdatasync(), or even O_DIRECT +
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER).

Any special reason you did that comparison on ext3? Especially with
data=ordered, its behaviour regarding syncs is pretty insane
performance-wise. Ext4 would be a bit more interesting...

Andres