Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks.  From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not.  So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.

The experiments I know of that played with disabling write caches nearly
always found that write caching was worth the overhead of syncing.
> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush.  The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array.  And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.

Well, I think it has some io queue implications which could explain some of
the difference. With that regard I think it heavily depends on the kernel
version, as that's an area which has had loads of pretty radical changes in
nearly every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for).  I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations.  It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).

I think you are right there. The metadata issue could be relieved a lot by
growing files in much larger chunks than we currently do. I have seen
profiles indicating that lots of time was spent on increasing the file
size, and I would be very interested in seeing how much changes in that
area would benefit real-world benchmarks.

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive.  In that case, this
> wal_sync_method option might help ext4 as well.

The journaling code for ext4 is significantly different, so I think it very
well might play a role here - although you're probably right and it won't
be in *_sync_file.
> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls an f_op->fsync_file operation).  So sync_file_range() may
> not be as useful as thought.

Hm, need to check that. I thought it invoked that path somewhere.

> By the way, all the numbers were measured with "data=writeback,
> barrier=1" options for ext3.  I don't think that I have seen a
> significant difference with the DBT2 workload for the ext3 option
> data=ordered.

You have not? Interesting again, because I have seen results that differed
by an order of magnitude.

> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.

Got any results so far?

Thanks,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers