Hi,

On Friday, February 17, 2012 01:17:27 AM Dan Scales wrote:
> Good point, thanks.  From the ext3 source code, it looks like
> ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the
> block device, whereas simple direct IO does not.  So, that would make
> this wal_sync_method option less useful, since, as you say, the user
> would have to know if the block device is doing write caching.

The experiments I know of that played with disabling write caches nearly
always found that write caching was worth the overhead of syncing.
> For the numbers I reported, I don't think the performance gain is from
> not doing the block device flush.  The system being measured is a Fibre
> Channel disk which should have a fully-nonvolatile disk array.  And
> measurements using systemtap show that blkdev_issue_flush() always takes
> only in the microsecond range.

Well, I think it has some io queue implications which could explain some of
the difference. With that regard I think it heavily depends on the kernel
version, as that's an area which has had loads of pretty radical changes in
nearly every release since 2.6.32.

> I think the overhead is still from the fact that ext3_sync_file() waits
> for the current in-flight transaction if there is one (and does an
> explicit device flush if there is no transaction to wait for).  I do
> think there are lots of meta-data operations happening on the data files
> (especially for a growing database), so the WAL log commit is waiting for
> unrelated data operations.  It would be nice if there were a simple file
> system operation that just flushed the cache of the block device
> containing the filesystem (i.e. just does the blkdev_issue_flush() and
> not the other things in ext3_sync_file()).

I think you are right there. The metadata issue could be relieved a lot by
growing files in much larger chunks than we currently do. I have seen
profiles indicating that lots of time was spent on increasing the file
size, and I would be very interested in seeing how much changes in that
area would benefit real-world benchmarks.

> The ext4_sync_file() code looks fairly similar, so I think it may have
> the same problem, though I can't be positive.  In that case, this
> wal_sync_method option might help ext4 as well.

The journaling code for ext4 is significantly different, so I think it very
well might play a role here - although you're probably right and it won't
be in *_sync_file.
> With respect to sync_file_range(), the Linux code that I'm looking at
> doesn't really seem to indicate that there is a device flush (since it
> never calls an f_op->fsync_file operation).  So sync_file_range() may
> not be as useful as thought.

Hm, need to check that. I thought it invoked that path somewhere.

> By the way, all the numbers were measured with "data=writeback,
> barrier=1" options for ext3.  I don't think that I have seen a
> significant difference with the DBT2 workload for the ext3 option
> data=ordered.

You have not? Interesting again, because I have seen results that differed
by an order of magnitude.

> I will measure all these numbers again tonight, but with barrier=0, so as
> to try to confirm that the write flush itself isn't costing a lot for
> this configuration.

Got any results so far?

Thanks,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers