Good point, thanks. From the ext3 source code, it looks like ext3_sync_file() does a blkdev_issue_flush(), which issues a flush to the block device, whereas simple direct IO does not. So, that would make this wal_sync_method option less useful, since, as you say, the user would have to know if the block device is doing write caching.
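To illustrate (an untested sketch, not from the patch): in a plain O_DIRECT write path like the one below, nothing ever asks the drive to flush its volatile cache, so the pwrite() can return once the data reaches the drive's cache.

    #define _GNU_SOURCE           /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define WAL_BLOCK 8192        /* 8K-aligned WAL buffer, as in the patch */

    int main(void)
    {
        void *buf;
        int   fd;

        /* O_DIRECT requires an aligned buffer. */
        if (posix_memalign(&buf, WAL_BLOCK, WAL_BLOCK) != 0)
            return 1;
        memset(buf, 'x', WAL_BLOCK);

        /* "wal_segment" stands in for a fully preallocated WAL file. */
        fd = open("wal_segment", O_WRONLY | O_DIRECT);
        if (fd < 0)
            return 1;

        /* Bypasses the page cache, but issues no device cache flush:
           the data may still sit in the drive's volatile write cache. */
        if (pwrite(fd, buf, WAL_BLOCK, 0) != WAL_BLOCK)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }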
For the numbers I reported, I don't think the performance gain comes from skipping the block device flush. The system being measured is a Fibre Channel disk, which should be backed by a fully non-volatile disk array, and measurements using systemtap show that blkdev_issue_flush() always takes only microseconds. I think the overhead is instead from the fact that ext3_sync_file() waits for the current in-flight journal transaction if there is one (and does an explicit device flush only if there is no transaction to wait for). There are lots of metadata operations happening on the data files (especially for a growing database), so the WAL commit ends up waiting on unrelated metadata operations.

It would be nice if there were a simple file system operation that just flushed the cache of the block device containing the filesystem, i.e. one that does just the blkdev_issue_flush() and not the other things in ext3_sync_file() (see the sketch after my signature).

The ext4_sync_file() code looks fairly similar, so I think it may have the same problem, though I can't be positive. In that case, this wal_sync_method option might help ext4 as well.

With respect to sync_file_range(), the Linux code that I'm looking at doesn't seem to indicate that it issues a device flush (it never calls an f_op->fsync_file operation), so sync_file_range() may not be as useful as thought.

By the way, all the numbers were measured with the "data=writeback, barrier=1" options for ext3. I don't think I have seen a significant difference with the DBT2 workload under the ext3 option data=ordered. I will measure all these numbers again tonight with barrier=0, to try to confirm that the write flush itself isn't costing much in this configuration.

Dan
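P.S. Two untested sketches of what I mean, in case they are useful. The first assumes (I haven't verified this on every kernel) that an fsync() on the block device node itself goes through blkdev_fsync() and hence blkdev_issue_flush(), giving just the bare device flush; the second is the sync_file_range() call Andres suggests below, which, per the above, does not appear to flush the device cache. The device path and function names are just for illustration.

    #define _GNU_SOURCE           /* for sync_file_range() */
    #include <fcntl.h>
    #include <unistd.h>

    /* Flush only the block device's volatile write cache, without the
       journal-transaction waits in ext3_sync_file().  Assumes Linux
       routes fsync() on a device node through blkdev_issue_flush(). */
    int flush_device_cache(const char *devpath)   /* e.g. "/dev/sdb1" */
    {
        int fd = open(devpath, O_RDONLY);
        int rc;

        if (fd < 0)
            return -1;
        rc = fsync(fd);
        close(fd);
        return rc;
    }

    /* The sync_file_range() variant from the message quoted below; it
       writes back and waits on the dirty pages for the WAL range, but
       does not appear to issue a device cache flush. */
    int wal_sync_range(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }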
----- Original Message -----
From: "Andres Freund" <and...@anarazel.de>
To: pgsql-hackers@postgresql.org
Cc: "Dan Scales" <sca...@vmware.com>
Sent: Thursday, February 16, 2012 10:32:09 AM
Subject: Re: [HACKERS] possible new option for wal_sync_method

Hi,

On Thursday, February 16, 2012 06:18:23 PM Dan Scales wrote:
> When running Postgres on a single ext3 filesystem on Linux, we find that
> the attached simple patch gives a significant performance benefit (7-8% in
> the numbers below). The patch adds a new option for wal_sync_method, which
> is "open_direct". With this option, the WAL is always opened with
> O_DIRECT (but not O_SYNC or O_DSYNC). For Linux, the use of only
> O_DIRECT should be correct. All WAL logs are fully allocated before
> being used, and the WAL buffers are 8K-aligned, so all direct writes are
> guaranteed to complete before returning. (See
> http://lwn.net/Articles/348739/)

I don't think that behaviour is safe in the face of write caches in the IO
path. Linux takes care to issue flush/barrier instructions when necessary if
you issue an fsync/fdatasync, but to my knowledge it does not when O_DIRECT
is used (that would suck performance-wise). I think that behaviour is safe
if you have no externally visible write caching enabled, but that's not
exactly easy-to-get (or easy-to-document) knowledge.

Why should there otherwise be any performance difference between
O_DIRECT|O_SYNC and O_DIRECT in the WAL write case? There is no metadata
that needs to be written, and I have a hard time imagining that the check
for whether there is metadata is that expensive.

I guess a more interesting case would be comparing O_DIRECT|O_SYNC with
O_DIRECT + fdatasync(), or even O_DIRECT +
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER).

Any special reason you did that comparison on ext3? Especially with
data=ordered, its behaviour regarding syncs is pretty insane
performance-wise. Ext4 would be a bit more interesting...

Andres