When running Postgres on a single ext3 filesystem on Linux, we find that
the attached simple patch gives a significant performance benefit (about
8% over fdatasync in the numbers below).  The patch adds a new option
for wal_sync_method, "open_direct".  With this option, the WAL is always
opened with O_DIRECT (but not O_SYNC or O_DSYNC).  For Linux, the use of
O_DIRECT alone should be correct: all WAL files are fully allocated
before being used, and the WAL buffers are 8K-aligned, so all direct
writes are guaranteed to complete before returning.  (See
http://lwn.net/Articles/348739/)
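
To make the two preconditions concrete (file fully allocated, buffer
8K-aligned), here is a minimal standalone sketch.  It is not code from
the patch: preallocate(), direct_write_block() and WAL_BLOCK are
hypothetical names, and the O_DIRECT fallback to 0 only exists so the
sketch compiles on platforms without the flag.  It shows the mechanics
only and does not by itself prove that the data is on stable storage
when pwrite() returns.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* asks glibc to expose O_DIRECT */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* no O_DIRECT here: plain buffered I/O */
#endif

#define WAL_BLOCK 8192          /* 8K, matching the WAL buffer alignment */

/* Hypothetical helper: create `path` with `nblocks` zero-filled blocks so
 * that later writes only overwrite space that is already allocated. */
static int
preallocate(const char *path, int nblocks)
{
    char   *zeros = calloc(1, WAL_BLOCK);
    int     fd;
    int     rc = 0;

    if (zeros == NULL)
        return -1;
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
    {
        free(zeros);
        return -1;
    }
    for (int i = 0; i < nblocks && rc == 0; i++)
        if (write(fd, zeros, WAL_BLOCK) != WAL_BLOCK)
            rc = -1;
    if (rc == 0 && fsync(fd) != 0)  /* make the allocation itself durable */
        rc = -1;
    close(fd);
    free(zeros);
    return rc;
}

/* Hypothetical helper: overwrite block `blkno` through O_DIRECT only.
 * Returns 0 on success, -1 on failure with errno set. */
static int
direct_write_block(const char *path, int blkno, const void *data, size_t len)
{
    void   *buf = NULL;
    int     fd;
    int     err;
    int     saved_errno;
    ssize_t written;

    if (len > WAL_BLOCK)
    {
        errno = EINVAL;
        return -1;
    }
    /* O_DIRECT wants an aligned buffer, offset and length. */
    err = posix_memalign(&buf, WAL_BLOCK, WAL_BLOCK);
    if (err != 0)
    {
        errno = err;
        return -1;
    }
    assert((uintptr_t) buf % WAL_BLOCK == 0);
    memset(buf, 0, WAL_BLOCK);
    memcpy(buf, data, len);

    fd = open(path, O_WRONLY | O_DIRECT);   /* no O_SYNC, no O_DSYNC */
    if (fd < 0)
    {
        saved_errno = errno;    /* e.g. EINVAL if the fs lacks O_DIRECT */
        free(buf);
        errno = saved_errno;
        return -1;
    }
    written = pwrite(fd, buf, WAL_BLOCK, (off_t) blkno * WAL_BLOCK);
    saved_errno = errno;
    close(fd);
    free(buf);
    errno = saved_errno;
    return written == WAL_BLOCK ? 0 : -1;
}
```

A caller would run preallocate("seg", 4) once and then
direct_write_block("seg", 1, "hello", 5) for each block; the file size
never changes, which is the property the patch relies on.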

The advantage of using O_DIRECT alone is that no fsync()/fdatasync() is
issued.  All of the other wal_sync_methods use fsync()/fdatasync(),
either explicitly or implicitly (via the O_SYNC and O_DSYNC open flags).
fsync()/fdatasync() can be very slow on ext3, because it appears to
always wait for the current filesystem metadata transaction to complete,
even if that transaction is completely unrelated to the file being
fsync'ed.  There can be many metadata operations happening on the data
files, so a WAL fsync can end up waiting on them.  Since O_DIRECT does
not do any fsync()/fdatasync(), it avoids this bottleneck and can finish
more quickly on average.  The open_sync and open_datasync options do not
have this benefit, because they do the equivalent of an
fsync()/fdatasync() after every WAL write.
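
As a rough model of the difference, the sketch below maps each method to
the flags it opens the WAL with and to whether a separate flush call
follows the write.  It is an illustration loosely modelled on
get_sync_bit() and issue_xlog_fsync() in the patch, not a copy of them:
WalSyncMethod, WalSyncPlan and wal_sync_plan() are hypothetical names,
fsync_writethrough is left out, and the wal_is_consumed parameter stands
in for the XLogIsNeeded() rule covered in the next paragraph.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* asks glibc to expose O_DIRECT */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without O_DIRECT */
#endif

typedef enum
{
    WAL_FSYNC,
    WAL_FDATASYNC,
    WAL_OPEN_SYNC,
    WAL_OPEN_DATASYNC,
    WAL_OPEN_DIRECT             /* the option proposed in this patch */
} WalSyncMethod;

typedef struct
{
    int     open_flags;         /* extra flags passed to open() */
    bool    explicit_flush;     /* fsync()/fdatasync() after the write? */
} WalSyncPlan;

/* wal_is_consumed: true when the archiver or a standby will read the WAL. */
static WalSyncPlan
wal_sync_plan(WalSyncMethod method, bool wal_is_consumed)
{
    /* open_sync/open_datasync add O_DIRECT only if nobody rereads the WAL */
    int         maybe_direct = wal_is_consumed ? 0 : O_DIRECT;
    WalSyncPlan plan = {0, false};

    switch (method)
    {
        case WAL_FSYNC:
        case WAL_FDATASYNC:
            plan.explicit_flush = true;     /* flush issued separately */
            break;
        case WAL_OPEN_SYNC:
            plan.open_flags = O_SYNC | maybe_direct;
            break;
        case WAL_OPEN_DATASYNC:
            plan.open_flags = O_DSYNC | maybe_direct;
            break;
        case WAL_OPEN_DIRECT:
            plan.open_flags = O_DIRECT;     /* always, and nothing else */
            break;
        default:
            assert(!"unrecognized sync method");
    }
    return plan;
}
```

Only open_direct ends up with neither a sync flag nor an explicit flush,
which is where the savings described above would come from.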

For the open_sync and open_datasync options, O_DIRECT is used for writes
only if the xlog will not need to be consumed by the archiver or
hot standby.  I am not keying the open_direct behavior on whether
XLogIsNeeded() is true, because we see a performance gain even when
archiving is enabled (using a simple script that copies and compresses
the log segments).  For a 2-processor, 50-warehouse DBT2 run on SLES 11,
I get the following NOTPM results:

                      wal_sync_method
                 fdatasync   open_direct  open_sync

archiving off:     17076       18481       17094
archiving on:      15704       16923       15898


Do folks have any interest in this change, or comments on its
usefulness/correctness?  It would just be an extra wal_sync_method
option that users can try out, with benefits for certain
configurations.

Dan
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..a830a01 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -122,6 +122,7 @@ const struct config_enum_entry sync_method_options[] = {
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
 #endif
+	{"open_direct", SYNC_METHOD_OPEN_DIRECT, false},
 	{NULL, 0, false}
 };
 
@@ -1925,7 +1926,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 * fsync more than one file.
 		 */
 		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
+			sync_method != SYNC_METHOD_OPEN_DSYNC &&
+			sync_method != SYNC_METHOD_OPEN_DIRECT)
 		{
 			if (openLogFile >= 0 &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
@@ -8958,6 +8960,15 @@ get_sync_bit(int method)
 		case SYNC_METHOD_OPEN_DSYNC:
 			return OPEN_DATASYNC_FLAG | o_direct_flag;
 #endif
+		case SYNC_METHOD_OPEN_DIRECT:
+			/*
+			 * Open the log with O_DIRECT flag only.  O_DIRECT guarantees
+			 * that data is written to disk when the IO completes if and
+			 * only if the file is fully allocated.  Fortunately, the log
+			 * files are always fully allocated by XLogFileInit() (or are
+			 * recycled from a fully-allocated log).
+			 */
+			return O_DIRECT;
 		default:
 			/* can't happen (unless we are out of sync with option array) */
 			elog(ERROR, "unrecognized wal_sync_method: %d", method);
@@ -9031,6 +9042,7 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
+		case SYNC_METHOD_OPEN_DIRECT:
 			/* write synced it already */
 			break;
 		default:
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 400c52b..97acde5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -564,3 +564,4 @@
 #------------------------------------------------------------------------------
 
 # Add settings for extensions here
+wal_sync_method = open_direct
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..b888ee7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -83,6 +83,7 @@ typedef struct XLogRecord
 #define SYNC_METHOD_OPEN		2		/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4		/* for O_DSYNC */
+#define SYNC_METHOD_OPEN_DIRECT	5		/* for O_DIRECT */
 extern int	sync_method;
 
 /*