When running Postgres on a single ext3 filesystem on Linux, we find that the attached simple patch gives a significant performance benefit (7-8% in the numbers below). The patch adds a new option for wal_sync_method, "open_direct". With this option, the WAL is always opened with O_DIRECT (but not O_SYNC or O_DSYNC). On Linux, using only O_DIRECT should be correct: all WAL segments are fully allocated before being used, and the WAL buffers are 8K-aligned, so all direct writes are guaranteed to complete before returning. (See http://lwn.net/Articles/348739/)
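
The two conditions named above (8K-aligned buffers, writes that stay inside an already-allocated segment) can be sketched as a small check. This is only an illustration: the SKETCH_* constants and sketch_direct_write_ok() are hypothetical names, not part of PostgreSQL or of the patch, and the 16MB segment size is an assumption about the build.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch (assumptions, not taken from the patch). */
#define SKETCH_BLOCK_SIZE   ((uint64_t) 8192)               /* assumed 8K buffer alignment */
#define SKETCH_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)   /* assumed preallocated segment size */

/*
 * Return true if writing len bytes from buf at file offset off meets the
 * conditions the post relies on for O_DIRECT-only writes: the buffer, the
 * offset and the length are all multiples of the block size, and the write
 * stays inside the preallocated segment (so it never extends the file).
 * The buffer is only inspected for alignment; it is never dereferenced.
 */
static bool
sketch_direct_write_ok(const void *buf, uint64_t off, size_t len)
{
    if (len == 0)
        return false;
    if ((uintptr_t) buf % SKETCH_BLOCK_SIZE != 0)
        return false;               /* buffer not 8K-aligned */
    if (off % SKETCH_BLOCK_SIZE != 0 || len % SKETCH_BLOCK_SIZE != 0)
        return false;               /* partial-block write */
    if (off > SKETCH_SEGMENT_SIZE || len > SKETCH_SEGMENT_SIZE - off)
        return false;               /* would run past the allocated space */
    return true;
}
```

On POSIX systems a buffer with this alignment is typically obtained with posix_memalign(); a write that fails the check would either be rejected by the kernel or need metadata updates that O_DIRECT alone does not make durable.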

The advantage of using O_DIRECT is that no fsync()/fdatasync() is used. All of the other wal_sync_methods use fsync/fdatasync, either explicitly or implicitly (via the O_SYNC and O_DSYNC options). fsync/fdatasync can be very slow on ext3, because it seems to always have to wait for the current filesystem metadata transaction to complete, even if that metadata operation is completely unrelated to the file being fsync'ed. There can be many metadata operations happening on the data files, so the WAL fsync can end up waiting for metadata operations on the data files. Since O_DIRECT does not do any fsync/fdatasync operation, it avoids this bottleneck and can finish more quickly on average. The open_sync and open_datasync options do not have this benefit, because they do the equivalent of an fsync/fdatasync after every WAL write.

For the open_sync and open_datasync options, O_DIRECT is used for writes only if the xlog will not need to be consumed by the archiver or hot standby. I am not keying the open_direct behavior on whether XLogIsNeeded() is true, because we see a performance gain even when archiving is enabled (using a simple script that copies and compresses the log segments).

For a 2-processor, 50-warehouse DBT2 run on SLES 11, I get the following NOTPM results:

wal_sync_method    fdatasync    open_direct    open_sync
archiving off:         17076          18481        17094
archiving on:          15704          16923        15898

Do folks have any interest in this change, or comments on its usefulness/correctness? It would be just an extra option for wal_sync_method that users can try out, and it has benefits for certain configurations.

Dan
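
As a quick check of the relative gain implied by the NOTPM figures above, here is a minimal sketch of the arithmetic. sketch_gain_percent() is a hypothetical helper written for this note, not something in the patch.

```c
#include <assert.h>

/*
 * Percentage change of candidate relative to baseline.
 * Both arguments are throughput figures (here: NOTPM); baseline must be positive.
 */
static double
sketch_gain_percent(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

With fdatasync as the baseline, sketch_gain_percent(17076, 18481) comes to about 8.2 (archiving off) and sketch_gain_percent(15704, 16923) to about 7.8 (archiving on).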

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 266c0de..a830a01 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -122,6 +122,7 @@ const struct config_enum_entry sync_method_options[] = {
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
 #endif
+	{"open_direct", SYNC_METHOD_OPEN_DIRECT, false},
 	{NULL, 0, false}
 };
 
@@ -1925,7 +1926,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 		 * fsync more than one file.
 		 */
 		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
+			sync_method != SYNC_METHOD_OPEN_DSYNC &&
+			sync_method != SYNC_METHOD_OPEN_DIRECT)
 		{
 			if (openLogFile >= 0 &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
@@ -8958,6 +8960,15 @@ get_sync_bit(int method)
 		case SYNC_METHOD_OPEN_DSYNC:
 			return OPEN_DATASYNC_FLAG | o_direct_flag;
 #endif
+		case SYNC_METHOD_OPEN_DIRECT:
+			/*
+			 * Open the log with O_DIRECT flag only. O_DIRECT guarantees
+			 * that data is written to disk when the IO completes if and
+			 * only if the file is fully allocated. Fortunately, the log
+			 * files are always fully allocated by XLogFileInit() (or are
+			 * recycled from a fully-allocated log).
+			 */
+			return O_DIRECT;
 		default:
 			/* can't happen (unless we are out of sync with option array) */
 			elog(ERROR, "unrecognized wal_sync_method: %d", method);
@@ -9031,6 +9042,7 @@ issue_xlog_fsync(int fd, uint32 log, uint32 seg)
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
+		case SYNC_METHOD_OPEN_DIRECT:
 			/* write synced it already */
 			break;
 		default:
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 400c52b..97acde5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -564,3 +564,4 @@
 #------------------------------------------------------------------------------
 
 # Add settings for extensions here
+wal_sync_method = open_direct
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f8aecef..b888ee7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -83,6 +83,7 @@ typedef struct XLogRecord
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4		/* for O_DSYNC */
+#define SYNC_METHOD_OPEN_DIRECT	5		/* for O_DIRECT */
 extern int	sync_method;
 
 /*
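
For readers skimming the patch, the behavior that the xlog.c hunks touch can be summarized in a stand-alone sketch: each wal_sync_method either asks for durability through an open() flag or issues an explicit flush after the write. The SketchSyncMethod type, the table and sketch_lookup() are hypothetical names made up for this illustration; the table is deliberately partial (fsync_writethrough is omitted) and stores flag names as strings, so it does not depend on which flags a given platform defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of one wal_sync_method setting (not a PostgreSQL type). */
typedef struct
{
    const char *name;            /* value of wal_sync_method */
    const char *open_flag;       /* sync-related flag given to open(), "" if none */
    bool        explicit_flush;  /* separate fsync()/fdatasync() call after the write? */
} SketchSyncMethod;

/*
 * open_sync and open_datasync may additionally use O_DIRECT when the WAL is
 * not needed by the archiver; that detail is left out of this table.
 */
static const SketchSyncMethod sketch_methods[] = {
    {"fsync",         "",         true},
    {"fdatasync",     "",         true},
    {"open_sync",     "O_SYNC",   false},
    {"open_datasync", "O_DSYNC",  false},
    {"open_direct",   "O_DIRECT", false},   /* the option proposed by the patch */
};

/* Look up a method by name; returns NULL for an unrecognized name. */
static const SketchSyncMethod *
sketch_lookup(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(sketch_methods) / sizeof(sketch_methods[0]); i++)
    {
        if (strcmp(sketch_methods[i].name, name) == 0)
            return &sketch_methods[i];
    }
    return NULL;
}
```

In this sketch open_direct sits with the open_* methods: like them it needs no explicit flush, but unlike them its open flag does not imply an fsync/fdatasync-equivalent on every write.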

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers