Hi,
I have created a patch that improves the checkpoint IO scheduler to give stable
transaction response times.
* Problems in the checkpoint IO schedule under heavy transaction load
Under heavy transaction load, I think the PostgreSQL checkpoint scheduler has
two problems, one at the start and one at the end of a checkpoint.

The first problem is heavy IO at the start of each checkpoint. It is caused by
full-page writes: the first modification of each page after a checkpoint starts
writes the whole page image to WAL, so WAL IO rises sharply. The WAL-based
checkpoint scheduler then wrongly judges that the checkpoint is behind schedule,
even though it is not, and writes pages without sleeping. This makes transaction
response worse. I think the WAL-based checkpoint scheduler is not appropriate at
the start of a checkpoint.

The second problem is an fsync freeze at the end of a checkpoint. Normally, the
pages written by the checkpoint are flushed in the background by the OS's IO
scheduler. When that does not work well, the fsync calls at the end of the
checkpoint cause an IO freeze and slow down transactions. Unexpectedly slow
transactions can trigger monitoring errors in an HA cluster and degrade the user
experience of the application service. This is especially serious for database
systems on cloud and virtual servers, which have limited IO performance.

However, postgresql.conf offers few parameters that address this. We prefer fast
transaction response over a short checkpoint time. In fact, checkpoints are
short, and making them a little longer is not a problem. You may think that
checkpoint_segments and checkpoint_timeout should simply be set larger. However,
a large checkpoint_segments lets WAL files, which are never read back, occupy
and waste the file cache, and a large checkpoint_timeout leads to long crash
recovery.
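To illustrate the first problem, here is a minimal sketch of the WAL-based part
of the on-schedule check, simplified from IsCheckpointOnSchedule(). The function
name wal_on_schedule and its parameters are hypothetical and exist only for this
illustration; the real code works on XLogRecPtr values and also has a time-based
check.

```c
#include <stdbool.h>

/*
 * Simplified model of the WAL-based on-schedule check (illustration only).
 *
 * progress:            fraction of dirty buffers already written, 0.0 - 1.0
 * completion_target:   checkpoint_completion_target
 * wal_segments_used:   WAL segments consumed since the checkpoint started
 * checkpoint_segments: checkpoint_segments setting
 *
 * Returns true when the checkpoint is judged to be on schedule.
 */
bool
wal_on_schedule(double progress, double completion_target,
                double wal_segments_used, int checkpoint_segments)
{
    /* Buffer-write progress scaled by the completion target. */
    double scaled_progress = progress * completion_target;

    /* Fraction of the WAL budget that has been consumed so far. */
    double elapsed_xlogs = wal_segments_used / checkpoint_segments;

    /* "Late" means the WAL budget is being consumed faster than buffers are written. */
    return scaled_progress >= elapsed_xlogs;
}
```

With made-up numbers (not taken from the benchmark), wal_on_schedule(0.05, 0.5,
2.0, 100) reports on schedule, while wal_on_schedule(0.05, 0.5, 6.0, 100)
reports late. The buffer-write progress is identical in both calls; only the WAL
volume differs, which is what a burst of full-page writes produces.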
* Improvements to the checkpoint IO scheduler
1. Improving the heavy full-page-write IO at the start of a checkpoint
My idea is very simple: at the start of a checkpoint, the schedule derived from
checkpoint_completion_target is made looser. I added three parameters for this:
'checkpoint_smooth_target', 'checkpoint_smooth_margin' and
'checkpointer_write_delay'. 'checkpoint_smooth_target' is the point in
checkpoint progress up to which the smooth IO schedule is applied.
'checkpoint_smooth_margin' makes the checkpoint schedule even smoother. It is a
heuristic parameter, but it solves this problem effectively.
'checkpointer_write_delay' is the sleep time used by the checkpoint schedule. It
is nearly the same as 'bgwriter_delay' in PG 9.1 and older.
For more details, please see the attached patch.
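The scaling used by the smooth schedule can be sketched as a standalone
function. It follows the formula in the patched IsCheckpointOnSchedule(); the
function name smooth_scaled_progress and the parameter names are hypothetical,
chosen for this illustration.

```c
/*
 * Scaled progress under the smooth checkpoint schedule (illustration only).
 *
 * progress:          fraction of dirty buffers already written, 0.0 - 1.0
 * completion_target: checkpoint_completion_target
 * smooth_target:     checkpoint_smooth_target
 * smooth_margin:     checkpoint_smooth_margin
 */
double
smooth_scaled_progress(double progress, double completion_target,
                       double smooth_target, double smooth_margin)
{
    double weight;
    double multiplier;

    /* Past the smooth target (or smoothing disabled with 0.0): normal schedule. */
    if (progress >= smooth_target)
        return progress * completion_target;

    /*
     * The multiplier falls linearly from (1 + smooth_margin) at progress 0
     * to completion_target at progress == smooth_target.
     */
    weight = (smooth_target - progress) / smooth_target;
    multiplier = weight * (smooth_margin + 1.0 - completion_target)
        + completion_target;

    return progress * multiplier;
}
```

For example, with checkpoint_completion_target = 0.5, checkpoint_smooth_target =
0.4 and checkpoint_smooth_margin = 0.2 (illustrative values), a progress of 0.1
is scaled to 0.1025 instead of 0.05. Because the scheduler declares the
checkpoint late when the scaled progress is below the elapsed WAL or time
fraction, a larger scaled value early on makes a "late" verdict less likely.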
2. Improving the fsync freeze at the end of a checkpoint
When an fsync freeze happens, issuing further fsyncs back to back is pointless
and stalls transactions. So I think that if an fsync took a long time, the IO
queue is flooded, and IO priority should be given to transactions to keep their
response time short. The patch does this by sleeping after a slow fsync: if an
fsync takes longer than 'checkpointer_fsync_delay_threshold', the checkpointer
sleeps for the fsync time multiplied by 'checkpointer_fsync_delay_ratio'. This
may look like it makes the checkpoint much longer, but it does not. In fact,
when an fsync takes a long time, the IO queue is full of other IO, including
checkpoint writes, so the sleep only gives IO priority to running transactions.
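The sleep rule can be sketched as a small helper. This is a simplified model of
the check added to mdsync() in the patch; the function name fsync_sleep_usec is
hypothetical, and the sketch leaves out the patch's additional conditions (no
sleep when a shutdown or an immediate checkpoint has been requested).

```c
/*
 * Sleep time in microseconds after one fsync, or 0 for no sleep
 * (illustration only).
 *
 * elapsed_usec: measured duration of the fsync, in microseconds
 * threshold_ms: checkpointer_fsync_delay_threshold, -1 disables the feature
 * ratio:        checkpointer_fsync_delay_ratio, 0.0 - 1.0
 */
long
fsync_sleep_usec(long elapsed_usec, int threshold_ms, double ratio)
{
    long elapsed_ms = elapsed_usec / 1000;

    /* Feature disabled, or the fsync was fast enough: do not sleep. */
    if (threshold_ms < 0 || elapsed_ms <= threshold_ms)
        return 0;

    /* Sleep for a fraction of the time the slow fsync took. */
    return (long) (elapsed_ms * ratio * 1000.0);
}
```

For example, with a threshold of 100 ms and a ratio of 0.5 (illustrative
values), an fsync that took 500 ms is followed by a 250 ms sleep, and an fsync
that took 50 ms is followed by no sleep.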
I tested my patch with the DBT-2 benchmark; please see the test results. The
patched server achieves higher throughput and faster response than plain PG.
Checkpoints take a little longer than with plain PG, but not seriously so.
* DBT-2 results with this patch (compared with original PG 9.2.4)
I used the DBT-2 benchmark software from OSDL. I also used pg_statsinfo and
pg_stats_reporter in this benchmark.
- Patched PG (patched 9.2.4)
DBT-2 result: http://goo.gl/1PD3l
statsinfo report: http://goo.gl/UlGAO
settings: http://goo.gl/X4Whu
- Original PG (9.2.4)
DBT-2 result: http://goo.gl/XVxtj
statsinfo report: http://goo.gl/UT1Li
settings: http://goo.gl/eofmb
The measurement value improved by 4%, 'new-order 90%tile' by 20%,
'new-order average' by 18%, 'new-order deviation' by 24%, and
'new-order maximum' by 27%. I confirmed high throughput and WAL IO during
checkpoints in the pg_stats_reporter report. My patch gives fast transaction
response and keeps running transactions from blocking.
The downside of my patch is longer checkpoints. Checkpoint time increased by
about 10% - 20%. But checkpoints still complete correctly on schedule within
checkpoint_timeout. Please see the checkpoint results (http://goo.gl/NsbC6).
* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
This is not an advertisement for pg_statsinfo and pg_stats_reporter :-) They are
free software. If you have comments or other ideas about my patch, please send
them to me.
Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..a66ce36 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,16 +141,21 @@ static CheckpointerShmemStruct *CheckpointerShmem;
/*
* GUC parameters
*/
+int CheckPointerWriteDelay = 200;
int CheckPointTimeout = 300;
int CheckPointWarning = 30;
+int CheckPointerFsyncDelayThreshold = -1;
double CheckPointCompletionTarget = 0.5;
+double CheckPointSmoothTarget = 0.0;
+double CheckPointSmoothMargin = 0.0;
+double CheckPointerFsyncDelayRatio = 0.0;
/*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+volatile sig_atomic_t checkpoint_requested = false;
+volatile sig_atomic_t shutdown_requested = false;
/*
* Private state
@@ -169,7 +174,6 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
@@ -643,7 +647,7 @@ CheckArchiveTimeout(void)
* this does not check the *current* checkpoint's IMMEDIATE flag, but whether
* there is one pending behind it.)
*/
-static bool
+extern bool
ImmediateCheckpointRequested(void)
{
if (checkpoint_requested)
@@ -715,7 +719,7 @@ CheckpointWriteDelay(int flags, double progress)
* Checkpointer and bgwriter are no longer related so take the Big
* Sleep.
*/
- pg_usleep(100000L);
+ pg_usleep(CheckPointerWriteDelay * 1000L);
}
else if (--absorb_counter <= 0)
{
@@ -742,14 +746,35 @@ IsCheckpointOnSchedule(double progress)
{
XLogRecPtr recptr;
struct timeval now;
- double elapsed_xlogs,
+ double original_progress,
+ elapsed_xlogs,
elapsed_time;
Assert(ckpt_active);
- /* Scale progress according to checkpoint_completion_target. */
- progress *= CheckPointCompletionTarget;
-
+ /* Progress scaled by the normal schedule; used by the smooth checkpoint schedule. */
+ original_progress = progress * CheckPointCompletionTarget;
+
+ /* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+ if(progress >= CheckPointSmoothTarget)
+ {
+ /* Normal checkpoint schedule. */
+ progress *= CheckPointCompletionTarget;
+ }
+ else
+ {
+ /* Smooth checkpoint schedule.
+ *
+ * At the start of a checkpoint, the IO load average tends to be high
+ * and transactions tend to be slow. This schedule reduces both and
+ * improves IO response. As 'progress' approaches CheckPointSmoothTarget,
+ * it converges to the normal checkpoint schedule. For a smoother
+ * checkpoint schedule, set a higher CheckPointSmoothTarget.
+ */
+ progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) *
+ (CheckPointSmoothMargin + 1 - CheckPointCompletionTarget)
+ + CheckPointCompletionTarget;
+ }
/*
* Check against the cached value first. Only do the more expensive
* calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
ckpt_cached_elapsed = elapsed_xlogs;
return false;
}
+ else if (original_progress < elapsed_xlogs)
+ {
+ ckpt_cached_elapsed = elapsed_xlogs;
+
+ /* smooth checkpoint write */
+ pg_usleep(CheckPointerWriteDelay * 1000L);
+ return false;
+ }
}
/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
ckpt_cached_elapsed = elapsed_time;
return false;
}
+ else if (original_progress < elapsed_time)
+ {
+ ckpt_cached_elapsed = elapsed_time;
+
+ /* smooth checkpoint write */
+ pg_usleep(CheckPointerWriteDelay * 1000L);
+ return false;
+ }
/* It looks like we're on schedule. */
return true;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..e558eb7 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
*/
#include "postgres.h"
+#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
@@ -162,6 +163,8 @@ static List *pendingUnlinks = NIL;
static CycleCtr mdsync_cycle_ctr = 0;
static CycleCtr mdckpt_cycle_ctr = 0;
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
typedef enum /* behavior for mdopen & _mdfd_getseg */
{
@@ -1171,6 +1174,18 @@ mdsync(void)
FilePathName(seg->mdfd_vfd),
(double) elapsed / 1000);
+ /*
+ * If this fsync took a long time, sleep for 'fsync time * checkpointer_fsync_delay_ratio'
+ * to give priority to running transactions.
+ */
+ if( CheckPointerFsyncDelayThreshold >= 0 &&
+ !shutdown_requested &&
+ !ImmediateCheckpointRequested() &&
+ (elapsed / 1000 > CheckPointerFsyncDelayThreshold)){
+ pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+ elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+ (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+ }
break; /* out of retry loop */
}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..f3fa5ab 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,30 @@ static struct config_int ConfigureNamesInt[] =
},
{
+ {"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("checkpointer sleep time during dirty buffers write in checkpoint."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerWriteDelay,
+ 200, 10, 10000,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("If a file fsync takes longer than this threshold, the checkpointer sleeps for the fsync time multiplied by checkpointer_fsync_delay_ratio."),
+ NULL,
+ GUC_UNIT_MS
+ },
+ &CheckPointerFsyncDelayThreshold,
+ -1, -1, 1000000,
+ NULL, NULL, NULL
+ },
+
+
+
+ {
{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
NULL,
@@ -2551,6 +2575,36 @@ static struct config_real ConfigureNamesReal[] =
NULL, NULL, NULL
},
+ {
+ {"checkpoint_smooth_target", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("Smooth control IO load between starting checkpoint and this target parameter in progress of checkpoint."),
+ NULL
+ },
+ &CheckPointSmoothTarget,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpoint_smooth_margin", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("More smooth control IO load between starting checkpoint and checkpoint_smooth_target."),
+ NULL
+ },
+ &CheckPointSmoothMargin,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ gettext_noop("Fraction of the fsync time that the checkpointer sleeps after a slow file fsync in checkpoint."),
+ NULL
+ },
+ &CheckPointerFsyncDelayRatio,
+ 0.0, 0.0, 1.0,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..9c07bd8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -185,7 +185,12 @@
#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
+#checkpoint_smooth_target = 0.0 # smooth checkpoint target, 0.0 - 1.0
+#checkpoint_smooth_margin = 0.0 # smooth checkpoint margin, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
+#checkpointer_write_delay = 200ms # 10-10000 milliseconds
+#checkpointer_fsync_delay_ratio = 0.0 # range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 # range 0 - 1000000 milliseconds, -1 disables
# - Archiving -
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..5964b99 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,9 +21,14 @@
/* GUC options */
extern int BgWriterDelay;
+extern int CheckPointerWriteDelay;
extern int CheckPointTimeout;
extern int CheckPointWarning;
+extern int CheckPointerFsyncDelayThreshold;
extern double CheckPointCompletionTarget;
+extern double CheckPointSmoothTarget;
+extern double CheckPointSmoothMargin;
+extern double CheckPointerFsyncDelayRatio;
extern void BackgroundWriterMain(void) __attribute__((noreturn));
extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +36,7 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
+extern bool ImmediateCheckpointRequested(void);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
BlockNumber segno);
extern void AbsorbFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
RESOURCES_KERNEL,
RESOURCES_VACUUM_DELAY,
RESOURCES_BGWRITER,
+ RESOURCES_CHECKPOINTER,
RESOURCES_ASYNCHRONOUS,
WAL,
WAL_SETTINGS,
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)