[HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

KONDO Mitsumasa Mon, 10 Jun 2013 03:48:32 -0700

Hi,

I have created a patch that improves the checkpoint IO scheduler for stable transaction responses.


* Problems in checkpoint IO scheduling under heavy transaction load

When the database is under heavy transaction load, I think the PostgreSQL checkpoint scheduler has two problems: one at the start and one at the end of a checkpoint.

The first problem is heavy IO at the start of each checkpoint. It is caused by full-page writes: the first write to a page after a checkpoint starts puts the full page image into WAL, so WAL IO jumps right after the checkpoint begins. As a result, the WAL-based checkpoint scheduler wrongly judges that the checkpoint is behind schedule, even though it is not. This causes bad transaction response. I think the WAL-based checkpoint scheduler does not work properly at the start of a checkpoint.

The second problem is the fsync freeze at the end of a checkpoint. Normally, checkpoint writes are executed in the background by the OS's IO scheduler. But when that does not work well, the fsync at the end of the checkpoint causes an IO freeze and slows transactions.

Unexpectedly slow transactions cause monitoring errors in HA clusters and degrade the user experience of application services. This is an especially serious problem for database systems on cloud and virtual servers, which do not have much IO performance. However, postgresql.conf offers few parameters that help. We prefer fast transaction responses to short checkpoint times. In fact checkpoint time is short, so making it a little longer is not a problem. You may think that checkpoint_segments and checkpoint_timeout could be set to larger values. However, a large checkpoint_segments wastes file cache on data that is never read, and a large checkpoint_timeout leads to long crash recovery.
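To make the misjudgment in the first problem concrete, here is a minimal sketch of the on-schedule test in simplified form. The function and parameter names are illustrative assumptions, not taken from the PostgreSQL source; the real logic is in IsCheckpointOnSchedule() and works on WAL positions and timestamps rather than ready-made fractions.

```c
#include <stdbool.h>

/*
 * Simplified model of the checkpoint on-schedule test.
 * All names are illustrative.
 *
 * progress:          fraction of dirty buffers already written (0.0 - 1.0)
 * completion_target: checkpoint_completion_target
 * wal_fraction:      WAL consumed since checkpoint start / checkpoint_segments
 * time_fraction:     time elapsed since checkpoint start / checkpoint_timeout
 */
bool
checkpoint_on_schedule(double progress, double completion_target,
                       double wal_fraction, double time_fraction)
{
    double scaled = progress * completion_target;

    if (scaled < wal_fraction)
        return false;           /* behind according to WAL volume */
    if (scaled < time_fraction)
        return false;           /* behind according to elapsed time */
    return true;
}
```

With progress 0.10, completion target 0.5 and time fraction 0.04, a WAL fraction of 0.03 is judged on schedule, while a WAL fraction of 0.125 (for example, inflated by full-page writes) is judged late, although the buffers written and the elapsed time are the same in both cases.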



* Improvements to the checkpoint IO scheduler
1. Improving the heavy full-page-write IO problem at the start of a checkpoint

My idea is very simple. At the start of a checkpoint, checkpoint_completion_target is made looser. I added three parameters for this: 'checkpoint_smooth_target', 'checkpoint_smooth_margin' and 'checkpointer_write_delay'. 'checkpoint_smooth_target' is the point in checkpoint progress up to which the smooth checkpoint IO schedule applies. 'checkpoint_smooth_margin' makes the checkpoint schedule even smoother. It is a heuristic parameter, but it solves this problem effectively. 'checkpointer_write_delay' is the sleep time for the checkpoint schedule. It is nearly the same as 'bgwriter_delay' in PG 9.1 and older.

For more details, please see the attached patch.
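The progress scaling that the patch applies in IsCheckpointOnSchedule() can be sketched as a standalone function. This is only an illustration of the formula in the attached diff; the function name and parameter names are assumptions, and the sketch assumes progress is never negative.

```c
/*
 * Sketch of the smooth progress scaling from the patch.
 * Names are illustrative, not from the patch itself.
 *
 * Below smooth_target the scale factor falls linearly from
 * (1 + smooth_margin) at progress 0 down to completion_target
 * at progress == smooth_target. At or above smooth_target the
 * normal scaling by completion_target applies.
 */
double
scale_progress(double progress, double completion_target,
               double smooth_target, double smooth_margin)
{
    double factor;

    if (progress >= smooth_target)
        return progress * completion_target;    /* normal schedule */

    /* Here smooth_target > progress >= 0, so the division is safe. */
    factor = ((smooth_target - progress) / smooth_target)
             * (smooth_margin + 1.0 - completion_target)
             + completion_target;
    return progress * factor;                    /* smooth schedule */
}
```

For example, with checkpoint_completion_target = 0.5, checkpoint_smooth_target = 0.4 and checkpoint_smooth_margin = 0.2, a progress of 0.1 scales to about 0.1025 instead of 0.05. The larger scaled value makes the checkpoint look further ahead of the WAL and time fractions, so it is less likely to be judged late early on. With the default checkpoint_smooth_target = 0.0 the first branch is always taken and the behaviour is unchanged.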

2. Improving the fsync freeze problem at the end of a checkpoint

When the fsync freeze happens, issuing further file fsyncs back to back is pointless and stalls transactions. So I think that if an fsync took a long time, the IO queue is flooded and IO priority should be given to transactions for fast response time. This is done by inserting a sleep between fsyncs when the fsync time was long. It may seem that this makes the checkpoint take a long time, but it does not become very long. In fact, when fsync time is long, the IO queue is packed with other IO, including checkpoint writes, so the sleep only gives IO priority to the transactions that are executing.

I tested my patch with the DBT-2 benchmark. Please see the test results. My patch achieves higher throughput and faster response than plain PG. Checkpoint time is a little longer than with plain PG, but not seriously so.
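The fsync sleep rule described above can be sketched as follows. This is a simplified illustration of the condition added to mdsync() in the attached diff; the function name is an assumption, and the checks for a pending shutdown or immediate checkpoint request are left out.

```c
/*
 * Sketch of the sleep inserted after a slow fsync.
 * The function name is illustrative; the shutdown and
 * immediate-checkpoint checks from the patch are omitted.
 *
 * elapsed_usec: duration of the fsync just completed, in microseconds
 * threshold_ms: checkpointer_fsync_delay_threshold (-1 disables)
 * ratio:        checkpointer_fsync_delay_ratio (0.0 - 1.0)
 *
 * Returns the time to sleep, in microseconds.
 */
long long
fsync_sleep_usec(long long elapsed_usec, int threshold_ms, double ratio)
{
    long long elapsed_ms = elapsed_usec / 1000;  /* integer msec, as in the patch */

    if (threshold_ms < 0)
        return 0;               /* feature disabled */
    if (elapsed_ms <= threshold_ms)
        return 0;               /* fsync was fast enough */

    return (long long) (elapsed_ms * ratio * 1000.0);
}
```

For example, an fsync that took 800 ms with a threshold of 500 ms and a ratio of 0.5 gives a 400 ms sleep before the next fsync. An fsync of 300 ms with the same settings gives no sleep at all.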



* Result of DBT-2 with this patch. (Compared with original PG9.2.4)

I used the DBT-2 benchmark software by OSDL. I also used pg_statsinfo and pg_stats_reporter in this benchmark.


  - Patched PG (patched 9.2.4)
    DBT-2 result:     http://goo.gl/1PD3l
    statsinfo report: http://goo.gl/UlGAO
    settings:         http://goo.gl/X4Whu

  - Original PG (9.2.4)
    DBT-2 result:     http://goo.gl/XVxtj
    statsinfo report: http://goo.gl/UT1Li
    settings:         http://goo.gl/eofmb

The measured value improved by 4%, 'new-order 90%tile' by 20%, 'new-order average' by 18%, 'new-order deviation' by 24%, and 'new-order maximum' by 27%. In pg_stats_reporter's report I confirmed high throughput and WAL IO while the checkpoint was executing. My patch achieves fast transaction responses and non-blocking transaction execution.

The drawback of my patch is longer checkpoints. Checkpoint time increased by about 10% - 20%. But checkpoints still complete on schedule within checkpoint_timeout. Please see the checkpoint results (http://goo.gl/NsbC6).


* Test server
  Server: HP Proliant DL360 G7
  CPU:    Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB(PC3-10600R-9)
  Disk:   146GB(15k)*4 RAID1+0
  RAID controller: P410i/256MB

This is not an advertisement for pg_statsinfo and pg_stats_reporter :-) They are free software. If you have comments or other ideas about my patch, please send them to me.


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fdf6625..a66ce36 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -141,16 +141,21 @@ static CheckpointerShmemStruct *CheckpointerShmem;
 /*
  * GUC parameters
  */
+int			CheckPointerWriteDelay = 200;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
+int			CheckPointerFsyncDelayThreshold = -1;
 double		CheckPointCompletionTarget = 0.5;
+double		CheckPointSmoothTarget = 0.0;
+double		CheckPointSmoothMargin = 0.0;
+double		CheckPointerFsyncDelayRatio = 0.0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
-static volatile sig_atomic_t shutdown_requested = false;
+volatile sig_atomic_t checkpoint_requested = false;
+volatile sig_atomic_t shutdown_requested = false;
 
 /*
  * Private state
@@ -169,7 +174,6 @@ static pg_time_t last_xlog_switch_time;
 
 static void CheckArchiveTimeout(void);
 static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
@@ -643,7 +647,7 @@ CheckArchiveTimeout(void)
  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
  * there is one pending behind it.)
  */
-static bool
+extern bool
 ImmediateCheckpointRequested(void)
 {
 	if (checkpoint_requested)
@@ -715,7 +719,7 @@ CheckpointWriteDelay(int flags, double progress)
 		 * Checkpointer and bgwriter are no longer related so take the Big
 		 * Sleep.
 		 */
-		pg_usleep(100000L);
+		pg_usleep(CheckPointerWriteDelay * 1000L);
 	}
 	else if (--absorb_counter <= 0)
 	{
@@ -742,14 +746,35 @@ IsCheckpointOnSchedule(double progress)
 {
 	XLogRecPtr	recptr;
 	struct timeval now;
-	double		elapsed_xlogs,
+	double		original_progress,
+			elapsed_xlogs,
 				elapsed_time;
 
 	Assert(ckpt_active);
 
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
-
+	/* This variable is used by smooth checkpoint schedule.*/
+	original_progress = progress * CheckPointCompletionTarget;
+	
+	/* Scale progress according to checkpoint_completion_target and checkpoint_smooth_target. */
+	if(progress >= CheckPointSmoothTarget)
+	{
+		/* Normal checkpoint schedule. */
+		progress *= CheckPointCompletionTarget;
+	}
+	else
+	{
+		/* Smooth checkpoint schedule.
+		 *
+		 * At the start of a checkpoint, IO load tends to be high and
+		 * transactions slow. This schedule reduces both and improves
+		 * IO response. As 'progress' approaches CheckPointSmoothTarget,
+		 * it becomes close to the normal checkpoint schedule. For a
+		 * smoother checkpoint schedule, set a higher CheckPointSmoothTarget.
+		 */
+		progress *= ((CheckPointSmoothTarget - progress) / CheckPointSmoothTarget) * 
+				(CheckPointSmoothMargin + 1 - CheckPointCompletionTarget)
+				 + CheckPointCompletionTarget;
+	}
 	/*
 	 * Check against the cached value first. Only do the more expensive
 	 * calculations once we reach the target previously calculated. Since
@@ -779,6 +804,14 @@ IsCheckpointOnSchedule(double progress)
 			ckpt_cached_elapsed = elapsed_xlogs;
 			return false;
 		}
+		else if (original_progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+
+			/* smooth checkpoint write */
+			pg_usleep(CheckPointerWriteDelay * 1000L);
+			return false;
+		}
 	}
 
 	/*
@@ -793,6 +826,14 @@ IsCheckpointOnSchedule(double progress)
 		ckpt_cached_elapsed = elapsed_time;
 		return false;
 	}
+	else if (original_progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+		
+		/* smooth checkpoint write */
+		pg_usleep(CheckPointerWriteDelay * 1000L);
+		return false;
+	}
 
 	/* It looks like we're on schedule. */
 	return true;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..e558eb7 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -21,6 +21,7 @@
  */
 #include "postgres.h"
 
+#include <signal.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <sys/file.h>
@@ -162,6 +163,8 @@ static List *pendingUnlinks = NIL;
 static CycleCtr mdsync_cycle_ctr = 0;
 static CycleCtr mdckpt_cycle_ctr = 0;
 
+extern volatile sig_atomic_t checkpoint_requested;
+extern volatile sig_atomic_t shutdown_requested;
 
 typedef enum					/* behavior for mdopen & _mdfd_getseg */
 {
@@ -1171,6 +1174,18 @@ mdsync(void)
 								 FilePathName(seg->mdfd_vfd),
 								 (double) elapsed / 1000);
 
+						/*
+						 * If this fsync took a long time, sleep for 'fsync time * checkpointer_fsync_delay_ratio'
+						 * to give priority to executing transactions.
+						 */
+						if( CheckPointerFsyncDelayThreshold >= 0 &&
+							!shutdown_requested &&
+							!ImmediateCheckpointRequested() &&
+							(elapsed / 1000 > CheckPointerFsyncDelayThreshold)){
+							pg_usleep((elapsed / 1000) * CheckPointerFsyncDelayRatio * 1000L);
+							elog(DEBUG1, "checkpoint sync sleep: time=%.3f msec",
+                                                                 (double) (elapsed / 1000) * CheckPointerFsyncDelayRatio);
+							}
 						break;	/* out of retry loop */
 					}
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ea16c64..f3fa5ab 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2014,6 +2014,30 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+                {"checkpointer_write_delay", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+                        gettext_noop("checkpointer sleep time during dirty buffers write in checkpoint."),
+                        NULL,
+			GUC_UNIT_MS
+                },
+                &CheckPointerWriteDelay,
+                200, 10, 10000,
+                NULL, NULL, NULL
+        },
+
+        {
+                {"checkpointer_fsync_delay_threshold", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+                        gettext_noop("If a file's fsync time exceeds this threshold, the checkpointer sleeps for file_fsync_time * checkpointer_fsync_delay_ratio."),
+                        NULL,
+			GUC_UNIT_MS
+                },
+                &CheckPointerFsyncDelayThreshold,
+                -1, -1, 1000000,
+                NULL, NULL, NULL
+        },
+
+
+
+	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
 			NULL,
@@ -2551,6 +2575,36 @@ static struct config_real ConfigureNamesReal[] =
 		NULL, NULL, NULL
 	},
 
+        {
+                {"checkpoint_smooth_target", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+                        gettext_noop("Smooth control IO load between starting checkpoint and this target parameter in progress of checkpoint."),
+                        NULL
+                },
+                &CheckPointSmoothTarget,
+                0.0, 0.0, 1.0,
+                NULL, NULL, NULL
+        },
+
+	{
+		{"checkpoint_smooth_margin", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("More smooth control IO load between starting checkpoint and checkpoint_smooth_target."),
+		NULL
+		},
+		&CheckPointSmoothMargin,
+		0.0, 0.0, 1.0,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checkpointer_fsync_delay_ratio", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+		gettext_noop("checkpointer sleep time during file fsync in checkpoint."),
+		NULL
+		},
+		&CheckPointerFsyncDelayRatio,
+		0.0, 0.0, 1.0,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0303ac7..9c07bd8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -185,7 +185,12 @@
 #checkpoint_segments = 3		# in logfile segments, min 1, 16MB each
 #checkpoint_timeout = 5min		# range 30s-1h
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
+#checkpoint_smooth_target = 0.0		# smooth checkpoint target, 0.0 - 1.0
+#checkpoint_smooth_margin = 0.0		# smooth checkpoint margin, 0.0 - 1.0
 #checkpoint_warning = 30s		# 0 disables
+#checkpointer_write_delay = 200ms	# 10-10000 milliseconds
+#checkpointer_fsync_delay_ratio = 0.0	# range 0.0 - 1.0
+#checkpointer_fsync_delay_threshold = -1 	# range 0 - 1000000 milliseconds; -1 disables
 
 # - Archiving -
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 46d3c26..5964b99 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -21,9 +21,14 @@
 
 /* GUC options */
 extern int	BgWriterDelay;
+extern int	CheckPointerWriteDelay;
 extern int	CheckPointTimeout;
 extern int	CheckPointWarning;
+extern int	CheckPointerFsyncDelayThreshold;
 extern double CheckPointCompletionTarget;
+extern double CheckPointSmoothTarget;
+extern double CheckPointSmoothMargin;
+extern double CheckPointerFsyncDelayRatio;
 
 extern void BackgroundWriterMain(void) __attribute__((noreturn));
 extern void CheckpointerMain(void) __attribute__((noreturn));
@@ -31,6 +36,7 @@ extern void CheckpointerMain(void) __attribute__((noreturn));
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
 
+extern bool ImmediateCheckpointRequested(void);
 extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
 					BlockNumber segno);
 extern void AbsorbFsyncRequests(void);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 8dcdd4b..efc5ee4 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -63,6 +63,7 @@ enum config_group
 	RESOURCES_KERNEL,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
+	RESOURCES_CHECKPOINTER,
 	RESOURCES_ASYNCHRONOUS,
 	WAL,
 	WAL_SETTINGS,
