On 22 June 2017 at 22:52, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun...@enterprisedb.com> 
> wrote:
>> [ new patch ]
> I think this is looking better.  I have some suggestions:
> * I suggest renaming launch_autoprewarm_dump() to
> autoprewarm_start_worker().  I think that will be clearer.  Remember
> that user-visible names, internal names, and the documentation should
> all match.


I like related functions and GUCs to be similarly named so that they
have the same prefix.

> * I think the GUC could be pg_prewarm.autoprewarm rather than
> pg_prewarm.enable_autoprewarm.  It's shorter and, I think, no less
> clear.


I also think pg_prewarm.dump_interval should be renamed to

> * In the documentation, don't say "This is a SQL callable function
> to....".  This is a list of SQL-callable functions, so each thing in
> the list is one.  Just delete this from the beginning of each
> sentence.

I've made a pass at the documentation and ended up removing those
intros.  I haven't made any GUC/function renaming changes, but I have
rewritten some paragraphs for clarity.  Updated patch attached.

One thing I couldn't quite make sense of is:

"The autoprewarm process will start loading blocks recorded in
$PGDATA/autoprewarm.blocks until there is a free buffer left in the
buffer pool."

Is this saying "until there is a single free buffer remaining in
shared buffers"?  I haven't corrected or clarified this as I don't
understand it.

Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
detect an autoprewarm process already running.  I'd want this to
return NULL or an error if called for a 2nd time.

> * The reason for the AT_PWARM_* naming is not very obvious.  Does AT
> mean "at" or "auto" or something else?  How about
> * Instead of defining apw_sigusr1_handler, I think you could just use
> procsignal_sigusr1_handler.  Instead of defining apw_sigterm_handler,
> perhaps you could just use die().  got_sigterm would go away, and
> you'd just CHECK_FOR_INTERRUPTS().
> * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
> reset_apw_state(), which might be better named detach_apw_shmem().
> Similarly, init_apw_state() could be init_apw_shmem().
> * Instead of load_one_database(), I suggest
> autoprewarm_database_main().  That is more parallel to
> autoprewarm_main(), which you have elsewhere, and makes it more
> obvious that it's the main entrypoint for some background worker.
> * Instead of launch_and_wait_for_per_database_worker(), I suggest
> autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
> suggest autoprewarm_buffers().   The motivation for changing prewarm
> to autoprewarm is that we want the names here to be clearly distinct
> from the other parts of pg_prewarm that are not related to
> autoprewarm.  The motivation for changing buffer_pool to buffers is
> just that it's a little shorter.  Personally I also like the sound
> of it better, but YMMV.
> * prewarm_buffer_pool() ends with a useless return statement.  I
> suggest removing it.
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system?  fopen() et al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free.  Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might.  It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.
> * I think that it would be cool if, when autoprewarm completed, it
> printed a message at LOG rather than DEBUG1, and with a few more
> details, like "autoprewarm successfully prewarmed %d of %d
> previously-loaded blocks".  This would require some additional
> tracking that you haven't got right now; you'd have to keep track not
> only of the number of blocks read from the file but how many of those
> some worker actually loaded.  You could do that with an extra counter
> in the shared memory area that gets incremented by the per-database
> workers.
> * dump_block_info_periodically() calls ResetLatch() immediately before
> WaitLatch; that's backwards.  See the commit message for commit
> 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to
> the top of latch.h for details on how to do this correctly.
> * dump_block_info_periodically()'s main loop is a bit confusing.  I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm.  You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix that by moving the got_sighup test to the
> top of the loop, which is a good idea anyway for at least two other
> reasons.  First, you probably want to check for a pending SIGHUP on
> initially entering this function, because something might have changed
> during the prewarm phase, and second, see the previous comment about
> using the "another valid coding pattern" from latch.h, which puts the
> ResetLatch() at the bottom of the loop.
> * I think that launch_autoprewarm_dump() should ereport(ERROR, ...)
> rather than just return NULL if the feature is disabled.  Maybe
> something like ... ERROR: pg_prewarm.dump_interval must be
> non-negative in order to launch worker
> * Not sure about this one, but maybe we should consider just getting
> rid of pg_prewarm.dump_interval = -1 altogether and make the minimum
> value 0. If pg_prewarm.autoprewarm = on, then we start the worker and
> dump according to the dump interval; if pg_prewarm.autoprewarm = off
> then we don't start the worker automatically, but we still let you
> start it manually.  If you do, it respects the configured
> dump_interval.  With this design, we don't need the error suggested in
> the previous item at all, and the code can be simplified in various
> places --- all the checks for AT_PWARM_OFF go away.  And I don't see
> that we're really losing anything.  There's not much sense in dumping
> but not prewarming or prewarming but not dumping, so having
> pg_prewarm.autoprewarm configure whether the worker is started
> automatically rather than whether it prewarms (with a separate control
> for whether it dumps) seems to make sense.  The one time when you want
> to do one without the other is when you first install the extension --
> during the first server lifetime, you'll want to dump, so that after
> the next restart you have something to preload.  But this design would
> allow that.

diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile
index 7ad941e..88580d1 100644
--- a/contrib/pg_prewarm/Makefile
+++ b/contrib/pg_prewarm/Makefile
@@ -1,10 +1,10 @@
 # contrib/pg_prewarm/Makefile
 MODULE_big = pg_prewarm
-OBJS = pg_prewarm.o $(WIN32RES)
+OBJS = pg_prewarm.o autoprewarm.o $(WIN32RES)
 EXTENSION = pg_prewarm
-DATA = pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql
+DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql
 PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache"
 ifdef USE_PGXS
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
new file mode 100644
index 0000000..f84fa4a
--- /dev/null
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -0,0 +1,1109 @@
+/*-------------------------------------------------------------------------
+ *
+ * autoprewarm.c
+ *		Automatically prewarms the shared buffer pool when the server restarts.
+ *
+ *
+ *		Autoprewarm is a bgworker process that automatically records
+ *		information about the blocks present in the buffer pool before server
+ *		shutdown, and then prewarms the buffer pool with those blocks when the
+ *		server restarts.
+ *
+ *		How does it work? When the shared library "pg_prewarm" is preloaded, a
+ *		bgworker "autoprewarm" is launched immediately after the server has
+ *		reached a consistent state. The bgworker loads the recorded blocks
+ *		until no free buffers remain in the buffer pool. This way we do not
+ *		evict any newer blocks that were loaded either by the recovery process
+ *		or by querying clients.
+ *
+ *		Once the "autoprewarm" bgworker has completed its prewarm task, it will
+ *		start a new task to periodically dump the BlockInfoRecords related to
+ *		the blocks which are currently in the shared buffer pool. On the next
+ *		server restart, the bgworker will prewarm the buffer pool by loading
+ *		those blocks. The GUC pg_prewarm.dump_interval controls the dumping
+ *		activity of the bgworker.
+ *
+ *	Copyright (c) 2016-2017, PostgreSQL Global Development Group
+ *
+ *		contrib/pg_prewarm/autoprewarm.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include <unistd.h>
+/* These are always necessary for a bgworker. */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+/* These are necessary for prewarm utilities. */
+#include "access/heapam.h"
+#include "access/xact.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/dsm.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/relfilenodemap.h"
+#include "utils/resowner.h"
+#define AT_PWARM_OFF -1
+#define AT_PWARM_DUMP_AT_SHUTDOWN_ONLY 0
+#define AUTOPREWARM_FILE "autoprewarm.blocks"
+/* Primary functions */
+void		_PG_init(void);
+void		autoprewarm_main(Datum main_arg);
+static void dump_block_info_periodically(void);
+static pid_t autoprewarm_dump_launcher(void);
+static void setup_autoprewarm(BackgroundWorker *autoprewarm,
+				  const char *worker_name,
+				  const char *worker_function,
+				  Datum main_arg, int restart_time,
+				  int extra_flags);
+void		load_one_database(Datum main_arg);
+/*
+ * Signal Handlers.
+ */
+static void apw_sigterm_handler(SIGNAL_ARGS);
+static void apw_sighup_handler(SIGNAL_ARGS);
+static void apw_sigusr1_handler(SIGNAL_ARGS);
+/* Flags set by signal handlers */
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sighup = false;
+/*
+ *	Signal handler for SIGTERM
+ *	Set a flag so the worker can shut down at a safe point.
+ */
+static void
+apw_sigterm_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+	got_sigterm = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+	errno = save_errno;
+}
+/*
+ *	Signal handler for SIGHUP
+ *	Set a flag to reread the config file.
+ */
+static void
+apw_sighup_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+	got_sighup = true;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+	errno = save_errno;
+}
+/*
+ *	Signal handler for SIGUSR1.
+ *	The postmaster sends SIGUSR1 (via bgw_notify_pid) when a worker we
+ *	launched starts or stops; just set our latch.
+ */
+static void
+apw_sigusr1_handler(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+	if (MyProc)
+		SetLatch(&MyProc->procLatch);
+	errno = save_errno;
+}
+/* ============================================================================
+ * ==============	Types and variables used by autoprewarm   =============
+ * ============================================================================
+ */
+/* Metadata of each persistent block which is dumped and used for loading. */
+typedef struct BlockInfoRecord
+{
+	Oid			database;
+	Oid			tablespace;
+	Oid			filenode;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+} BlockInfoRecord;
+/* Tasks performed by autoprewarm workers. */
+typedef enum
+{
+	TASK_PREWARM_BUFFERPOOL,	/* prewarm the buffer pool. */
+	TASK_DUMP_BUFFERPOOL_INFO	/* dump the buffer pool block info. */
+} AutoPrewarmTask;
+/* Shared state information for autoprewarm bgworker. */
+typedef struct AutoPrewarmSharedState
+{
+	LWLock		lock;			/* mutual exclusion */
+	pid_t		bgworker_pid;	/* for main bgworker */
+	pid_t		pid_using_dumpfile;		/* for autoprewarm or block dump */
+	bool		skip_prewarm_on_restart;		/* if set true, prewarm task
+												 * will not be done */
+	/* Following items are for communication with per-database worker */
+	dsm_handle	block_info_handle;
+	Oid			database;
+	int			prewarm_start_idx;
+	int			prewarm_stop_idx;
+} AutoPrewarmSharedState;
+static AutoPrewarmSharedState *apw_state = NULL;
+/*
+ * This data structure represents a buffered file.
+ */
+typedef struct BufferFile
+{
+	char		transient_dump_file_path[MAXPGPATH];	/* actual file to be
+														 * written */
+	int			fd;				/* file descriptor to above file */
+	char		buf[BLCKSZ];	/* buffer used before writing to file */
+	int			pos;			/* next write position in buffer. */
+}	BufferFile;
+/* GUC variable that controls the dump activity of autoprewarm. */
+static int	dump_interval = 0;
+/*
+ * GUC variable to decide whether autoprewarm worker should be started when
+ * preloaded.
+ */
+static bool enable_autoprewarm = true;
+/* Return -1 or 1 from the enclosing function if field "fld" differs. */
+#define cmp_member_elem(fld)	\
+do { \
+	if (a->fld < b->fld)		\
+		return -1;				\
+	else if (a->fld > b->fld)	\
+		return 1;				\
+} while(0)
+/*
+ * blockinfo_cmp
+ *		Compare function used for qsort().
+ */
+static int
+blockinfo_cmp(const void *p, const void *q)
+{
+	BlockInfoRecord *a = (BlockInfoRecord *) p;
+	BlockInfoRecord *b = (BlockInfoRecord *) q;
+	cmp_member_elem(database);
+	cmp_member_elem(tablespace);
+	cmp_member_elem(filenode);
+	cmp_member_elem(forknum);
+	cmp_member_elem(blocknum);
+	return 0;
+}
+/* ============================================================================
+ * =====================	Prewarm part of autoprewarm =======================
+ * ============================================================================
+ */
+/*
+ * reset_apw_state
+ *		on_shmem_exit callback: clear the PIDs this process recorded in the
+ *		shared autoprewarm state.
+ */
+static void
+reset_apw_state(int code, Datum arg)
+{
+	if (apw_state->pid_using_dumpfile == MyProcPid)
+		apw_state->pid_using_dumpfile = InvalidPid;
+	if (apw_state->bgworker_pid == MyProcPid)
+		apw_state->bgworker_pid = InvalidPid;
+}
+/*
+ * init_apw_state
+ *		Allocate and initialize autoprewarm related shared memory.
+ */
+static void
+init_apw_state(void)
+{
+	bool		found = false;
+	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+	apw_state = ShmemInitStruct("autoprewarm",
+								sizeof(AutoPrewarmSharedState),
+								&found);
+	if (!found)
+	{
+		/* First time through ... */
+		LWLockInitialize(&apw_state->lock, LWLockNewTrancheId());
+		apw_state->bgworker_pid = InvalidPid;
+		apw_state->pid_using_dumpfile = InvalidPid;
+		apw_state->skip_prewarm_on_restart = false;
+	}
+	LWLockRelease(AddinShmemInitLock);
+}
+/*
+ * load_one_database
+ *		This subroutine loads the BlockInfoRecords of the database set in
+ *		AutoPrewarmSharedState.
+ *
+ * Connect to the database and load the blocks of that database which are given
+ * by [apw_state->prewarm_start_idx, apw_state->prewarm_stop_idx).
+ */
+void
+load_one_database(Datum main_arg)
+{
+	uint32		pos;
+	BlockInfoRecord *block_info;
+	Relation	rel = NULL;
+	BlockNumber nblocks = 0;
+	BlockInfoRecord *old_blk;
+	dsm_segment *seg;
+	/* Establish signal handlers before unblocking signals. */
+	pqsignal(SIGTERM, apw_sigterm_handler);
+	pqsignal(SIGHUP, apw_sighup_handler);
+	/* We're now ready to receive signals */
+	BackgroundWorkerUnblockSignals();
+	init_apw_state();
+	seg = dsm_attach(apw_state->block_info_handle);
+	if (seg == NULL)
+		ereport(ERROR,
+				(errmsg("could not map dynamic shared memory segment")));
+	block_info = (BlockInfoRecord *) dsm_segment_address(seg);
+	BackgroundWorkerInitializeConnectionByOid(apw_state->database, InvalidOid);
+	old_blk = NULL;
+	pos = apw_state->prewarm_start_idx;
+	while (!got_sigterm && pos < apw_state->prewarm_stop_idx &&
+		   have_free_buffer())
+	{
+		BlockInfoRecord *blk = &block_info[pos++];
+		Buffer		buf;
+		/*
+		 * Quit if we've reached records for another database. If previous
+		 * blocks are of some global objects, then continue pre-warming.
+		 */
+		if (old_blk != NULL && old_blk->database != blk->database &&
+			old_blk->database != 0)
+			break;
+		/*
+		 * As soon as we encounter a block of a new relation, close the old
+		 * relation. Note that rel will be NULL if try_relation_open failed
+		 * previously, in that case there is nothing to close.
+		 */
+		if (old_blk != NULL && old_blk->filenode != blk->filenode &&
+			rel != NULL)
+		{
+			relation_close(rel, AccessShareLock);
+			rel = NULL;
+			CommitTransactionCommand();
+		}
+		/*
+		 * Try to open each new relation, but only once, when we first
+		 * encounter it. If it's been dropped, skip the associated blocks.
+		 */
+		if (old_blk == NULL || old_blk->filenode != blk->filenode)
+		{
+			Oid			reloid;
+			Assert(rel == NULL);
+			StartTransactionCommand();
+			reloid = RelidByRelfilenode(blk->tablespace, blk->filenode);
+			if (OidIsValid(reloid))
+				rel = try_relation_open(reloid, AccessShareLock);
+			if (!rel)
+				CommitTransactionCommand();
+		}
+		if (!rel)
+		{
+			old_blk = blk;
+			continue;
+		}
+		/* Once per fork, check for fork existence and size. */
+		if (old_blk == NULL ||
+			old_blk->filenode != blk->filenode ||
+			old_blk->forknum != blk->forknum)
+		{
+			RelationOpenSmgr(rel);
+			/*
+			 * smgrexists is not safe for illegal forknum, hence check whether
+			 * the passed forknum is valid before using it in smgrexists.
+			 */
+			if (blk->forknum > InvalidForkNumber &&
+				blk->forknum <= MAX_FORKNUM &&
+				smgrexists(rel->rd_smgr, blk->forknum))
+				nblocks = RelationGetNumberOfBlocksInFork(rel, blk->forknum);
+			else
+				nblocks = 0;
+		}
+		/* Check whether blocknum is valid and within fork file size. */
+		if (blk->blocknum >= nblocks)
+		{
+			/* Move to next forknum. */
+			old_blk = blk;
+			continue;
+		}
+		/* Prewarm buffer. */
+		buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum, RBM_NORMAL,
+								 NULL);
+		if (BufferIsValid(buf))
+			ReleaseBuffer(buf);
+		old_blk = blk;
+	}
+	dsm_detach(seg);
+	/* Release lock on previous relation. */
+	if (rel)
+	{
+		relation_close(rel, AccessShareLock);
+		CommitTransactionCommand();
+	}
+	return;
+}
+/*
+ * launch_and_wait_for_per_database_worker
+ *		Register a per-database dynamic worker to load the blocks of one
+ *		database, and wait for it to finish.
+ */
+static void
+launch_and_wait_for_per_database_worker(void)
+{
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	BgwHandleStatus status PG_USED_FOR_ASSERTS_ONLY;
+	setup_autoprewarm(&worker, "autoprewarm", "load_one_database",
+					  (Datum) NULL, BGW_NEVER_RESTART,
+					  BGWORKER_BACKEND_DATABASE_CONNECTION);
+	/* Set bgw_notify_pid so that we can use WaitForBackgroundWorkerShutdown */
+	worker.bgw_notify_pid = MyProcPid;
+	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
+	{
+		ereport(ERROR,
+				(errmsg("registering dynamic bgworker autoprewarm failed"),
+				 errhint("Consider increasing configuration parameter \"max_worker_processes\".")));
+	}
+	status = WaitForBackgroundWorkerShutdown(handle);
+	Assert(status == BGWH_STOPPED);
+}
+/*
+ * prewarm_buffer_pool
+ *		The main routine that prewarms the buffer pool.
+ *
+ * The prewarm bgworker first loads all the BlockInfoRecords in
+ * $PGDATA/AUTOPREWARM_FILE into a DSM segment. These BlockInfoRecords are
+ * then grouped by database, and for each group a per-database worker is
+ * launched to load the corresponding blocks. The next worker is launched
+ * only after the previous one has finished its job.
+ */
+static void
+prewarm_buffer_pool(void)
+{
+	FILE	   *file = NULL;
+	uint32		num_elements,
+				i;
+	BlockInfoRecord *blkinfo;
+	dsm_segment *seg;
+	/*
+	 * Since there can be at most one worker for prewarm, locking is not
+	 * required for setting skip_prewarm_on_restart.
+	 */
+	apw_state->skip_prewarm_on_restart = true;
+	LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
+	if (apw_state->pid_using_dumpfile == InvalidPid)
+		apw_state->pid_using_dumpfile = MyProcPid;
+	else
+	{
+		LWLockRelease(&apw_state->lock);
+		ereport(LOG,
+				(errmsg("skipping prewarm because block dump file is being written by PID %d",
+						apw_state->pid_using_dumpfile)));
+		return;
+	}
+	LWLockRelease(&apw_state->lock);
+	file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R);
+	if (!file)
+	{
+		if (errno != ENOENT)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m",
+							AUTOPREWARM_FILE)));
+		apw_state->pid_using_dumpfile = InvalidPid;
+		return;					/* No file to load. */
+	}
+	if (fscanf(file, "<<%u>>\n", &num_elements) != 1)
+	{
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from file \"%s\": %m",
+						AUTOPREWARM_FILE)));
+	}
+	seg = dsm_create(sizeof(BlockInfoRecord) * num_elements, 0);
+	blkinfo = (BlockInfoRecord *) dsm_segment_address(seg);
+	for (i = 0; i < num_elements; i++)
+	{
+		/* Get next block. */
+		if (5 != fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database,
+						&blkinfo[i].tablespace, &blkinfo[i].filenode,
+						(uint32 *) &blkinfo[i].forknum, &blkinfo[i].blocknum))
+			break;
+	}
+	FreeFile(file);
+	if (num_elements != i)
+		elog(ERROR, "autoprewarm block dump has %u entries but expected %u",
+			 i, num_elements);
+	/*
+	 * Sort the block records to increase the chance of sequential reads
+	 * during load.
+	 */
+	pg_qsort(blkinfo, num_elements, sizeof(BlockInfoRecord), blockinfo_cmp);
+	apw_state->block_info_handle = dsm_segment_handle(seg);
+	apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0;
+	/* Get the info position of the first block of the next database. */
+	while (apw_state->prewarm_start_idx < num_elements)
+	{
+		uint32		i = apw_state->prewarm_start_idx;
+		Oid			current_db = blkinfo[i].database;
+		/*
+		 * Advance the prewarm_stop_idx to the first BlockInfoRecord that does
+		 * not belong to this database.
+		 */
+		i++;
+		while (i < num_elements)
+		{
+			if (current_db != blkinfo[i].database)
+			{
+				/*
+				 * Combine BlockInfoRecords of global objects with those of the
+				 * next non-global object.
+				 */
+				if (current_db != InvalidOid)
+					break;
+				current_db = blkinfo[i].database;
+			}
+			i++;
+		}
+		/*
+		 * If we reach this point with current_db == InvalidOid, then only
+		 * BlockInfoRecords belonging to global objects exist. Since we cannot
+		 * connect to InvalidOid, skip prewarming these objects.
+		 */
+		if (current_db == InvalidOid)
+			break;
+		apw_state->prewarm_stop_idx = i;
+		apw_state->database = current_db;
+		Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx);
+		/*
+		 * Register a per-database worker to load blocks of the database. Wait
+		 * until it has finished before starting the next worker.
+		 */
+		launch_and_wait_for_per_database_worker();
+		apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx;
+	}
+	dsm_detach(seg);
+	apw_state->block_info_handle = DSM_HANDLE_INVALID;
+	apw_state->pid_using_dumpfile = InvalidPid;
+	ereport(DEBUG1,
+			(errmsg("autoprewarm load task ended")));
+	return;
+}
+/* ============================================================================
+ * ===================== Dump part of Autoprewarm =============================
+ * ============================================================================
+ */
+/*
+ * This submodule is for periodically dumping BlockInfoRecords in the buffer
+ * pool into a dump file AUTOPREWARM_FILE.
+ * Each BlockInfoRecord entry consists of database, tablespace, filenode,
+ * forknum, blocknum. Note that the dump is in text form so that it is
+ * human-readable and can be edited if required.
+ */
+/*
+ * buffer_file_flush
+ *		Write the buffer contents out to the actual file.
+ */
+static void
+buffer_file_flush(BufferFile * file)
+{
+	ssize_t		w_size;
+	char	   *buf = file->buf;
+	while (file->pos)
+	{
+		/* write to file until an error */
+		w_size = write(file->fd, buf, file->pos);
+		if (w_size > 0)
+		{
+			file->pos -= w_size;
+			buf += w_size;
+		}
+		else
+		{
+			int			save_errno = errno;
+			CloseTransientFile(file->fd);
+			unlink(file->transient_dump_file_path);
+			errno = save_errno;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m",
+							file->transient_dump_file_path)));
+		}
+	}
+}
+/*
+ * buffer_file_write
+ *		Accumulate the contents in a BLCKSZ buffer, flushing it to the actual
+ *		file whenever the next write would overflow it.
+ */
+static void
+buffer_file_write(BufferFile * file, char *block_info, int block_info_len)
+{
+	Assert(block_info_len <= BLCKSZ);
+	/* If we exceed the buffer size unload buffer to actual file. */
+	if ((file->pos + block_info_len) > BLCKSZ)
+		buffer_file_flush(file);
+	memcpy(file->buf + file->pos, block_info, block_info_len);
+	file->pos += block_info_len;
+}
+/*
+ * dump_now
+ *		Dump the BlockInfoRecords of the blocks currently in the buffer pool.
+ */
+static uint32
+dump_now(bool is_bgworker)
+{
+	uint32		i;
+	int			ret,
+				block_info_len;
+	uint32		num_blocks;
+	BlockInfoRecord *block_info_array;
+	BufferDesc *bufHdr;
+	BufferFile *file;
+	char		block_info[1024];
+	LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
+	if (apw_state->pid_using_dumpfile == InvalidPid)
+		apw_state->pid_using_dumpfile = MyProcPid;
+	else
+	{
+		LWLockRelease(&apw_state->lock);
+		if (!is_bgworker)
+			ereport(ERROR,
+					(errmsg("could not perform block dump because dump file is being used by PID %d",
+							apw_state->pid_using_dumpfile)));
+		ereport(LOG,
+				(errmsg("skipping block dump because it is already being performed by PID %d",
+						apw_state->pid_using_dumpfile)));
+		return 0;
+	}
+	LWLockRelease(&apw_state->lock);
+	block_info_array =
+		(BlockInfoRecord *) palloc(sizeof(BlockInfoRecord) * NBuffers);
+	for (num_blocks = 0, i = 0; i < NBuffers; i++)
+	{
+		uint32		buf_state;
+		/* In case of a SIGHUP, just reload the configuration. */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		/* Have we been asked to stop dump? */
+		if (dump_interval == AT_PWARM_OFF)
+		{
+			pfree(block_info_array);
+			return 0;
+		}
+		bufHdr = GetBufferDescriptor(i);
+		/* Lock each buffer header before inspecting. */
+		buf_state = LockBufHdr(bufHdr);
+		if (buf_state & BM_TAG_VALID)
+		{
+			block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
+			block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
+			block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
+			block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
+			block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
+			++num_blocks;
+		}
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+	file = (BufferFile *) palloc(sizeof(BufferFile));
+	snprintf(file->transient_dump_file_path, MAXPGPATH, "%s.tmp",
+			 AUTOPREWARM_FILE);
+	file->fd = OpenTransientFile(file->transient_dump_file_path,
+							 O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, 0666);
+	if (file->fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open \"%s\": %m",
+						file->transient_dump_file_path)));
+	file->pos = 0;
+	block_info_len = sprintf(block_info, "<<%u>>\n", num_blocks);
+	buffer_file_write(file, block_info, block_info_len);
+	for (i = 0; i < num_blocks; i++)
+	{
+		/* In case of a SIGHUP, just reload the configuration. */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+		/* Have we been asked to stop dump? */
+		if (dump_interval == AT_PWARM_OFF)
+		{
+			pfree(block_info_array);
+			CloseTransientFile(file->fd);
+			unlink(file->transient_dump_file_path);
+			pfree(file);
+			return 0;
+		}
+		block_info_len = sprintf(block_info, "%u,%u,%u,%u,%u\n",
+								 block_info_array[i].database,
+								 block_info_array[i].tablespace,
+								 block_info_array[i].filenode,
+								 (uint32) block_info_array[i].forknum,
+								 block_info_array[i].blocknum);
+		buffer_file_write(file, block_info, block_info_len);
+	}
+	pfree(block_info_array);
+	/* Write remaining buffer contents to actual file. */
+	buffer_file_flush(file);
+	/*
+	 * Rename transient_dump_file_path to AUTOPREWARM_FILE to make things
+	 * permanent.
+	 */
+	ret = CloseTransientFile(file->fd);
+	if (ret != 0)
+	{
+		int			save_errno = errno;
+		unlink(file->transient_dump_file_path);
+		errno = save_errno;
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m",
+						file->transient_dump_file_path)));
+	}
+	(void) durable_rename(file->transient_dump_file_path, AUTOPREWARM_FILE,
+						  ERROR);
+	pfree(file);
+	apw_state->pid_using_dumpfile = InvalidPid;
+	ereport(DEBUG1,
+			(errmsg("saved metadata info of %u blocks", num_blocks)));
+	return num_blocks;
+}
+/*
+ * dump_block_info_periodically
+ *		This loop periodically calls dump_now().
+ *
+ * Call dump_now() at regular intervals defined by GUC variable dump_interval.
+ */
+static void
+dump_block_info_periodically(void)
+{
+	TimestampTz last_dump_time = 0;
+	while (!got_sigterm)
+	{
+		int			rc;
+		struct timeval nap;
+		nap.tv_usec = 0;
+		/* Have we been asked to stop dumping? */
+		if (dump_interval == AT_PWARM_OFF)
+			return;
+		if (dump_interval > AT_PWARM_DUMP_AT_SHUTDOWN_ONLY)
+		{
+			TimestampTz current_time = GetCurrentTimestamp();
+			if (last_dump_time == 0 ||
+				TimestampDifferenceExceeds(last_dump_time,
+										   current_time,
+										   (dump_interval * 1000)))
+			{
+				dump_now(true);
+				/*
+				 * It is better to stop when shutdown signal is received
+				 * during or right after a dump.
+				 */
+				if (got_sigterm)
+					return;
+				last_dump_time = GetCurrentTimestamp();
+				nap.tv_sec = dump_interval;
+				nap.tv_usec = 0;
+			}
+			else
+			{
+				long		secs;
+				int			usecs;
+				TimestampDifference(last_dump_time, current_time,
+									&secs, &usecs);
+				nap.tv_sec = dump_interval - secs;
+				nap.tv_usec = 0;
+			}
+		}
+		else
+			last_dump_time = 0;
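+		ResetLatch(&MyProc->procLatch);
+		rc = WaitLatch(&MyProc->procLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L),
+					   PG_WAIT_EXTENSION);
+		/* Emergency bailout if postmaster has died. */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);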
+		/* In case of a SIGHUP, just reload the configuration. */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+	}
+	/* It's time for postmaster shutdown, let's dump for one last time. */
+	if (dump_interval != AT_PWARM_OFF)
+		dump_now(true);
+}
+/*
+ * autoprewarm_main
+ *		The main entry point of autoprewarm bgworker process.
+ */
+void
+autoprewarm_main(Datum main_arg)
+{
+	AutoPrewarmTask todo_task;
+	/* Establish signal handlers before unblocking signals. */
+	pqsignal(SIGTERM, apw_sigterm_handler);
+	pqsignal(SIGHUP, apw_sighup_handler);
+	pqsignal(SIGUSR1, apw_sigusr1_handler);
+	/* We're now ready to receive signals. */
+	BackgroundWorkerUnblockSignals();
+	todo_task = DatumGetInt32(main_arg);
+	Assert(todo_task == TASK_PREWARM_BUFFERPOOL ||
+		   todo_task == TASK_DUMP_BUFFERPOOL_INFO);
+	init_apw_state();
+	LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
+	if (apw_state->bgworker_pid != InvalidPid)
+	{
+		LWLockRelease(&apw_state->lock);
+		ereport(LOG,
+				(errmsg("autoprewarm worker is already running under PID %d",
+						apw_state->bgworker_pid)));
+		return;
+	}
+	apw_state->bgworker_pid = MyProcPid;
+	LWLockRelease(&apw_state->lock);
+	on_shmem_exit(reset_apw_state, 0);
+	ereport(LOG,
+			(errmsg("autoprewarm worker started")));
+	/*
+	 * We have finished initializing worker's state, let's start actual work.
+	 */
+	if (todo_task == TASK_PREWARM_BUFFERPOOL &&
+		!apw_state->skip_prewarm_on_restart)
+		prewarm_buffer_pool();
+	dump_block_info_periodically();
+	ereport(LOG,
+			(errmsg("autoprewarm worker stopped")));
+}
+/* ============================================================================
+ * =============	Extension's entry functions/utilities	===================
+ * ============================================================================
+ */
+/*
+ * setup_autoprewarm
+ *		A common function to initialize a BackgroundWorker structure.
+ */
+static void
+setup_autoprewarm(BackgroundWorker *autoprewarm, const char *worker_name,
+				  const char *worker_function, Datum main_arg, int restart_time,
+				  int extra_flags)
+{
+	MemSet(autoprewarm, 0, sizeof(BackgroundWorker));
+	autoprewarm->bgw_flags = BGWORKER_SHMEM_ACCESS | extra_flags;
+	/* Register the autoprewarm background worker */
+	autoprewarm->bgw_start_time = BgWorkerStart_ConsistentState;
+	autoprewarm->bgw_restart_time = restart_time;
+	strcpy(autoprewarm->bgw_library_name, "pg_prewarm");
+	strcpy(autoprewarm->bgw_function_name, worker_function);
+	strncpy(autoprewarm->bgw_name, worker_name, BGW_MAXLEN);
+	autoprewarm->bgw_main_arg = main_arg;
+ * _PG_init
+ *		Extension's entry point.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker prewarm_worker;
+	/* Define custom GUC variables. */
+	DefineCustomIntVariable("pg_prewarm.dump_interval",
+					   "Sets the maximum time between two buffer pool dumps",
+							"If set to zero, timer-based dumping is disabled."
+							" If set to -1, autoprewarm is stopped.",
+							&dump_interval,
+							300,
+							AT_PWARM_OFF, INT_MAX / 1000,
+							PGC_SIGHUP,
+							GUC_UNIT_S,
+							NULL,
+							NULL,
+							NULL);
+	if (process_shared_preload_libraries_in_progress)
+		DefineCustomBoolVariable("pg_prewarm.autoprewarm",
+								 "Enables or disables the autoprewarm worker.",
+								 NULL,
+								 &enable_autoprewarm,
+								 true,
+								 PGC_POSTMASTER,
+								 0,
+								 NULL,
+								 NULL,
+								 NULL);
+	else
+	{
+		/* If not run as a preloaded library, nothing more to do. */
+		EmitWarningsOnPlaceholders("pg_prewarm");
+		return;
+	}
+	EmitWarningsOnPlaceholders("pg_prewarm");
+	/* Request additional shared resources. */
+	RequestAddinShmemSpace(MAXALIGN(sizeof(AutoPrewarmSharedState)));
+	/* If autoprewarm bgworker is disabled then nothing more to do. */
+	if (!enable_autoprewarm)
+		return;
+	/* Register the worker that prewarms the buffer pool at startup. */
+	setup_autoprewarm(&prewarm_worker, "autoprewarm", "autoprewarm_main",
+					  Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0);
+	RegisterBackgroundWorker(&prewarm_worker);
+}
+
+/*
+ * autoprewarm_dump_launcher
+ *		Dynamically launch an autoprewarm dump worker.
+ */
+static pid_t
+autoprewarm_dump_launcher(void)
+{
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle;
+	BgwHandleStatus status;
+	pid_t		pid;
+	setup_autoprewarm(&worker, "autoprewarm", "autoprewarm_main",
+					  Int32GetDatum(TASK_DUMP_BUFFERPOOL_INFO), 0, 0);
+	/* Set bgw_notify_pid so that we can use WaitForBackgroundWorkerStartup */
+	worker.bgw_notify_pid = MyProcPid;
+	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
+	{
+		ereport(ERROR,
+				(errmsg("registering dynamic bgworker \"autoprewarm\" failed"),
+				 errhint("Consider increasing configuration parameter \"max_worker_processes\".")));
+	}
+	status = WaitForBackgroundWorkerStartup(handle, &pid);
+	if (status == BGWH_STOPPED)
+	{
+		ereport(ERROR,
+				(errmsg("could not start autoprewarm dump bgworker"),
+			   errhint("More details may be available in the server log.")));
+	}
+	if (status == BGWH_POSTMASTER_DIED)
+	{
+		ereport(ERROR,
+				(errmsg("cannot start bgworker autoprewarm without postmaster"),
+				 errhint("Kill all remaining database processes and restart the database.")));
+	}
+	Assert(status == BGWH_STARTED);
+	return pid;
+}
+
+/*
+ * launch_autoprewarm_dump
+ *		The C-Language entry function to launch autoprewarm dump bgworker.
+ */
+Datum
+launch_autoprewarm_dump(PG_FUNCTION_ARGS)
+{
+	pid_t		pid;
+	/* If dump_interval is disabled then nothing more to do. */
+	if (dump_interval == AT_PWARM_OFF)
+	pid = autoprewarm_dump_launcher();
+	PG_RETURN_INT32(pid);
+}
+
+/*
+ * autoprewarm_dump_now
+ *		The C-Language entry function to dump immediately.
+ */
+Datum
+autoprewarm_dump_now(PG_FUNCTION_ARGS)
+{
+	uint32		num_blocks = 0;
+	init_apw_state();
+	PG_TRY();
+	{
+		num_blocks = dump_now(false);
+	}
+	PG_CATCH();
+	{
+		if (apw_state->pid_using_dumpfile == MyProcPid)
+			apw_state->pid_using_dumpfile = InvalidPid;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+	PG_RETURN_INT64(num_blocks);
+}
diff --git a/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
new file mode 100644
index 0000000..a2241c6
--- /dev/null
+++ b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
@@ -0,0 +1,14 @@
+/* contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql */
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_prewarm UPDATE TO '1.2'" to load this file. \quit
+CREATE FUNCTION launch_autoprewarm_dump()
+RETURNS pg_catalog.int4 STRICT
+AS 'MODULE_PATHNAME', 'launch_autoprewarm_dump'
+LANGUAGE C;
+
+CREATE FUNCTION autoprewarm_dump_now()
+RETURNS pg_catalog.int8 STRICT
+AS 'MODULE_PATHNAME', 'autoprewarm_dump_now'
+LANGUAGE C;
diff --git a/contrib/pg_prewarm/pg_prewarm.control b/contrib/pg_prewarm/pg_prewarm.control
index cf2fb92..40e3add 100644
--- a/contrib/pg_prewarm/pg_prewarm.control
+++ b/contrib/pg_prewarm/pg_prewarm.control
@@ -1,5 +1,5 @@
 # pg_prewarm extension
 comment = 'prewarm relation data'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/pg_prewarm'
 relocatable = true
diff --git a/doc/src/sgml/pgprewarm.sgml b/doc/src/sgml/pgprewarm.sgml
index c090401..7f1972d 100644
--- a/doc/src/sgml/pgprewarm.sgml
+++ b/doc/src/sgml/pgprewarm.sgml
@@ -10,7 +10,9 @@
   The <filename>pg_prewarm</filename> module provides a convenient way
   to load relation data into either the operating system buffer cache
-  or the <productname>PostgreSQL</productname> buffer cache.
+  or the <productname>PostgreSQL</productname> buffer cache. It can also
+  automatically prewarm shared buffers whenever the server restarts.
@@ -55,6 +57,103 @@ pg_prewarm(regclass, mode text default 'buffer', fork text default 'main',
    cache. For these reasons, prewarming is typically most useful at startup,
    when caches are largely empty.
+<synopsis>
+launch_autoprewarm_dump() RETURNS int4
+</synopsis>
+  <para>
+   This will launch the <literal>autoprewarm</literal> worker which will dump
+   shared buffers to disk at the interval specified by
+   <varname>pg_prewarm.dump_interval</varname>.  The return value is the
+   process ID of the autoprewarm worker.  As only one
+   <literal>autoprewarm</literal> worker can be run per cluster at a time,
+   additional invocations will return a process ID, but that process will
+   immediately exit.
+  </para>
+<synopsis>
+autoprewarm_dump_now() RETURNS int8
+</synopsis>
+  <para>
+   This will immediately dump shared buffers to disk.  The return value is
+   the number of blocks dumped.
+  </para>
+ </sect2>
+ <sect2>
+  <title>autoprewarm</title>
+  <para>
+  This is a background worker process which automatically dumps the list of
+  blocks in shared buffers to disk before a shutdown, and then prewarms shared
+  buffers the next time the server is started by loading those blocks from
+  disk back into the buffer pool.
+  </para>
+  <para>
+  When the shared library <literal>pg_prewarm</literal> is preloaded via
+  <xref linkend="guc-shared-preload-libraries"> in <filename>postgresql.conf</>,
+  an <literal>autoprewarm</literal> background worker is launched immediately after the
+  server has reached a consistent state. The autoprewarm process loads the
+  blocks recorded in <filename>$PGDATA/autoprewarm.blocks</filename> only for as
+  long as free buffers remain in the buffer pool. This way it does not replace
+  any blocks that were already loaded by the recovery process or by querying
+  clients.
+  </para>
+  <para>
+  Once the <literal>autoprewarm</literal> process has finished loading buffers
+  from disk, it will periodically dump shared buffers to disk at the interval
+  specified by <varname>pg_prewarm.dump_interval</varname>.  Upon the next
+  server restart, the autoprewarm process will prewarm shared buffers with the
+  blocks that were last dumped to disk.
+  </para>
+ </sect2>
+ <sect2>
+  <title>Configuration Parameters</title>
+ <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_prewarm.autoprewarm</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>pg_prewarm.autoprewarm</> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      If set to <literal>on</literal>, an autoprewarm worker will be started
+      upon server start.  Setting this to <literal>off</literal> disables it.
+      The default value is <literal>on</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+  <variablelist>
+   <varlistentry>
+   <term>
+     <varname>pg_prewarm.dump_interval</varname> (<type>int</type>)
+     <indexterm>
+      <primary><varname>pg_prewarm.dump_interval</> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      This is the minimum number of seconds after which autoprewarm dumps
+      shared buffers to disk.  The default is 300 seconds.  If set to 0,
+      shared buffers will not be dumped at regular intervals, only when the
+      server is shut down.
+      If set to -1, the running <literal>autoprewarm</literal> process will
+      be stopped.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 9d8ae6a..f033323 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -169,6 +169,23 @@ ClockSweepTick(void)
 /*
+ * have_free_buffer -- a lockless check to see if there is a free buffer in
+ *					   the buffer pool.
+ *
+ * If the result is true, it may become stale as soon as other backends take
+ * buffers off the freelist, so callers that strictly need a free buffer
+ * should not rely on this.
+ */
+bool
+have_free_buffer(void)
+{
+	if (StrategyControl->firstFreeBuffer >= 0)
+		return true;
+	else
+		return false;
+}
+
+/*
  * StrategyGetBuffer
  *	Called by the bufmgr to get the next candidate buffer to use in
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b768b6f..300adfc 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -317,6 +317,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern bool have_free_buffer(void);
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23a4bbd..8785b3b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -138,6 +138,8 @@ AttrDefault
@@ -214,10 +216,12 @@ BitmapOr
@@ -2870,6 +2874,7 @@ pos_trgm
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)