On 22 June 2017 at 22:52, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun...@enterprisedb.com>
> wrote:
>> [ new patch ]
>
> I think this is looking better. I have some suggestions:
>
> * I suggest renaming launch_autoprewarm_dump() to
> autoprewarm_start_worker(). I think that will be clearer. Remember
> that user-visible names, internal names, and the documentation should
> all match.

+1

I like related functions and GUCs to be similarly named so that they
have the same prefix.

>
> * I think the GUC could be pg_prewarm.autoprewarm rather than
> pg_prewarm.enable_autoprewarm. It's shorter and, I think, no less
> clear.

+1

I also think pg_prewarm.dump_interval should be renamed to
pg_prewarm.autoprewarm_interval.

>
> * In the documentation, don't say "This is a SQL callable function
> to....". This is a list of SQL-callable functions, so each thing in
> the list is one. Just delete this from the beginning of each
> sentence.

I've made a pass at the documentation and ended up removing those
intros. I haven't made any GUC/function renaming changes, but I have
rewritten some paragraphs for clarity. Updated patch attached.

One thing I couldn't quite make sense of is:

"The autoprewarm process will start loading blocks recorded in
$PGDATA/autoprewarm.blocks until there is a free buffer left in the
buffer pool."

Is this saying "until there is a single free buffer remaining in
shared buffers"? I haven't corrected or clarified this as I don't
understand it.

Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
detect an autoprewarm process already running. I'd want this to
return NULL or an error if called for a 2nd time.

>
> * The reason for the AT_PWARM_* naming is not very obvious. Does AT
> mean "at" or "auto" or something else? How about
> AUTOPREWARM_INTERVAL_DISABLED, AUTOPREWARM_INTERVAL_SHUTDOWN_ONLY,
> AUTOPREWARM_INTERVAL_DEFAULT?
>
> * Instead of defining apw_sigusr1_handler, I think you could just use
> procsignal_sigusr1_handler. Instead of defining apw_sigterm_handler,
> perhaps you could just use die(). got_sigterm would go away, and
> you'd just CHECK_FOR_INTERRUPTS().
>
> * The PG_TRY()/PG_CATCH() block in autoprewarm_dump_now() could reuse
> reset_apw_state(), which might be better named detach_apw_shmem().
> Similarly, init_apw_state() could be init_apw_shmem().
>
> * Instead of load_one_database(), I suggest
> autoprewarm_database_main(). That is more parallel to
> autoprewarm_main(), which you have elsewhere, and makes it more
> obvious that it's the main entrypoint for some background worker.
>
> * Instead of launch_and_wait_for_per_database_worker(), I suggest
> autoprewarm_one_database(), and instead of prewarm_buffer_pool(), I
> suggest autoprewarm_buffers(). The motivation for changing prewarm
> to autoprewarm is that we want the names here to be clearly distinct
> from the other parts of pg_prewarm that are not related to
> autoprewarm. The motivation for changing buffer_pool to buffers is
> just that it's a little shorter. Personally I also like the sound
> of it better, but YMMV.
>
> * prewarm_buffer_pool() ends with a useless return statement. I
> suggest removing it.
>
> * Instead of creating our own buffering system via buffer_file_write()
> and buffer_file_flush(), why not just use the facilities provided by
> the operating system? fopen() et al. provide buffering, and we have
> AllocateFile() to provide a FILE *; it's just like
> OpenTransientFile(), which you are using, but you'll get the buffering
> stuff for free. Maybe there's some reason why this won't work out
> nicely, but off-hand it seems like it might. It looks like you are
> already using AllocateFile() to read the dump, so using it to write
> the dump as well seems like it would be logical.
>
> * I think that it would be cool if, when autoprewarm completed, it
> printed a message at LOG rather than DEBUG1, and with a few more
> details, like "autoprewarm successfully prewarmed %d of %d
> previously-loaded blocks". This would require some additional
> tracking that you haven't got right now; you'd have to keep track not
> only of the number of blocks read from the file but how many of those
> some worker actually loaded. You could do that with an extra counter
> in the shared memory area that gets incremented by the per-database
> workers.
>
> * dump_block_info_periodically() calls ResetLatch() immediately before
> WaitLatch; that's backwards. See the commit message for commit
> 887feefe87b9099eeeec2967ec31ce20df4dfa9b and the comments it added to
> the top of latch.h for details on how to do this correctly.
>
> * dump_block_info_periodically()'s main loop is a bit confusing. I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm. You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix that by moving the got_sighup test to the
> top of the loop, which is a good idea anyway for at least two other
> reasons. First, you probably want to check for a pending SIGHUP on
> initially entering this function, because something might have changed
> during the prewarm phase, and second, see the previous comment about
> using the "another valid coding pattern" from latch.h, which puts the
> ResetLatch() at the bottom of the loop.
>
> * I think that launch_autoprewarm_dump() should ereport(ERROR, ...)
> rather than just return NULL if the feature is disabled. Maybe
> something like ... ERROR: pg_prewarm.dump_interval must be
> non-negative in order to launch worker
>
> * Not sure about this one, but maybe we should consider just getting
> rid of pg_prewarm.dump_interval = -1 altogether and make the minimum
> value 0. If pg_prewarm.autoprewarm = on, then we start the worker and
> dump according to the dump interval; if pg_prewarm.autoprewarm = off
> then we don't start the worker automatically, but we still let you
> start it manually. If you do, it respects the configured
> dump_interval. With this design, we don't need the error suggested in
> the previous item at all, and the code can be simplified in various
> places --- all the checks for AT_PWARM_OFF go away. And I don't see
> that we're really losing anything. There's not much sense in dumping
> but not prewarming or prewarming but not dumping, so having
> pg_prewarm.autoprewarm configure whether the worker is started
> automatically rather than whether it prewarms (with a separate control
> for whether it dumps) seems to make sense. The one time when you want
> to do one without the other is when you first install the extension --
> during the first server lifetime, you'll want to dump, so that after
> the next restart you have something to preload. But this design would
> allow that.

--
Thom
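
For reference, the stdio-buffering suggestion quoted above could look roughly like the sketch below. It is a standalone illustration, not code from the patch: DumpRecord, write_dump() and read_dump() are hypothetical names, plain fopen()/fclose() stand in for AllocateFile()/FreeFile(), and the write to a temporary file followed by durable_rename() is left out. Only the on-disk format (a "<<count>>" header followed by one comma-separated line per block) is taken from the attached patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the patch's BlockInfoRecord, using plain
 * unsigned ints so the sketch compiles without PostgreSQL headers.
 */
typedef struct DumpRecord
{
	unsigned int database;
	unsigned int tablespace;
	unsigned int filenode;
	unsigned int forknum;
	unsigned int blocknum;
} DumpRecord;

/*
 * Write a "<<count>>" header and one CSV line per record.  stdio does the
 * buffering, so no hand-rolled BLCKSZ buffer is needed.
 * Returns 0 on success, -1 on any error (the partial file is removed).
 */
static int
write_dump(const char *path, const DumpRecord *recs, unsigned int n)
{
	FILE	   *fp = fopen(path, "w");
	unsigned int i;
	int			ok;

	assert(recs != NULL || n == 0);
	if (fp == NULL)
		return -1;

	ok = fprintf(fp, "<<%u>>\n", n) > 0;
	for (i = 0; ok && i < n; i++)
		ok = fprintf(fp, "%u,%u,%u,%u,%u\n",
					 recs[i].database, recs[i].tablespace,
					 recs[i].filenode, recs[i].forknum,
					 recs[i].blocknum) > 0;

	/* fclose() flushes the stdio buffer, so its result must be checked. */
	if (fclose(fp) != 0)
		ok = 0;
	if (!ok)
	{
		remove(path);
		return -1;
	}
	return 0;
}

/*
 * Read a dump back into recs (capacity max).  Returns the number of
 * records read, or -1 if the file is missing, malformed, or too large.
 */
static int
read_dump(const char *path, DumpRecord *recs, unsigned int max)
{
	FILE	   *fp = fopen(path, "r");
	unsigned int n;
	unsigned int i;

	if (fp == NULL)
		return -1;
	if (fscanf(fp, "<<%u>>", &n) != 1 || n > max)
	{
		fclose(fp);
		return -1;
	}
	for (i = 0; i < n; i++)
	{
		/* %u skips the newline left over from the previous line. */
		if (fscanf(fp, "%u,%u,%u,%u,%u",
				   &recs[i].database, &recs[i].tablespace,
				   &recs[i].filenode, &recs[i].forknum,
				   &recs[i].blocknum) != 5)
			break;
	}
	fclose(fp);
	return (i == n) ? (int) i : -1;
}
```

Inside the extension the same shape would presumably go through AllocateFile() and FreeFile(), with the temporary-file-then-durable_rename() step kept as it is in the patch, so that a failed dump never replaces a good one.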
diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile index 7ad941e..88580d1 100644 --- a/contrib/pg_prewarm/Makefile +++ b/contrib/pg_prewarm/Makefile @@ -1,10 +1,10 @@ # contrib/pg_prewarm/Makefile MODULE_big = pg_prewarm -OBJS = pg_prewarm.o $(WIN32RES) +OBJS = pg_prewarm.o autoprewarm.o $(WIN32RES) EXTENSION = pg_prewarm -DATA = pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql +DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache" ifdef USE_PGXS diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c new file mode 100644 index 0000000..f84fa4a --- /dev/null +++ b/contrib/pg_prewarm/autoprewarm.c @@ -0,0 +1,1109 @@ +/*------------------------------------------------------------------------- + * + * autoprewarm.c + * Automatically prewarms the shared buffer pool when server restarts. + * + * DESCRIPTION + * + * Autoprewarm is a bgworker process that automatically records the + * information about blocks which were present in buffer pool before + * server shutdown. Then prewarms the buffer pool on server restart + * with those blocks. + * + * How does it work? When the shared library "pg_prewarm" is preloaded, a + * bgworker "autoprewarm" is launched immediately after the server has + * reached a consistent state. The bgworker will start loading blocks + * recorded until there is no free buffer left in the buffer pool. This + * way we do not replace any new blocks which were loaded either by the + * recovery process or the querying clients. + * + * Once the "autoprewarm" bgworker has completed its prewarm task, it will + * start a new task to periodically dump the BlockInfoRecords related to + * the blocks which are currently in shared buffer pool. On next server + * restart, the bgworker will prewarm the buffer pool by loading those + * blocks. 
The GUC pg_prewarm.dump_interval will control the dumping + * activity of the bgworker. + * + * Copyright (c) 2016-2017, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/pg_prewarm/autoprewarm.c + *------------------------------------------------------------------------- + */ + +#include "postgres.h" +#include <unistd.h> + +/* These are always necessary for a bgworker. */ +#include "miscadmin.h" +#include "postmaster/bgworker.h" +#include "storage/ipc.h" +#include "storage/latch.h" +#include "storage/lwlock.h" +#include "storage/proc.h" +#include "storage/shmem.h" + +/* These are necessary for prewarm utilities. */ +#include "access/heapam.h" +#include "access/xact.h" +#include "catalog/pg_class.h" +#include "catalog/pg_type.h" +#include "pgstat.h" +#include "storage/buf_internals.h" +#include "storage/dsm.h" +#include "storage/smgr.h" +#include "utils/acl.h" +#include "utils/guc.h" +#include "utils/memutils.h" +#include "utils/rel.h" +#include "utils/relfilenodemap.h" +#include "utils/resowner.h" + +PG_FUNCTION_INFO_V1(launch_autoprewarm_dump); +PG_FUNCTION_INFO_V1(autoprewarm_dump_now); + +#define AT_PWARM_OFF -1 +#define AT_PWARM_DUMP_AT_SHUTDOWN_ONLY 0 +#define AT_PWARM_DEFAULT_DUMP_INTERVAL 300 + +#define AUTOPREWARM_FILE "autoprewarm.blocks" + +/* Primary functions */ +void _PG_init(void); +void autoprewarm_main(Datum main_arg); +static void dump_block_info_periodically(void); +static pid_t autoprewarm_dump_launcher(void); +static void setup_autoprewarm(BackgroundWorker *autoprewarm, + const char *worker_name, + const char *worker_function, + Datum main_arg, int restart_time, + int extra_flags); +void load_one_database(Datum main_arg); + +/* + * Signal Handlers. 
+ */ + +static void apw_sigterm_handler(SIGNAL_ARGS); +static void apw_sighup_handler(SIGNAL_ARGS); +static void apw_sigusr1_handler(SIGNAL_ARGS); + +/* Flags set by signal handlers */ +static volatile sig_atomic_t got_sigterm = false; +static volatile sig_atomic_t got_sighup = false; + +/* + * Signal handler for SIGTERM + * Set a flag to handle. + */ +static void +apw_sigterm_handler(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_sigterm = true; + + if (MyProc) + SetLatch(&MyProc->procLatch); + + errno = save_errno; +} + +/* + * Signal handler for SIGHUP + * Set a flag to reread the config file. + */ +static void +apw_sighup_handler(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_sighup = true; + + if (MyProc) + SetLatch(&MyProc->procLatch); + + errno = save_errno; +} + +/* + * Signal handler for SIGUSR1. + * The prewarm workers notify with SIGUSR1 on their startup/shutdown. + */ +static void +apw_sigusr1_handler(SIGNAL_ARGS) +{ + int save_errno = errno; + + if (MyProc) + SetLatch(&MyProc->procLatch); + + errno = save_errno; +} + +/* ============================================================================ + * ============== Types and variables used by autoprewarm ============= + * ============================================================================ + */ + +/* Metadata of each persistent block which is dumped and used for loading. */ +typedef struct BlockInfoRecord +{ + Oid database; + Oid tablespace; + Oid filenode; + ForkNumber forknum; + BlockNumber blocknum; +} BlockInfoRecord; + +/* Tasks performed by autoprewarm workers.*/ +typedef enum +{ + TASK_PREWARM_BUFFERPOOL, /* prewarm the buffer pool. */ + TASK_DUMP_BUFFERPOOL_INFO /* dump the buffer pool block info. */ +} AutoPrewarmTask; + +/* Shared state information for autoprewarm bgworker. 
*/ +typedef struct AutoPrewarmSharedState +{ + LWLock lock; /* mutual exclusion */ + pid_t bgworker_pid; /* for main bgworker */ + pid_t pid_using_dumpfile; /* for autoprewarm or block dump */ + bool skip_prewarm_on_restart; /* if set true, prewarm task + * will not be done */ + + /* Following items are for communication with per-database worker */ + dsm_handle block_info_handle; + Oid database; + int prewarm_start_idx; + int prewarm_stop_idx; +} AutoPrewarmSharedState; + +static AutoPrewarmSharedState *apw_state = NULL; + +/* + * This data structure represents buffered file. + */ +typedef struct BufferFile +{ + char transient_dump_file_path[MAXPGPATH]; /* actual file to be + * written */ + int fd; /* file descriptor to above file */ + char buf[BLCKSZ]; /* buffer used before writing to file */ + int pos; /* next write position in buffer. */ +} BufferFile; + +/* GUC variable that controls the dump activity of autoprewarm. */ +static int dump_interval = 0; + +/* + * GUC variable to decide whether autoprewarm worker should be started when + * preloaded. + */ +static bool enable_autoprewarm = true; + +/* Compare member elements to check whether they are not equal. */ +#define cmp_member_elem(fld) \ +do { \ + if (a->fld < b->fld) \ + return -1; \ + else if (a->fld > b->fld) \ + return 1; \ +} while(0); + +/* + * blockinfo_cmp + * Compare function used for qsort(). 
+ */ +static int +blockinfo_cmp(const void *p, const void *q) +{ + BlockInfoRecord *a = (BlockInfoRecord *) p; + BlockInfoRecord *b = (BlockInfoRecord *) q; + + cmp_member_elem(database); + cmp_member_elem(tablespace); + cmp_member_elem(filenode); + cmp_member_elem(forknum); + cmp_member_elem(blocknum); + return 0; +} + +/* ============================================================================ + * ===================== Prewarm part of autoprewarm ======================= + * ============================================================================ + */ + +/* + * reset_apw_state + * on_apw_exit reset the prewarm state + */ + +static void +reset_apw_state(int code, Datum arg) +{ + if (apw_state->pid_using_dumpfile == MyProcPid) + apw_state->pid_using_dumpfile = InvalidPid; + if (apw_state->bgworker_pid == MyProcPid) + apw_state->bgworker_pid = InvalidPid; +} + +/* + * init_apw_state + * Allocate and initialize autoprewarm related shared memory. + */ +static void +init_apw_state(void) +{ + bool found = false; + + LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE); + apw_state = ShmemInitStruct("autoprewarm", + sizeof(AutoPrewarmSharedState), + &found); + if (!found) + { + /* First time through ... */ + LWLockInitialize(&apw_state->lock, LWLockNewTrancheId()); + apw_state->bgworker_pid = InvalidPid; + apw_state->pid_using_dumpfile = InvalidPid; + apw_state->skip_prewarm_on_restart = false; + } + + LWLockRelease(AddinShmemInitLock); +} + +/* + * load_one_database + * This subroutine loads the BlockInfoRecords of the database set in + * AutoPrewarmSharedState. + * + * Connect to the database and load the blocks of that database which are given + * by [apw_state->prewarm_start_idx, apw_state->prewarm_stop_idx). + */ +void +load_one_database(Datum main_arg) +{ + uint32 pos; + BlockInfoRecord *block_info; + Relation rel = NULL; + BlockNumber nblocks = 0; + BlockInfoRecord *old_blk; + dsm_segment *seg; + + /* Establish signal handlers before unblocking signals. 
*/ + pqsignal(SIGTERM, apw_sigterm_handler); + pqsignal(SIGHUP, apw_sighup_handler); + + /* We're now ready to receive signals */ + BackgroundWorkerUnblockSignals(); + + init_apw_state(); + seg = dsm_attach(apw_state->block_info_handle); + if (seg == NULL) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("could not map dynamic shared memory segment"))); + + block_info = (BlockInfoRecord *) dsm_segment_address(seg); + + BackgroundWorkerInitializeConnectionByOid(apw_state->database, InvalidOid); + old_blk = NULL; + pos = apw_state->prewarm_start_idx; + + while (!got_sigterm && pos < apw_state->prewarm_stop_idx && + have_free_buffer()) + { + BlockInfoRecord *blk = &block_info[pos++]; + Buffer buf; + + /* + * Quit if we've reached records for another database. If previous + * blocks are of some global objects, then continue pre-warming. + */ + if (old_blk != NULL && old_blk->database != blk->database && + old_blk->database != 0) + break; + + /* + * As soon as we encounter a block of a new relation, close the old + * relation. Note, that rel will be NULL if try_relation_open failed + * previously, in that case there is nothing to close. + */ + if (old_blk != NULL && old_blk->filenode != blk->filenode && + rel != NULL) + { + relation_close(rel, AccessShareLock); + rel = NULL; + CommitTransactionCommand(); + } + + /* + * Try to open each new relation, but only once, when we first + * encounter it. If it's been dropped, skip the associated blocks. + */ + if (old_blk == NULL || old_blk->filenode != blk->filenode) + { + Oid reloid; + + Assert(rel == NULL); + StartTransactionCommand(); + reloid = RelidByRelfilenode(blk->tablespace, blk->filenode); + if (OidIsValid(reloid)) + rel = try_relation_open(reloid, AccessShareLock); + + if (!rel) + CommitTransactionCommand(); + } + if (!rel) + { + old_blk = blk; + continue; + } + + /* Once per fork, check for fork existence and size. 
*/ + if (old_blk == NULL || + old_blk->filenode != blk->filenode || + old_blk->forknum != blk->forknum) + { + RelationOpenSmgr(rel); + + /* + * smgrexists is not safe for illegal forknum, hence check whether + * the passed forknum is valid before using it in smgrexists. + */ + if (blk->forknum > InvalidForkNumber && + blk->forknum <= MAX_FORKNUM && + smgrexists(rel->rd_smgr, blk->forknum)) + nblocks = RelationGetNumberOfBlocksInFork(rel, blk->forknum); + else + nblocks = 0; + } + + /* Check whether blocknum is valid and within fork file size. */ + if (blk->blocknum >= nblocks) + { + /* Move to next forknum. */ + old_blk = blk; + continue; + } + + /* Prewarm buffer. */ + buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum, RBM_NORMAL, + NULL); + if (BufferIsValid(buf)) + ReleaseBuffer(buf); + + old_blk = blk; + } + + dsm_detach(seg); + + /* Release lock on previous relation. */ + if (rel) + { + relation_close(rel, AccessShareLock); + CommitTransactionCommand(); + } + + return; +} + +/* + * launch_and_wait_for_per_database_worker + * Register a per-database dynamic worker to load. + */ +static void +launch_and_wait_for_per_database_worker(void) +{ + BackgroundWorker worker; + BackgroundWorkerHandle *handle = NULL; + BgwHandleStatus status PG_USED_FOR_ASSERTS_ONLY; + + setup_autoprewarm(&worker, "autoprewarm", "load_one_database", + (Datum) NULL, BGW_NEVER_RESTART, + BGWORKER_BACKEND_DATABASE_CONNECTION); + + /* Set bgw_notify_pid so that we can use WaitForBackgroundWorkerShutdown */ + worker.bgw_notify_pid = MyProcPid; + + if (!RegisterDynamicBackgroundWorker(&worker, &handle)) + { + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_RESOURCES), + errmsg("registering dynamic bgworker autoprewarm failed"), + errhint("Consider increasing configuration parameter \"max_worker_processes\"."))); + } + + status = WaitForBackgroundWorkerShutdown(handle); + Assert(status == BGWH_STOPPED); +} + +/* + * prewarm_buffer_pool + * The main routine that prewarms the buffer pool. 
+ * + * The prewarm bgworker will first load all the BlockInfoRecords in + * $PGDATA/AUTOPREWARM_FILE to a DSM. Further, these BlockInfoRecords are + * separated based on their databases. Finally, for each group of + * BlockInfoRecords a per-database worker will be launched to load the + * corresponding blocks. Launch the next worker only after the previous one has + * finished its job. + */ +static void +prewarm_buffer_pool(void) +{ + FILE *file = NULL; + uint32 num_elements, + i; + BlockInfoRecord *blkinfo; + dsm_segment *seg; + + /* + * Since there can be at most one worker for prewarm, locking is not + * required for setting skip_prewarm_on_restart. + */ + apw_state->skip_prewarm_on_restart = true; + + LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE); + if (apw_state->pid_using_dumpfile == InvalidPid) + apw_state->pid_using_dumpfile = MyProcPid; + else + { + LWLockRelease(&apw_state->lock); + ereport(LOG, + (errmsg("skipping prewarm because block dump file is being written by PID %d", + apw_state->pid_using_dumpfile))); + return; + } + + LWLockRelease(&apw_state->lock); + + file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R); + if (!file) + { + if (errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read file \"%s\": %m", + AUTOPREWARM_FILE))); + + apw_state->pid_using_dumpfile = InvalidPid; + return; /* No file to load. */ + } + + if (fscanf(file, "<<%u>>i\n", &num_elements) != 1) + { + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read from file \"%s\": %m", + AUTOPREWARM_FILE))); + } + + seg = dsm_create(sizeof(BlockInfoRecord) * num_elements, 0); + + blkinfo = (BlockInfoRecord *) dsm_segment_address(seg); + + for (i = 0; i < num_elements; i++) + { + /* Get next block. 
*/ + if (5 != fscanf(file, "%u,%u,%u,%u,%u\n", &blkinfo[i].database, + &blkinfo[i].tablespace, &blkinfo[i].filenode, + (uint32 *) &blkinfo[i].forknum, &blkinfo[i].blocknum)) + break; + } + + FreeFile(file); + + if (num_elements != i) + elog(ERROR, "autoprewarm block dump has %u entries but expected %u", + i, num_elements); + + /* + * Sort the block number to increase the chance of sequential reads during + * load. + */ + pg_qsort(blkinfo, num_elements, sizeof(BlockInfoRecord), blockinfo_cmp); + + apw_state->block_info_handle = dsm_segment_handle(seg); + apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx = 0; + + /* Get the info position of the first block of the next database. */ + while (apw_state->prewarm_start_idx < num_elements) + { + uint32 i = apw_state->prewarm_start_idx; + Oid current_db = blkinfo[i].database; + + /* + * Advance the prewarm_stop_idx to the first BlockRecordInfo that does + * not belong to this database. + */ + i++; + while (i < num_elements) + { + if (current_db != blkinfo[i].database) + { + /* + * Combine BlockRecordInfos of global object with the next + * non-global object. + */ + if (current_db != InvalidOid) + break; + current_db = blkinfo[i].database; + } + + i++; + } + + /* + * If we reach this point with current_db == InvalidOid, then only + * BlockRecordInfos belonging to global objects exist. Since, we can + * not connect with InvalidOid skip prewarming for these objects. + */ + if (current_db == InvalidOid) + break; + + apw_state->prewarm_stop_idx = i; + apw_state->database = current_db; + + Assert(apw_state->prewarm_start_idx < apw_state->prewarm_stop_idx); + + /* + * Register a per-database worker to load blocks of the database. Wait + * until it has finished before starting the next worker. 
+ */ + launch_and_wait_for_per_database_worker(); + apw_state->prewarm_start_idx = apw_state->prewarm_stop_idx; + } + + dsm_detach(seg); + apw_state->block_info_handle = DSM_HANDLE_INVALID; + + apw_state->pid_using_dumpfile = InvalidPid; + ereport(DEBUG1, + (errmsg("autoprewarm load task ended"))); + return; +} + +/* + * ============================================================================ + * ===================== Dump part of Autoprewarm ============================= + * ============================================================================ + */ + +/* + * This submodule is for periodically dumping BlockRecordInfos in buffer pool + * into a dump file AUTOPREWARM_FILE. + * Each entry of BlockRecordInfo consists of database, tablespace, filenode, + * forknum, blocknum. Note that this is in the text form so that the dump + * information is readable and can be edited, if required. + */ + +/* + * buffer_file_flush + * Unload the buffer contents to actual file. + * + */ +static void +buffer_file_flush(BufferFile * file) +{ + ssize_t w_size; + char *buf = file->buf; + + while (file->pos) + { + /* write to file until an error */ + w_size = write(file->fd, buf, file->pos); + if (w_size > 0) + { + file->pos -= w_size; + buf += w_size; + } + else + { + int save_errno = errno; + + CloseTransientFile(file->fd); + unlink(file->transient_dump_file_path); + errno = save_errno; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\" : %m", + file->transient_dump_file_path))); + } + } +} + +/* + * buffer_file_write + * First accumulate the contents in a BLCKSZ buffer then unload it to + * actual file. + */ +static void +buffer_file_write(BufferFile * file, char *block_info, int block_info_len) +{ + Assert(block_info_len <= BLCKSZ); + + /* If we exceed the buffer size unload buffer to actual file. 
*/ + if ((file->pos + block_info_len) > BLCKSZ) + buffer_file_flush(file); + + memcpy(file->buf + file->pos, block_info, block_info_len); + file->pos += block_info_len; +} + +/* + * dump_now + * Dumps BlockRecordInfos in buffer pool. + */ +static uint32 +dump_now(bool is_bgworker) +{ + uint32 i; + int ret, + block_info_len; + uint32 num_blocks; + BlockInfoRecord *block_info_array; + BufferDesc *bufHdr; + BufferFile *file; + char block_info[1024]; + + LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE); + if (apw_state->pid_using_dumpfile == InvalidPid) + apw_state->pid_using_dumpfile = MyProcPid; + else + { + LWLockRelease(&apw_state->lock); + + if (!is_bgworker) + ereport(ERROR, + (errmsg("could not perform block dump because dump file is being used by PID %d", + apw_state->pid_using_dumpfile))); + ereport(LOG, + (errmsg("skipping block dump because it is already being performed by PID %d", + apw_state->pid_using_dumpfile))); + return 0; + } + + LWLockRelease(&apw_state->lock); + + block_info_array = + (BlockInfoRecord *) palloc(sizeof(BlockInfoRecord) * NBuffers); + + for (num_blocks = 0, i = 0; i < NBuffers; i++) + { + uint32 buf_state; + + /* In case of a SIGHUP, just reload the configuration. */ + if (got_sighup) + { + got_sighup = false; + ProcessConfigFile(PGC_SIGHUP); + } + + /* Have we been asked to stop dump? */ + if (dump_interval == AT_PWARM_OFF) + { + pfree(block_info_array); + return 0; + } + + bufHdr = GetBufferDescriptor(i); + + /* Lock each buffer header before inspecting. 
*/ + buf_state = LockBufHdr(bufHdr); + + if (buf_state & BM_TAG_VALID) + { + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode; + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode; + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode; + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum; + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum; + ++num_blocks; + } + + UnlockBufHdr(bufHdr, buf_state); + } + + file = (BufferFile *) palloc(sizeof(BufferFile)); + snprintf(file->transient_dump_file_path, MAXPGPATH, "%s.tmp", + AUTOPREWARM_FILE); + + file->fd = OpenTransientFile(file->transient_dump_file_path, + O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY, 0666); + if (file->fd < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open \"%s\": %m", + file->transient_dump_file_path))); + file->pos = 0; + + block_info_len = sprintf(block_info, "<<%u>>\n", num_blocks); + buffer_file_write(file, block_info, block_info_len); + + for (i = 0; i < num_blocks; i++) + { + /* In case of a SIGHUP, just reload the configuration. */ + if (got_sighup) + { + got_sighup = false; + ProcessConfigFile(PGC_SIGHUP); + } + + /* Have we been asked to stop dump? */ + if (dump_interval == AT_PWARM_OFF) + { + pfree(block_info_array); + CloseTransientFile(file->fd); + unlink(file->transient_dump_file_path); + pfree(file); + return 0; + } + + block_info_len = sprintf(block_info, "%u,%u,%u,%u,%u\n", + block_info_array[i].database, + block_info_array[i].tablespace, + block_info_array[i].filenode, + (uint32) block_info_array[i].forknum, + block_info_array[i].blocknum); + + buffer_file_write(file, block_info, block_info_len); + } + + pfree(block_info_array); + + /* Write remaining buffer contents to actual file. */ + buffer_file_flush(file); + + /* + * Rename transient_dump_file_path to AUTOPREWARM_FILE to make things + * permanent. 
+ */ + ret = CloseTransientFile(file->fd); + if (ret != 0) + { + int save_errno = errno; + + unlink(file->transient_dump_file_path); + errno = save_errno; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close file \"%s\" : %m", + file->transient_dump_file_path))); + } + + (void) durable_rename(file->transient_dump_file_path, AUTOPREWARM_FILE, + ERROR); + pfree(file); + apw_state->pid_using_dumpfile = InvalidPid; + + ereport(DEBUG1, + (errmsg("saved metadata info of %d blocks", num_blocks))); + return num_blocks; +} + +/* + * dump_block_info_periodically + * This loop periodically call dump_now(). + * + * Call dum_now() at regular intervals defined by GUC variable dump_interval. + */ +void +dump_block_info_periodically(void) +{ + TimestampTz last_dump_time = 0; + + while (!got_sigterm) + { + int rc; + struct timeval nap; + + nap.tv_sec = AT_PWARM_DEFAULT_DUMP_INTERVAL; + nap.tv_usec = 0; + + /* Have we been asked to stop dumping? */ + if (dump_interval == AT_PWARM_OFF) + return; + + if (dump_interval > AT_PWARM_DUMP_AT_SHUTDOWN_ONLY) + { + TimestampTz current_time = GetCurrentTimestamp(); + + if (last_dump_time == 0 || + TimestampDifferenceExceeds(last_dump_time, + current_time, + (dump_interval * 1000))) + { + dump_now(true); + + /* + * It is better to stop when shutdown signal is received + * during or right after a dump. + */ + if (got_sigterm) + return; + last_dump_time = GetCurrentTimestamp(); + nap.tv_sec = dump_interval; + nap.tv_usec = 0; + } + else + { + long secs; + int usecs; + + TimestampDifference(last_dump_time, current_time, + &secs, &usecs); + nap.tv_sec = dump_interval - secs; + nap.tv_usec = 0; + } + } + else + last_dump_time = 0; + + ResetLatch(&MyProc->procLatch); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L), + PG_WAIT_EXTENSION); + + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + + /* In case of a SIGHUP, just reload the configuration. 
*/ + if (got_sighup) + { + got_sighup = false; + ProcessConfigFile(PGC_SIGHUP); + } + } + + /* It's time for postmaster shutdown, let's dump for one last time. */ + if (dump_interval != AT_PWARM_OFF) + dump_now(true); +} + +/* + * autoprewarm_main + * The main entry point of autoprewarm bgworker process. + */ +void +autoprewarm_main(Datum main_arg) +{ + AutoPrewarmTask todo_task; + + /* Establish signal handlers before unblocking signals. */ + pqsignal(SIGTERM, apw_sigterm_handler); + pqsignal(SIGHUP, apw_sighup_handler); + pqsignal(SIGUSR1, apw_sigusr1_handler); + + /* We're now ready to receive signals. */ + BackgroundWorkerUnblockSignals(); + + todo_task = DatumGetInt32(main_arg); + Assert(todo_task == TASK_PREWARM_BUFFERPOOL || + todo_task == TASK_DUMP_BUFFERPOOL_INFO); + init_apw_state(); + + LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE); + if (apw_state->bgworker_pid != InvalidPid) + { + LWLockRelease(&apw_state->lock); + ereport(LOG, + (errmsg("autoprewarm worker is already running under PID %d", + apw_state->bgworker_pid))); + return; + } + + apw_state->bgworker_pid = MyProcPid; + LWLockRelease(&apw_state->lock); + + on_shmem_exit(reset_apw_state, 0); + + ereport(LOG, + (errmsg("autoprewarm worker started"))); + + /* + * We have finished initializing worker's state, let's start actual work. + */ + if (todo_task == TASK_PREWARM_BUFFERPOOL && + !apw_state->skip_prewarm_on_restart) + prewarm_buffer_pool(); + + dump_block_info_periodically(); + + ereport(LOG, + (errmsg("autoprewarm worker stopped"))); +} + +/* ============================================================================ + * ============= Extension's entry functions/utilities =================== + * ============================================================================ + */ + +/* + * setup_autoprewarm + * A common function to initialize BackgroundWorker structure. 
+ */
+static void
+setup_autoprewarm(BackgroundWorker *autoprewarm, const char *worker_name,
+                  const char *worker_function, Datum main_arg, int restart_time,
+                  int extra_flags)
+{
+    MemSet(autoprewarm, 0, sizeof(BackgroundWorker));
+    autoprewarm->bgw_flags = BGWORKER_SHMEM_ACCESS | extra_flags;
+
+    /* Register the autoprewarm background worker */
+    autoprewarm->bgw_start_time = BgWorkerStart_ConsistentState;
+    autoprewarm->bgw_restart_time = restart_time;
+    strcpy(autoprewarm->bgw_library_name, "pg_prewarm");
+    strcpy(autoprewarm->bgw_function_name, worker_function);
+    strncpy(autoprewarm->bgw_name, worker_name, BGW_MAXLEN);
+    autoprewarm->bgw_main_arg = main_arg;
+}
+
+/*
+ * _PG_init
+ *      Extension's entry point.
+ */
+void
+_PG_init(void)
+{
+    BackgroundWorker prewarm_worker;
+
+    /* Define custom GUC variables. */
+
+    DefineCustomIntVariable("pg_prewarm.dump_interval",
+                            "Sets the maximum time between two buffer pool dumps",
+                            "If set to zero, timer based dumping is disabled."
+                            " If set to -1, stops autoprewarm.",
+                            &dump_interval,
+                            AT_PWARM_DEFAULT_DUMP_INTERVAL,
+                            AT_PWARM_OFF, INT_MAX / 1000,
+                            PGC_SIGHUP,
+                            GUC_UNIT_S,
+                            NULL,
+                            NULL,
+                            NULL);
+
+    if (process_shared_preload_libraries_in_progress)
+        DefineCustomBoolVariable("pg_prewarm.autoprewarm",
+                                 "Enable/Disable auto-prewarm feature.",
+                                 NULL,
+                                 &enable_autoprewarm,
+                                 true,
+                                 PGC_POSTMASTER,
+                                 0,
+                                 NULL,
+                                 NULL,
+                                 NULL);
+    else
+    {
+        /* If not run as a preloaded library, nothing more to do. */
+        EmitWarningsOnPlaceholders("pg_prewarm");
+        return;
+    }
+
+    EmitWarningsOnPlaceholders("pg_prewarm");
+
+    /* Request additional shared resources. */
+    RequestAddinShmemSpace(MAXALIGN(sizeof(AutoPrewarmSharedState)));
+
+    /* If autoprewarm bgworker is disabled then nothing more to do. */
+    if (!enable_autoprewarm)
+        return;
+
+    /* Register autoprewarm load.
+     */
+    setup_autoprewarm(&prewarm_worker, "autoprewarm", "autoprewarm_main",
+                      Int32GetDatum(TASK_PREWARM_BUFFERPOOL), 0, 0);
+    RegisterBackgroundWorker(&prewarm_worker);
+}
+
+/*
+ * autoprewarm_dump_launcher
+ *      Dynamically launch an autoprewarm dump worker.
+ */
+static pid_t
+autoprewarm_dump_launcher(void)
+{
+    BackgroundWorker worker;
+    BackgroundWorkerHandle *handle;
+    BgwHandleStatus status;
+    pid_t       pid;
+
+    setup_autoprewarm(&worker, "autoprewarm", "autoprewarm_main",
+                      Int32GetDatum(TASK_DUMP_BUFFERPOOL_INFO), 0, 0);
+
+    /* Set bgw_notify_pid so that we can use WaitForBackgroundWorkerStartup */
+    worker.bgw_notify_pid = MyProcPid;
+
+    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+                 errmsg("registering dynamic bgworker \"autoprewarm\" failed"),
+                 errhint("Consider increasing configuration parameter \"max_worker_processes\".")));
+    }
+
+    status = WaitForBackgroundWorkerStartup(handle, &pid);
+    if (status == BGWH_STOPPED)
+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+                 errmsg("could not start autoprewarm dump bgworker"),
+                 errhint("More details may be available in the server log.")));
+    }
+
+    if (status == BGWH_POSTMASTER_DIED)
+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+                 errmsg("cannot start bgworker autoprewarm without postmaster"),
+                 errhint("Kill all remaining database processes and restart the database.")));
+    }
+
+    Assert(status == BGWH_STARTED);
+    return pid;
+}
+
+/*
+ * launch_autoprewarm_dump
+ *      The C-Language entry function to launch autoprewarm dump bgworker.
+ */
+Datum
+launch_autoprewarm_dump(PG_FUNCTION_ARGS)
+{
+    pid_t       pid;
+
+    /* If dump_interval is disabled then nothing more to do. */
+    if (dump_interval == AT_PWARM_OFF)
+        PG_RETURN_NULL();
+
+    pid = autoprewarm_dump_launcher();
+    PG_RETURN_INT32(pid);
+}
+
+/*
+ * autoprewarm_dump_now
+ *      The C-Language entry function to dump immediately.
+ */
+Datum
+autoprewarm_dump_now(PG_FUNCTION_ARGS)
+{
+    uint32      num_blocks = 0;
+
+    init_apw_state();
+
+    PG_TRY();
+    {
+        num_blocks = dump_now(false);
+    }
+    PG_CATCH();
+    {
+        if (apw_state->pid_using_dumpfile == MyProcPid)
+            apw_state->pid_using_dumpfile = InvalidPid;
+        PG_RE_THROW();
+    }
+    PG_END_TRY();
+    PG_RETURN_INT64(num_blocks);
+}
diff --git a/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
new file mode 100644
index 0000000..a2241c6
--- /dev/null
+++ b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
@@ -0,0 +1,14 @@
+/* contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_prewarm UPDATE TO '1.2'" to load this file. \quit
+
+CREATE FUNCTION launch_autoprewarm_dump()
+RETURNS pg_catalog.int4 STRICT
+AS 'MODULE_PATHNAME', 'launch_autoprewarm_dump'
+LANGUAGE C;
+
+CREATE FUNCTION autoprewarm_dump_now()
+RETURNS pg_catalog.int8 STRICT
+AS 'MODULE_PATHNAME', 'autoprewarm_dump_now'
+LANGUAGE C;
diff --git a/contrib/pg_prewarm/pg_prewarm.control b/contrib/pg_prewarm/pg_prewarm.control
index cf2fb92..40e3add 100644
--- a/contrib/pg_prewarm/pg_prewarm.control
+++ b/contrib/pg_prewarm/pg_prewarm.control
@@ -1,5 +1,5 @@
 # pg_prewarm extension
 comment = 'prewarm relation data'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/pg_prewarm'
 relocatable = true
diff --git a/doc/src/sgml/pgprewarm.sgml b/doc/src/sgml/pgprewarm.sgml
index c090401..7f1972d 100644
--- a/doc/src/sgml/pgprewarm.sgml
+++ b/doc/src/sgml/pgprewarm.sgml
@@ -10,7 +10,9 @@
  <para>
   The <filename>pg_prewarm</filename> module provides a convenient way
   to load relation data into either the operating system buffer cache
-  or the <productname>PostgreSQL</productname> buffer cache.
+  or the <productname>PostgreSQL</productname> buffer cache. Additionally, an
+  automatic prewarming of the server buffers is supported whenever the server
+  restarts.
  </para>
 
  <sect2>
@@ -55,6 +57,103 @@ pg_prewarm(regclass, mode text default 'buffer', fork text default 'main',
    cache. For these reasons, prewarming is typically most useful at startup,
    when caches are largely empty.
   </para>
+
+<synopsis>
+launch_autoprewarm_dump() RETURNS int4
+</synopsis>
+
+  <para>
+   This will launch the <literal>autoprewarm</literal> worker which will dump
+   shared buffers to disk at the interval specified by
+   <varname>pg_prewarm.dump_interval</varname>. The return value is the
+   process ID of the autoprewarm worker. As only one
+   <literal>autoprewarm</literal> worker can be run per cluster at a time,
+   additional invocations will return a process ID, but that process will
+   immediately exit.
+  </para>
+
+<synopsis>
+autoprewarm_dump_now() RETURNS int8
+</synopsis>
+
+  <para>
+   This will immediately dump shared buffers to disk. The return value is
+   the number of blocks dumped.
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>autoprewarm</title>
+
+  <para>
+   This is a background worker process which will automatically dump shared
+   buffers to disk before a shutdown and then prewarm shared buffers the
+   next time the server is started by loading blocks from disk back into
+   the buffer pool.
+  </para>
+
+  <para>
+   When the shared library <literal>pg_prewarm</literal> is preloaded via
+   <xref linkend="guc-shared-preload-libraries"> in <filename>postgresql.conf</>,
+   an <literal>autoprewarm</literal> background worker is launched immediately after the
+   server has reached a consistent state. The autoprewarm process will start loading blocks
+   recorded in <filename>$PGDATA/autoprewarm.blocks</filename> until there is a
+   free buffer left in the buffer pool. This way we do not replace any new
+   blocks which were loaded either by the recovery process or the querying
+   clients.
+  </para>
+
+  <para>
+   Once the <literal>autoprewarm</literal> process has finished loading buffers
+   from disk, it will periodically dump shared buffers to disk at the interval
+   specified by <varname>pg_prewarm.dump_interval</varname>. Upon the next
+   server restart, the autoprewarm process will prewarm shared buffers with the
+   blocks that were last dumped to disk.
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>Configuration Parameters</title>
+
+ <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_prewarm.enable_autoprewarm</varname> (<type>boolean</type>)
+     <indexterm>
+      <primary><varname>pg_prewarm.enable_autoprewarm</> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      If set to <literal>on</literal>, an autoprewarm worker will be started
+      upon server start. Setting this to <literal>off</literal> disables it.
+      The default value is <literal>on</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+ </variablelist>
+
+ <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_prewarm.dump_interval</varname> (<type>int</type>)
+     <indexterm>
+      <primary><varname>pg_prewarm.dump_interval</> configuration parameter</primary>
+     </indexterm>
+    </term>
+    <listitem>
+     <para>
+      This is the minimum number of seconds after which autoprewarm dumps
+      shared buffers to disk. The default is 300 seconds. If set to 0,
+      shared buffers will not be dumped at regular intervals, only when the
+      server is shut down.
+      If set to -1, the running <literal>autoprewarm</literal> process will
+      be stopped.
+     </para>
+    </listitem>
+   </varlistentry>
+ </variablelist>
+ </sect2>
 
  <sect2>
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 9d8ae6a..f033323 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -169,6 +169,23 @@ ClockSweepTick(void)
 }
 
 /*
+ * have_free_buffer -- a lockless check to see if there is a free buffer in
+ *                     buffer pool.
+ *
+ * A true result can become stale as soon as other operations take the free
+ * buffers, so a caller that strictly needs a free buffer should not rely on
+ * this check.
+ */
+bool
+have_free_buffer(void)
+{
+    if (StrategyControl->firstFreeBuffer >= 0)
+        return true;
+    else
+        return false;
+}
+
+/*
  * StrategyGetBuffer
  *
  * Called by the bufmgr to get the next candidate buffer to use in
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b768b6f..300adfc 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -317,6 +317,7 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
 
 extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
+extern bool have_free_buffer(void);
 
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23a4bbd..8785b3b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -138,6 +138,8 @@ AttrDefault
 AttrNumber
 AttributeOpts
 AuthRequest
+AutoPrewarmSharedState
+AutoPrewarmTask
 AutoVacOpts
 AutoVacuumShmemStruct
 AutoVacuumWorkItem
@@ -214,10 +216,12 @@ BitmapOr
 BitmapOrPath
 BitmapOrState
 Bitmapset
+BlkType
 BlobInfo
 Block
 BlockId
 BlockIdData
+BlockInfoRecord
 BlockNumber
 BlockSampler
 BlockSamplerData
@@ -2870,6 +2874,7 @@ pos_trgm
 post_parse_analyze_hook_type
 pqbool
 pqsigfunc
+prewarm_elem
 printQueryOpt
 printTableContent
 printTableFooter
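
For anyone following the timing logic, the decision made on each pass through dump_block_info_periodically() can be sketched as a standalone C function. This is only an illustration and not part of the patch: decide_dump(), DumpDecision and the INTERVAL_* names are hypothetical, the constant values (-1, 0, 300) are assumed from the GUC description and the documentation in the patch, and plain seconds stand in for TimestampTz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's AT_PWARM_* constants (values assumed). */
enum
{
    INTERVAL_OFF = -1,              /* worker should stop */
    INTERVAL_SHUTDOWN_ONLY = 0,     /* no timed dumps, dump only at shutdown */
    INTERVAL_DEFAULT = 300          /* default nap, in seconds */
};

typedef struct DumpDecision
{
    bool        exit_worker;        /* leave the loop without dumping */
    bool        dump_now;           /* a dump is due on this pass */
    int         nap_secs;           /* how long to sleep afterwards */
} DumpDecision;

/*
 * Decide what one loop iteration should do.  last_dump == 0 means that no
 * dump has happened yet, mirroring last_dump_time in the patch.
 */
static DumpDecision
decide_dump(int dump_interval, int64_t last_dump, int64_t now)
{
    DumpDecision d = {false, false, INTERVAL_DEFAULT};

    assert(now >= last_dump);

    if (dump_interval == INTERVAL_OFF)
    {
        d.exit_worker = true;
        return d;
    }

    if (dump_interval > INTERVAL_SHUTDOWN_ONLY)
    {
        if (last_dump == 0 || now - last_dump >= dump_interval)
        {
            /* Interval elapsed (or first pass): dump, then sleep a full interval. */
            d.dump_now = true;
            d.nap_secs = dump_interval;
        }
        else
        {
            /* Not due yet: sleep only for the remainder of the interval. */
            d.nap_secs = dump_interval - (int) (now - last_dump);
        }
    }

    return d;
}
```

With a 60 second interval and a dump taken 30 seconds ago, the sketch asks for a 30 second nap and no dump. With an interval of 0 it never dumps on a timer and naps for the default period, which matches the "dump at shutdown only" behaviour described in the documentation hunk.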