On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pry...@telsasoft.com> wrote:
>
> Rebased to appease cfbot.
>
> I ran these paches under a branch which shows code coverage in cirrus. It
> looks good to my eyes.
> https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html
Thanks for doing that and for the rebase! Since I made updates, the
attached version 6 is also rebased.

To Dmitry's question:

On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
>
> I don't see it in the discussion, so naturally curious -- why directmgr
> is not used for bloom index, e.g. in blbuildempty?

Thanks for pointing this out. blbuildempty() is also included now. Bloom
doesn't seem to use smgr* anywhere except blbuildempty(), so that is the
only place I made changes in the bloom index build.

v6 has the following updates/changes:

- removed erroneous extra calls to unbuffered_prep() and
  unbuffered_finish() in the GiST and btree index builds

- removed unnecessary includes

- some comments were updated to accurately reflect the use of directmgr

- smgrwrite() doesn't WAL-log anymore

  This one I'm not sure about. I think it makes sense that
  unbuffered_extend() of non-empty pages of WAL-logged relations (or of
  the init fork of unlogged relations) does log_newpage(), but I wasn't
  so sure about smgrwrite(). Currently all callers of smgrwrite() do
  log_newpage(), and anyone using mdwrite() will end up writing the whole
  page. However, it seems possible that a caller of the directmgr API
  might wish to write to a particular offset and, either because of that
  or for some other reason, require different logging than what
  log_newpage() does. I didn't want to make a separate wrapper function
  for WAL-logging in directmgr because it felt like one more step to
  forget.

- heapam_relation_set_new_filenode() doesn't use the directmgr API
  anymore for unlogged relations

  It doesn't extend or write, so it seemed like a special case better
  left alone. Note that the ambuildempty() functions, which also write to
  the init fork of an unlogged relation, still use the directmgr API. It
  is a bit confusing because they pass do_wal=true to unbuffered_prep()
  even though they are unlogged relations, because they must log and
  fsync.

- interface changes to unbuffered_prep()

  I removed the parameters to unbuffered_prep() which required the user
  to specify whether fsync should be added to pendingOps or done with
  smgrimmedsync(). Understanding all of the combinations of these
  parameters and when they were needed was confusing, and the interface
  felt like a foot gun. Special cases shouldn't use this interface. I
  prefer the idea that users of this API expect that

  1) empty pages won't be checksummed or WAL-logged

  2) for relations that are WAL-logged, when the build is done, the
     relation will be fsync'd by the backend (unless the fsync
     optimization is being used)

  3) the only case in which fsync requests are added to the pendingOps
     queue is when the fsync optimization is being used (which saves the
     redo pointer and checks it later to determine whether the backend
     needs to fsync itself)

  I also added the parameter "do_wal" to unbuffered_prep() and to the
  UnBufferedWriteState struct. It is used when extending the file to
  determine whether or not to log_newpage(). unbuffered_extend() and
  unbuffered_write() no longer take do_wal as a parameter. Note that
  callers need to pass do_wal=true/false to unbuffered_prep() based on
  whether or not they want log_newpage() called during
  unbuffered_extend() -- not simply based on whether or not the relation
  in question is WAL-logged. (A rough sketch of the resulting call
  pattern is included after my sign-off below.)
  do_wal is the only member of the UnBufferedWriteState struct in the
  first patch in the set, but I think it makes sense to keep the struct
  around, since I anticipate that the patch containing the other members
  needed for the fsync optimization will be committed at the same time.

  One final note on unbuffered_prep() -- I am thinking of renaming
  "do_optimization" to "try_optimization" or maybe
  "request_fsync_optimization". The interface of unbuffered_prep() would
  be better if we always tried to do the optimization when relevant
  (i.e., when the relation is WAL-logged).

- freespace map, visimap, and hash index don't use the directmgr API
  anymore

  For most use cases writing/extending outside shared buffers, it isn't
  safe to rely on requesting fsync from checkpointer. visimap, fsm, and
  hash index all request fsync from checkpointer for the pages they write
  with smgrextend() and don't smgrimmedsync() when finished adding pages
  (even when the hash index is WAL-logged). Supporting these exceptions
  made the interface incoherent, so I cut them.

- added unbuffered_extend_range()

  This one is a bit unfortunate. Because the GiST index build uses
  log_newpages() to log a whole page range but calls smgrextend() for
  each of those pages individually, I couldn't use the
  unbuffered_extend() function easily. So, I thought it might be useful
  in other contexts as well to have a function which calls smgrextend()
  for a range of pages and then calls log_newpages(). I've added this.

  However, there are two parts of the GiST index build's flushing of
  ready pages that didn't work with this either. The first is that it
  does an error check on the block numbers one at a time while looping
  through them to write the pages. To retain this check, I loop through
  the ready pages an extra time before calling unbuffered_extend_range(),
  which is probably not acceptable. Also, GiST needs to use a custom LSN
  for the pages. To achieve this, I added a "custom_lsn" parameter to
  unbuffered_extend_range(). This isn't great either. If this were a more
  common case, I could pass the custom LSN to unbuffered_prep(). (A
  paraphrase of the resulting GiST flush code is also included after my
  sign-off below.)

- Melanie
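P.S. To make the intended calling convention concrete, here is a rough
sketch (not code from the attached patches) of a caller writing a new
WAL-logged relation fork outside shared buffers with the v6 API. It is
loosely modeled on the RelationCopyStorage() changes in 0001;
fill_page() is a made-up placeholder for whatever produces each page's
contents.

#include "postgres.h"

#include "storage/directmgr.h"

static void
write_new_fork_unbuffered(SMgrRelation dst, ForkNumber forknum,
                          BlockNumber nblocks, bool use_wal)
{
    UnBufferedWriteState wstate;
    PGAlignedBlock buf;
    Page        page = (Page) buf.data;
    BlockNumber blkno;

    /* Record whether extends of non-empty pages should be WAL-logged */
    unbuffered_prep(&wstate, use_wal);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        fill_page(page, blkno);     /* hypothetical page-building step */

        /*
         * Checksums the page, smgrextend()s it, and, when do_wal was
         * passed as true and the page is not empty, also calls
         * log_newpage().
         */
        unbuffered_extend(&wstate, dst, forknum, blkno, page, false);
    }

    /*
     * For WAL-logged data the backend must fsync here itself, since
     * checkpointer never saw these pages. With 0002's optimization
     * requested, this fsync is instead skipped unless the redo pointer
     * has moved since unbuffered_prep().
     */
    unbuffered_finish(&wstate, dst, forknum);
}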
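P.P.S. For the unbuffered_extend_range() point above, this is roughly
what the GiST flush of ready pages ends up looking like with 0001
applied (a paraphrased sketch, not a literal excerpt from the patch),
including the extra validation pass I am unhappy about:

static void
gist_flush_ready_pages_sketch(GISTBuildState *state)
{
    int         i;

    if (state->ready_num_pages == 0)
        return;

    /* Extra pass, kept only for the block-ordering sanity check */
    for (i = 0; i < state->ready_num_pages; i++)
    {
        if (state->ready_blknos[i] != state->pages_written)
            elog(ERROR, "unexpected block number to flush GiST sorting build");
        state->pages_written++;
    }

    /*
     * Checksums and smgrextend()s each ready page, stamps each page with
     * the custom LSN (GistBuildLSN), and, when do_wal was set in
     * unbuffered_prep(), WAL-logs the whole range with a single
     * log_newpages() record.
     */
    unbuffered_extend_range(&state->ub_wstate,
                            RelationGetSmgr(state->indexrel), MAIN_FORKNUM,
                            state->ready_num_pages, state->ready_blknos,
                            state->ready_pages, false, GistBuildLSN);

    for (i = 0; i < state->ready_num_pages; i++)
        pfree(state->ready_pages[i]);

    state->ready_num_pages = 0;
}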
From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplage...@gmail.com> Date: Fri, 4 Mar 2022 15:53:05 -0500 Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization While building a btree index, the backend can avoid fsync'ing all of the pages if it uses the optimization introduced in a prior commit. This can substantially improve performance when many indexes are being built during DDL operations. --- src/backend/access/nbtree/nbtree.c | 2 +- src/backend/access/nbtree/nbtsort.c | 6 +++++- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 6b78acefbe..fc5cce4603 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -161,7 +161,7 @@ btbuildempty(Relation index) * internally. However, were this to be replaced with unbuffered_extend(), * do_wal must be true to ensure the data is logged and fsync'd. */ - unbuffered_prep(&wstate, true, false); + unbuffered_prep(&wstate, true, true); /* Construct metapage. */ metapage = (Page) palloc(BLCKSZ); diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index d6d0d4b361..f1b9e2e24e 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) int64 tuples_done = 0; bool deduplicate; - unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false); + /* + * The fsync optimization done by directmgr is only relevant if + * WAL-logging, so pass btws_use_wal for this parameter. + */ + unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal); deduplicate = wstate->inskey->allequalimage && !btspool->isunique && BTGetDeduplicateItems(wstate->index); -- 2.30.2
From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplage...@gmail.com> Date: Fri, 4 Mar 2022 14:48:39 -0500 Subject: [PATCH v6 1/4] Add unbuffered IO API Wrap unbuffered extends and writes in a new API, directmgr. When writing data outside of shared buffers, the backend must do a series of steps to ensure the data is both durable and recoverable. When writing or extending a page of data outside of shared buffers the backend must log the write or extend (if table is WAL-logged), checksum the page (if it is not empty), and write it out before moving on. Additionally, the backend must fsync the page data to ensure it reaches permanent storage since checkpointer is unaware of the buffer and could move the Redo pointer past the associated WAL for this write/extend before the data is safely on permanent storage. This commit introduces no functional change. It replaces many current callers of smgrimmedsync(), smgrextend(), and smgrwrite() with the equivalent directmgr functions. Consolidating these steps makes IO outside of shared buffers less error-prone. --- contrib/bloom/blinsert.c | 25 ++++--- src/backend/access/gist/gistbuild.c | 50 +++++-------- src/backend/access/heap/rewriteheap.c | 71 ++++++------------- src/backend/access/nbtree/nbtree.c | 26 ++++--- src/backend/access/nbtree/nbtsort.c | 73 ++++++++----------- src/backend/access/spgist/spginsert.c | 41 +++++------ src/backend/catalog/storage.c | 29 ++------ src/backend/storage/Makefile | 2 +- src/backend/storage/direct/Makefile | 17 +++++ src/backend/storage/direct/directmgr.c | 98 ++++++++++++++++++++++++++ src/include/storage/directmgr.h | 53 ++++++++++++++ 11 files changed, 296 insertions(+), 189 deletions(-) create mode 100644 src/backend/storage/direct/Makefile create mode 100644 src/backend/storage/direct/directmgr.c create mode 100644 src/include/storage/directmgr.h diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index c94cf34e69..7954a17e2d 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -19,8 +19,8 @@ #include "catalog/index.h" #include "miscadmin.h" #include "storage/bufmgr.h" +#include "storage/directmgr.h" #include "storage/indexfsm.h" -#include "storage/smgr.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -164,6 +164,16 @@ void blbuildempty(Relation index) { Page metapage; + UnBufferedWriteState wstate; + + /* + * Though this is an unlogged relation, pass do_wal=true since the init + * fork of an unlogged index must be wal-logged and fsync'd. This currently + * has no effect, as unbuffered_write() does not do log_newpage() + * internally. However, were this to be replaced with unbuffered_extend(), + * do_wal must be true to ensure the data is logged and fsync'd. + */ + unbuffered_prep(&wstate, true); /* Construct metapage. */ metapage = (Page) palloc(BLCKSZ); @@ -176,18 +186,13 @@ blbuildempty(Relation index) * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need * this even when wal_level=minimal. 
*/ - PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO); - smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, - (char *) metapage, true); + unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM, + BLOOM_METAPAGE_BLKNO, metapage); + log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO, metapage, true); - /* - * An immediate sync is required even if we xlog'd the page, because the - * write did not go through shared_buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM); + unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM); } /* diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c index e081e6571a..fc09938f80 100644 --- a/src/backend/access/gist/gistbuild.c +++ b/src/backend/access/gist/gistbuild.c @@ -43,6 +43,7 @@ #include "miscadmin.h" #include "optimizer/optimizer.h" #include "storage/bufmgr.h" +#include "storage/directmgr.h" #include "storage/smgr.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -91,6 +92,7 @@ typedef struct int64 indtuples; /* number of tuples indexed */ + UnBufferedWriteState ub_wstate; /* * Extra data structures used during a buffering build. 'gfbb' contains * information related to managing the build buffers. 'parentMap' is a @@ -410,13 +412,15 @@ gist_indexsortbuild(GISTBuildState *state) state->pages_written = 0; state->ready_num_pages = 0; + unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel)); + /* * Write an empty page as a placeholder for the root page. It will be * replaced with the real root page at the end. */ page = palloc0(BLCKSZ); - smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO, - page, true); + unbuffered_extend(&state->ub_wstate, RelationGetSmgr(state->indexrel), + MAIN_FORKNUM, GIST_ROOT_BLKNO, page, true); state->pages_allocated++; state->pages_written++; @@ -458,27 +462,19 @@ gist_indexsortbuild(GISTBuildState *state) /* Write out the root */ PageSetLSN(levelstate->pages[0], GistBuildLSN); - PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO); - smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO, - levelstate->pages[0], true); + + unbuffered_write(&state->ub_wstate, RelationGetSmgr(state->indexrel), + MAIN_FORKNUM, GIST_ROOT_BLKNO, levelstate->pages[0]); + if (RelationNeedsWAL(state->indexrel)) log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO, levelstate->pages[0], true); + unbuffered_finish(&state->ub_wstate, RelationGetSmgr(state->indexrel), + MAIN_FORKNUM); + pfree(levelstate->pages[0]); pfree(levelstate); - - /* - * When we WAL-logged index pages, we must nonetheless fsync index files. - * Since we're building outside shared buffers, a CHECKPOINT occurring - * during the build has no way to flush the previously written data to - * disk (indeed it won't know the index even exists). A crash later on - * would replay WAL from the checkpoint, therefore it wouldn't replay our - * earlier WAL entries. If we do not fsync those pages here, they might - * still not be on disk when the crash occurs. - */ - if (RelationNeedsWAL(state->indexrel)) - smgrimmedsync(RelationGetSmgr(state->indexrel), MAIN_FORKNUM); } /* @@ -645,26 +641,18 @@ gist_indexsortbuild_flush_ready_pages(GISTBuildState *state) if (state->ready_num_pages == 0) return; + /* Currently, the blocks must be buffered in order. 
*/ for (int i = 0; i < state->ready_num_pages; i++) { - Page page = state->ready_pages[i]; - BlockNumber blkno = state->ready_blknos[i]; - - /* Currently, the blocks must be buffered in order. */ - if (blkno != state->pages_written) + if (state->ready_blknos[i] != state->pages_written) elog(ERROR, "unexpected block number to flush GiST sorting build"); - - PageSetLSN(page, GistBuildLSN); - PageSetChecksumInplace(page, blkno); - smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, blkno, page, - true); - state->pages_written++; } - if (RelationNeedsWAL(state->indexrel)) - log_newpages(&state->indexrel->rd_node, MAIN_FORKNUM, state->ready_num_pages, - state->ready_blknos, state->ready_pages, true); + unbuffered_extend_range(&state->ub_wstate, + RelationGetSmgr(state->indexrel), MAIN_FORKNUM, + state->ready_num_pages, state->ready_blknos, state->ready_pages, + false, GistBuildLSN); for (int i = 0; i < state->ready_num_pages; i++) pfree(state->ready_pages[i]); diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 2a53826736..c8fa8bb27c 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -81,15 +81,15 @@ * in the whole table. Note that if we do fail halfway through a CLUSTER, * the old table is still valid, so failure is not catastrophic. * - * We can't use the normal heap_insert function to insert into the new - * heap, because heap_insert overwrites the visibility information. - * We use a special-purpose raw_heap_insert function instead, which - * is optimized for bulk inserting a lot of tuples, knowing that we have - * exclusive access to the heap. raw_heap_insert builds new pages in - * local storage. When a page is full, or at the end of the process, - * we insert it to WAL as a single record and then write it to disk - * directly through smgr. Note, however, that any data sent to the new - * heap's TOAST table will go through the normal bufmgr. + * We can't use the normal heap_insert function to insert into the new heap, + * because heap_insert overwrites the visibility information. We use a + * special-purpose raw_heap_insert function instead, which is optimized for + * bulk inserting a lot of tuples, knowing that we have exclusive access to the + * heap. raw_heap_insert builds new pages in local storage. When a page is + * full, or at the end of the process, we insert it to WAL as a single record + * and then write it to disk directly through directmgr. Note, however, that + * any data sent to the new heap's TOAST table will go through the normal + * bufmgr. 
* * * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group @@ -119,9 +119,9 @@ #include "replication/logical.h" #include "replication/slot.h" #include "storage/bufmgr.h" +#include "storage/directmgr.h" #include "storage/fd.h" #include "storage/procarray.h" -#include "storage/smgr.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -152,6 +152,7 @@ typedef struct RewriteStateData HTAB *rs_old_new_tid_map; /* unmatched B tuples */ HTAB *rs_logical_mappings; /* logical remapping files */ uint32 rs_num_rewrite_mappings; /* # in memory mappings */ + UnBufferedWriteState rs_unbuffered_wstate; } RewriteStateData; /* @@ -265,6 +266,10 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm state->rs_cutoff_multi = cutoff_multi; state->rs_cxt = rw_cxt; + unbuffered_prep(&state->rs_unbuffered_wstate, + RelationNeedsWAL(state->rs_new_rel)); + + /* Initialize hash tables used to track update chains */ hash_ctl.keysize = sizeof(TidHashKey); hash_ctl.entrysize = sizeof(UnresolvedTupData); @@ -317,28 +322,13 @@ end_heap_rewrite(RewriteState state) /* Write the last page, if any */ if (state->rs_buffer_valid) { - if (RelationNeedsWAL(state->rs_new_rel)) - log_newpage(&state->rs_new_rel->rd_node, - MAIN_FORKNUM, - state->rs_blockno, - state->rs_buffer, - true); - - PageSetChecksumInplace(state->rs_buffer, state->rs_blockno); - - smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM, - state->rs_blockno, (char *) state->rs_buffer, true); + unbuffered_extend(&state->rs_unbuffered_wstate, + RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM, + state->rs_blockno, state->rs_buffer, false); } - /* - * When we WAL-logged rel pages, we must nonetheless fsync them. The - * reason is the same as in storage.c's RelationCopyStorage(): we're - * writing data that's not in shared buffers, and so a CHECKPOINT - * occurring during the rewriteheap operation won't have fsync'd data we - * wrote before the checkpoint. - */ - if (RelationNeedsWAL(state->rs_new_rel)) - smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM); + unbuffered_finish(&state->rs_unbuffered_wstate, + RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM); logical_end_heap_rewrite(state); @@ -676,24 +666,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup) * contains a tuple. Hence, unlike RelationGetBufferForTuple(), * enforce saveFreeSpace unconditionally. */ - - /* XLOG stuff */ - if (RelationNeedsWAL(state->rs_new_rel)) - log_newpage(&state->rs_new_rel->rd_node, - MAIN_FORKNUM, - state->rs_blockno, - page, - true); - - /* - * Now write the page. We say skipFsync = true because there's no - * need for smgr to schedule an fsync for this write; we'll do it - * ourselves in end_heap_rewrite. 
- */ - PageSetChecksumInplace(page, state->rs_blockno); - - smgrextend(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM, - state->rs_blockno, (char *) page, true); + unbuffered_extend(&state->rs_unbuffered_wstate, + RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM, + state->rs_blockno, page, false); state->rs_blockno++; state->rs_buffer_valid = false; diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index c9b4964c1e..c7bf971917 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -30,10 +30,10 @@ #include "pgstat.h" #include "postmaster/autovacuum.h" #include "storage/condition_variable.h" +#include "storage/directmgr.h" #include "storage/indexfsm.h" #include "storage/ipc.h" #include "storage/lmgr.h" -#include "storage/smgr.h" #include "utils/builtins.h" #include "utils/index_selfuncs.h" #include "utils/memutils.h" @@ -152,6 +152,16 @@ void btbuildempty(Relation index) { Page metapage; + UnBufferedWriteState wstate; + + /* + * Though this is an unlogged relation, pass do_wal=true since the init + * fork of an unlogged index must be wal-logged and fsync'd. This currently + * has no effect, as unbuffered_write() does not do log_newpage() + * internally. However, were this to be replaced with unbuffered_extend(), + * do_wal must be true to ensure the data is logged and fsync'd. + */ + unbuffered_prep(&wstate, true); /* Construct metapage. */ metapage = (Page) palloc(BLCKSZ); @@ -164,18 +174,12 @@ btbuildempty(Relation index) * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need * this even when wal_level=minimal. */ - PageSetChecksumInplace(metapage, BTREE_METAPAGE); - smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE, - (char *) metapage, true); + unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM, + BTREE_METAPAGE, metapage); log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM, BTREE_METAPAGE, metapage, true); - /* - * An immediate sync is required even if we xlog'd the page, because the - * write did not go through shared_buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM); + unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM); } /* @@ -959,7 +963,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, * delete some deletable tuples. Hence, we must repeatedly check the * relation length. We must acquire the relation-extension lock while * doing so to avoid a race condition: if someone else is extending the - * relation, there is a window where bufmgr/smgr have created a new + * relation, there is a window where bufmgr/directmgr have created a new * all-zero page but it hasn't yet been write-locked by _bt_getbuf(). If * we manage to scan such a page here, we'll improperly assume it can be * recycled. Taking the lock synchronizes things enough to prevent a diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 8a19de2f66..e280253127 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -23,13 +23,13 @@ * many upper pages if the keys are reasonable-size) without risking a lot of * cascading splits during early insertions. 
* - * Formerly the index pages being built were kept in shared buffers, but - * that is of no value (since other backends have no interest in them yet) - * and it created locking problems for CHECKPOINT, because the upper-level - * pages were held exclusive-locked for long periods. Now we just build - * the pages in local memory and smgrwrite or smgrextend them as we finish - * them. They will need to be re-read into shared buffers on first use after - * the build finishes. + * Formerly the index pages being built were kept in shared buffers, but that + * is of no value (since other backends have no interest in them yet) and it + * created locking problems for CHECKPOINT, because the upper-level pages were + * held exclusive-locked for long periods. Now we just build the pages in + * local memory and write or extend them with directmgr as we finish them. + * They will need to be re-read into shared buffers on first use after the + * build finishes. * * This code isn't concerned about the FSM at all. The caller is responsible * for initializing that. @@ -57,7 +57,7 @@ #include "executor/instrument.h" #include "miscadmin.h" #include "pgstat.h" -#include "storage/smgr.h" +#include "storage/directmgr.h" #include "tcop/tcopprot.h" /* pgrminclude ignore */ #include "utils/rel.h" #include "utils/sortsupport.h" @@ -256,6 +256,7 @@ typedef struct BTWriteState BlockNumber btws_pages_alloced; /* # pages allocated */ BlockNumber btws_pages_written; /* # pages written out */ Page btws_zeropage; /* workspace for filling zeroes */ + UnBufferedWriteState ub_wstate; } BTWriteState; @@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level) static void _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno) { - /* XLOG stuff */ - if (wstate->btws_use_wal) - { - /* We use the XLOG_FPI record type for this */ - log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true); - } - /* * If we have to write pages nonsequentially, fill in the space with * zeroes until we come back and overwrite. This is not logically @@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno) { if (!wstate->btws_zeropage) wstate->btws_zeropage = (Page) palloc0(BLCKSZ); - /* don't set checksum for all-zero page */ - smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, - wstate->btws_pages_written++, - (char *) wstate->btws_zeropage, - true); + + unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index), + MAIN_FORKNUM, wstate->btws_pages_written++, + wstate->btws_zeropage, true); } - PageSetChecksumInplace(page, blkno); - /* - * Now write the page. There's no need for smgr to schedule an fsync for - * this write; we'll do it ourselves before ending the build. - */ + /* Now write the page. Either we are extending the file... */ if (blkno == wstate->btws_pages_written) { - /* extending the file... */ - smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, - (char *) page, true); + unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index), + MAIN_FORKNUM, blkno, page, false); + wstate->btws_pages_written++; } + + /* or we are overwriting a block we zero-filled before. 
*/ else { - /* overwriting a block we zero-filled before */ - smgrwrite(RelationGetSmgr(wstate->index), MAIN_FORKNUM, blkno, - (char *) page, true); - } + unbuffered_write(&wstate->ub_wstate, RelationGetSmgr(wstate->index), + MAIN_FORKNUM, blkno, page); + + /* We use the XLOG_FPI record type for this */ + if (wstate->btws_use_wal) + log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true); + } pfree(page); } @@ -1195,6 +1189,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) int64 tuples_done = 0; bool deduplicate; + unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal); + deduplicate = wstate->inskey->allequalimage && !btspool->isunique && BTGetDeduplicateItems(wstate->index); @@ -1421,17 +1417,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) /* Close down final pages and write the metapage */ _bt_uppershutdown(wstate, state); - /* - * When we WAL-logged index pages, we must nonetheless fsync index files. - * Since we're building outside shared buffers, a CHECKPOINT occurring - * during the build has no way to flush the previously written data to - * disk (indeed it won't know the index even exists). A crash later on - * would replay WAL from the checkpoint, therefore it wouldn't replay our - * earlier WAL entries. If we do not fsync those pages here, they might - * still not be on disk when the crash occurs. - */ - if (wstate->btws_use_wal) - smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM); + unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index), + MAIN_FORKNUM); } /* diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c index bfb74049d0..318dbee823 100644 --- a/src/backend/access/spgist/spginsert.c +++ b/src/backend/access/spgist/spginsert.c @@ -25,7 +25,7 @@ #include "catalog/index.h" #include "miscadmin.h" #include "storage/bufmgr.h" -#include "storage/smgr.h" +#include "storage/directmgr.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -156,48 +156,43 @@ void spgbuildempty(Relation index) { Page page; + UnBufferedWriteState wstate; + + /* + * Though this is an unlogged relation, pass do_wal=true since the init + * fork of an unlogged index must be wal-logged and fsync'd. This currently + * has no effect, as unbuffered_write() does not do log_newpage() + * internally. However, were this to be replaced with unbuffered_extend(), + * do_wal must be true to ensure the data is logged and fsync'd. + */ + unbuffered_prep(&wstate, true); /* Construct metapage. */ page = (Page) palloc(BLCKSZ); SpGistInitMetapage(page); - /* - * Write the page and log it unconditionally. This is important - * particularly for indexes created on tablespaces and databases whose - * creation happened after the last redo pointer as recovery removes any - * of their existing content when the corresponding create records are - * replayed. - */ - PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO); - smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, - (char *) page, true); + unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM, + SPGIST_METAPAGE_BLKNO, page); log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO, page, true); /* Likewise for the root page. 
*/ SpGistInitPage(page, SPGIST_LEAF); - PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO); - smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO, - (char *) page, true); + unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM, + SPGIST_ROOT_BLKNO, page); log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM, SPGIST_ROOT_BLKNO, page, true); /* Likewise for the null-tuples root page. */ SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS); - PageSetChecksumInplace(page, SPGIST_NULL_BLKNO); - smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO, - (char *) page, true); + unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM, + SPGIST_NULL_BLKNO, page); log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM, SPGIST_NULL_BLKNO, page, true); - /* - * An immediate sync is required even if we xlog'd the pages, because the - * writes did not go through shared buffers and therefore a concurrent - * checkpoint may have moved the redo pointer past our xlog record. - */ - smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM); + unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM); } /* diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 9b8075536a..0b211895c1 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -28,6 +28,7 @@ #include "catalog/storage.h" #include "catalog/storage_xlog.h" #include "miscadmin.h" +#include "storage/directmgr.h" #include "storage/freespace.h" #include "storage/smgr.h" #include "utils/hsearch.h" @@ -420,6 +421,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, bool copying_initfork; BlockNumber nblocks; BlockNumber blkno; + UnBufferedWriteState wstate; + page = (Page) buf.data; @@ -440,6 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, use_wal = XLogIsNeeded() && (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); + unbuffered_prep(&wstate, use_wal); + nblocks = smgrnblocks(src, forkNum); for (blkno = 0; blkno < nblocks; blkno++) @@ -474,30 +479,10 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, * page this is, so we have to log the full page including any unused * space. */ - if (use_wal) - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false); - - PageSetChecksumInplace(page, blkno); - - /* - * Now write the page. We say skipFsync = true because there's no - * need for smgr to schedule an fsync for this write; we'll do it - * ourselves below. - */ - smgrextend(dst, forkNum, blkno, buf.data, true); + unbuffered_extend(&wstate, dst, forkNum, blkno, page, false); } - /* - * When we WAL-logged rel pages, we must nonetheless fsync them. The - * reason is that since we're copying outside shared buffers, a CHECKPOINT - * occurring during the copy has no way to flush the previously written - * data to disk (indeed it won't know the new rel even exists). A crash - * later on would replay WAL from the checkpoint, therefore it wouldn't - * replay our earlier WAL entries. If we do not fsync those pages here, - * they might still not be on disk when the crash occurs. - */ - if (use_wal || copying_initfork) - smgrimmedsync(dst, forkNum); + unbuffered_finish(&wstate, dst, forkNum); } /* diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile index 8376cdfca2..501fae5f9d 100644 --- a/src/backend/storage/Makefile +++ b/src/backend/storage/Makefile @@ -8,6 +8,6 @@ subdir = src/backend/storage top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global -SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync +SUBDIRS = buffer direct file freespace ipc large_object lmgr page smgr sync include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/direct/Makefile b/src/backend/storage/direct/Makefile new file mode 100644 index 0000000000..d82bbed48c --- /dev/null +++ b/src/backend/storage/direct/Makefile @@ -0,0 +1,17 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for storage/direct +# +# IDENTIFICATION +# src/backend/storage/direct/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/storage/direct +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = directmgr.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c new file mode 100644 index 0000000000..42c37daa7a --- /dev/null +++ b/src/backend/storage/direct/directmgr.c @@ -0,0 +1,98 @@ +/*------------------------------------------------------------------------- + * + * directmgr.c + * routines for managing unbuffered IO + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/backend/storage/direct/directmgr.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + + +#include "access/xlogdefs.h" +#include "access/xloginsert.h" +#include "storage/directmgr.h" + +void +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal) +{ + wstate->do_wal = do_wal; +} + +void +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation + smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool + empty) +{ + /* + * Don't checksum empty pages + */ + if (!empty) + PageSetChecksumInplace(page, blocknum); + + smgrextend(smgrrel, forknum, blocknum, (char *) page, true); + + /* + * Don't WAL-log empty pages + */ + if (!empty && wstate->do_wal) + log_newpage(&(smgrrel)->smgr_rnode.node, forknum, + blocknum, page, true); +} + +void +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel, + ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages, + bool empty, XLogRecPtr custom_lsn) +{ + for (int i = 0; i < num_pages; i++) + { + Page page = pages[i]; + BlockNumber blkno = blocknums[i]; + + if (!XLogRecPtrIsInvalid(custom_lsn)) + PageSetLSN(page, custom_lsn); + + if (!empty) + PageSetChecksumInplace(page, blkno); + + smgrextend(smgrrel, forknum, blkno, (char *) page, true); + } + + if (!empty && wstate->do_wal) + log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages, + blocknums, pages, true); +} + +void +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber + forknum, BlockNumber blocknum, Page page) +{ + PageSetChecksumInplace(page, blocknum); + + smgrwrite(smgrrel, forknum, blocknum, (char *) page, true); +} + +/* + * When writing data outside shared buffers, a concurrent CHECKPOINT can move + * the redo pointer past our WAL entries and won't flush our data to disk. If + * the database crashes before the data makes it to disk, our WAL won't be + * replayed and the data will be lost. + * Thus, if a CHECKPOINT begins between unbuffered_prep() and + * unbuffered_finish(), the backend must fsync the data itself. 
+ */ +void +unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel, + ForkNumber forknum) +{ + if (!wstate->do_wal) + return; + + smgrimmedsync(smgrrel, forknum); +} diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h new file mode 100644 index 0000000000..db5e3b1cac --- /dev/null +++ b/src/include/storage/directmgr.h @@ -0,0 +1,53 @@ +/*------------------------------------------------------------------------- + * + * directmgr.h + * POSTGRES unbuffered IO manager definitions. + * + * + * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/directmgr.h + * + *------------------------------------------------------------------------- + */ +#ifndef DIRECTMGR_H +#define DIRECTMGR_H + +#include "common/relpath.h" +#include "storage/block.h" +#include "storage/bufpage.h" +#include "storage/smgr.h" + +typedef struct UnBufferedWriteState +{ + /* + * When writing WAL-logged relation data outside of shared buffers, there + * is a risk of a concurrent CHECKPOINT moving the redo pointer past the + * data's associated WAL entries. To avoid this, callers in this situation + * must fsync the pages they have written themselves. This is necessary + * only if the relation is WAL-logged or in special cases such as the init + * fork of an unlogged index. + */ + bool do_wal; +} UnBufferedWriteState; +/* + * prototypes for functions in directmgr.c + */ +extern void +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal); +extern void +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel, + ForkNumber forknum, BlockNumber blocknum, Page page, bool empty); +extern void +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel, + ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages, + bool empty, XLogRecPtr custom_lsn); +extern void +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber + forknum, BlockNumber blocknum, Page page); +extern void +unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel, + ForkNumber forknum); + +#endif /* DIRECTMGR_H */ -- 2.30.2
From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <melanieplage...@gmail.com> Date: Fri, 4 Mar 2022 15:52:45 -0500 Subject: [PATCH v6 4/4] Use shared buffers when possible for index build When there are not too many tuples, building the index in shared buffers makes sense. It allows the buffer manager to handle how best to do the IO. --- src/backend/access/nbtree/nbtree.c | 32 ++-- src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++------- 2 files changed, 223 insertions(+), 82 deletions(-) diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index fc5cce4603..d3982b9388 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -152,34 +152,24 @@ void btbuildempty(Relation index) { Page metapage; - UnBufferedWriteState wstate; + Buffer metabuf; /* - * Though this is an unlogged relation, pass do_wal=true since the init - * fork of an unlogged index must be wal-logged and fsync'd. This currently - * has no effect, as unbuffered_write() does not do log_newpage() - * internally. However, were this to be replaced with unbuffered_extend(), - * do_wal must be true to ensure the data is logged and fsync'd. + * Allocate a buffer for metapage and initialize metapage. */ - unbuffered_prep(&wstate, true, true); - - /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); + metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK, + NULL); + metapage = BufferGetPage(metabuf); _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false)); /* - * Write the page and log it. It might seem that an immediate sync would - * be sufficient to guarantee that the file exists on disk, but recovery - * itself might remove it while replaying, for example, an - * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need - * this even when wal_level=minimal. + * Mark metapage buffer as dirty and XLOG it */ - unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM, - BTREE_METAPAGE, metapage); - log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM, - BTREE_METAPAGE, metapage, true); - - unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM); + START_CRIT_SECTION(); + MarkBufferDirty(metabuf); + log_newpage_buffer(metabuf, true); + END_CRIT_SECTION(); + _bt_relbuf(index, metabuf); } /* diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index f1b9e2e24e..35ded6e4a1 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -23,13 +23,15 @@ * many upper pages if the keys are reasonable-size) without risking a lot of * cascading splits during early insertions. * - * Formerly the index pages being built were kept in shared buffers, but that - * is of no value (since other backends have no interest in them yet) and it - * created locking problems for CHECKPOINT, because the upper-level pages were - * held exclusive-locked for long periods. Now we just build the pages in - * local memory and write or extend them with directmgr as we finish them. - * They will need to be re-read into shared buffers on first use after the - * build finishes. + * If indexing little enough data, it can be built efficiently in shared + * buffers, leaving checkpointer to deal with fsync'ing the data at its + * convenience. 
However, if indexing many pages, building the index in shared + * buffers could delay CHECKPOINTs--since the upper-level pages are + * exclusive-locked for long periods. In that case, build the pages in local + * memory and write or extend them with directmgr as they are finished. If a + * CHECKPOINT has begun since the build started, directmgr will fsync the + * relation file itself after finishing the build. The index will then need to + * be re-read into shared buffers on first use after the build finishes. * * This code isn't concerned about the FSM at all. The caller is responsible * for initializing that. @@ -236,6 +238,7 @@ typedef struct BTPageState { Page btps_page; /* workspace for page building */ BlockNumber btps_blkno; /* block # to write this page at */ + Buffer btps_buf; /* buffer to write this page to */ IndexTuple btps_lowkey; /* page's strict lower bound pivot tuple */ OffsetNumber btps_lastoff; /* last item offset loaded */ Size btps_lastextra; /* last item's extra posting list space */ @@ -253,10 +256,26 @@ typedef struct BTWriteState Relation index; BTScanInsert inskey; /* generic insertion scankey */ bool btws_use_wal; /* dump pages to WAL? */ - BlockNumber btws_pages_alloced; /* # pages allocated */ - BlockNumber btws_pages_written; /* # pages written out */ + BlockNumber btws_pages_alloced; /* # pages allocated for index builds outside SB */ + BlockNumber btws_pages_written; /* # pages written out for index builds outside SB */ Page btws_zeropage; /* workspace for filling zeroes */ UnBufferedWriteState ub_wstate; + /* + * Allocate a new btree page. This does not initialize the page. + */ + Page (*_bt_bl_alloc_page) (struct BTWriteState *wstate, BlockNumber + *blockno, Buffer *buf); + /* + * Emit a completed btree page, and release the working storage. 
+ */ + void (*_bt_blwritepage) (struct BTWriteState *wstate, Page page, + BlockNumber blkno, Buffer buf); + + void (*_bt_bl_unbuffered_prep) (UnBufferedWriteState *wstate, bool do_wal, + bool do_optimization); + + void (*_bt_bl_unbuffered_finish) (UnBufferedWriteState *wstate, + SMgrRelation smgrrel, ForkNumber forknum); } BTWriteState; @@ -265,10 +284,22 @@ static double _bt_spools_heapscan(Relation heap, Relation index, static void _bt_spooldestroy(BTSpool *btspool); static void _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull); -static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2); +static void _bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2); static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values, bool *isnull, bool tupleIsAlive, void *state); -static Page _bt_blnewpage(uint32 level); + +static Page +_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf); +static Page +_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf); + +static Page _bt_blnewpage(uint32 level, Page page); + +static void +_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf); +static void +_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf); + static BTPageState *_bt_pagestate(BTWriteState *wstate, uint32 level); static void _bt_slideleft(Page rightmostpage); static void _bt_sortaddtup(Page page, Size itemsize, @@ -279,9 +310,10 @@ static void _bt_buildadd(BTWriteState *wstate, BTPageState *state, static void _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state, BTDedupState dstate); -static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state); static void _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2); +static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer + metabuf, Page metapage); static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request); static void _bt_end_parallel(BTLeader *btleader); @@ -294,6 +326,21 @@ static void _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2, Sharedsort *sharedsort2, int sortmem, bool progress); +#define BT_BUILD_SB_THRESHOLD 1024 + +static const BTWriteState wstate_shared = { + ._bt_bl_alloc_page = _bt_bl_alloc_page_shared, + ._bt_blwritepage = _bt_blwritepage_shared, + ._bt_bl_unbuffered_prep = NULL, + ._bt_bl_unbuffered_finish = NULL, +}; + +static const BTWriteState wstate_direct = { + ._bt_bl_alloc_page = _bt_bl_alloc_page_direct, + ._bt_blwritepage = _bt_blwritepage_direct, + ._bt_bl_unbuffered_prep = unbuffered_prep, + ._bt_bl_unbuffered_finish = unbuffered_finish, +}; /* * btbuild() -- build a new btree index. @@ -303,6 +350,7 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo) { IndexBuildResult *result; BTBuildState buildstate; + BTWriteState writestate; double reltuples; #ifdef BTREE_BUILD_STATS @@ -333,8 +381,12 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo) * Finish the build by (1) completing the sort of the spool file, (2) * inserting the sorted tuples into btree pages and (3) building the upper * levels. Finally, it may also be necessary to end use of parallelism. + * + * Don't use shared buffers if the number of tuples is too large. */ - _bt_leafbuild(buildstate.spool, buildstate.spool2); + writestate = reltuples < BT_BUILD_SB_THRESHOLD ? 
wstate_shared : wstate_direct; + + _bt_leafbuild(&writestate, buildstate.spool, buildstate.spool2); _bt_spooldestroy(buildstate.spool); if (buildstate.spool2) _bt_spooldestroy(buildstate.spool2); @@ -542,10 +594,8 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull) * create an entire btree. */ static void -_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2) +_bt_leafbuild(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) { - BTWriteState wstate; - #ifdef BTREE_BUILD_STATS if (log_btree_build_stats) { @@ -565,21 +615,38 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2) tuplesort_performsort(btspool2->sortstate); } - wstate.heap = btspool->heap; - wstate.index = btspool->index; - wstate.inskey = _bt_mkscankey(wstate.index, NULL); - /* _bt_mkscankey() won't set allequalimage without metapage */ - wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true); - wstate.btws_use_wal = RelationNeedsWAL(wstate.index); + wstate->heap = btspool->heap; + wstate->index = btspool->index; + wstate->inskey = _bt_mkscankey(wstate->index, NULL); + /* _bt-mkscankey() won't set allequalimage without metapage */ + wstate->inskey->allequalimage = _bt_allequalimage(wstate->index, true); + wstate->btws_use_wal = RelationNeedsWAL(wstate->index); /* reserve the metapage */ - wstate.btws_pages_alloced = BTREE_METAPAGE + 1; - wstate.btws_pages_written = 0; - wstate.btws_zeropage = NULL; /* until needed */ + wstate->btws_pages_alloced = 0; + wstate->btws_pages_written = 0; + wstate->btws_zeropage = NULL; pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE, PROGRESS_BTREE_PHASE_LEAF_LOAD); - _bt_load(&wstate, btspool, btspool2); + + /* + * If not using shared buffers, for a WAL-logged relation, save the redo + * pointer location in case a checkpoint begins during the index build. + */ + if (wstate->_bt_bl_unbuffered_prep) + wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate, + wstate->btws_use_wal, wstate->btws_use_wal); + + _bt_load(wstate, btspool, btspool2); + + /* + * If not using shared buffers, for a WAL-logged relation, check if backend + * must fsync the page itself. + */ + if (wstate->_bt_bl_unbuffered_finish) + wstate->_bt_bl_unbuffered_finish(&wstate->ub_wstate, + RelationGetSmgr(wstate->index), MAIN_FORKNUM); } /* @@ -612,15 +679,15 @@ _bt_build_callback(Relation index, } /* - * allocate workspace for a new, clean btree page, not linked to any siblings. + * Set up workspace for a new, clean btree page, not linked to any siblings. + * Caller must allocate the passed in page. */ static Page -_bt_blnewpage(uint32 level) +_bt_blnewpage(uint32 level, Page page) { - Page page; BTPageOpaque opaque; - page = (Page) palloc(BLCKSZ); + Assert(page); /* Zero the page and set up standard page header info */ _bt_pageinit(page, BLCKSZ); @@ -638,11 +705,8 @@ _bt_blnewpage(uint32 level) return page; } -/* - * emit a completed btree page, and release the working storage. - */ static void -_bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno) +_bt_blwritepage_direct(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf) { /* * If we have to write pages nonsequentially, fill in the space with @@ -685,6 +749,61 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno) pfree(page); } +static void +_bt_blwritepage_shared(BTWriteState *wstate, Page page, BlockNumber blkno, Buffer buf) +{ + /* + * Indexes built in shared buffers need only to mark the buffer as dirty + * and XLOG it. 
+ */ + Assert(buf); + START_CRIT_SECTION(); + MarkBufferDirty(buf); + if (wstate->btws_use_wal) + log_newpage_buffer(buf, true); + END_CRIT_SECTION(); + _bt_relbuf(wstate->index, buf); +} + +static Page +_bt_bl_alloc_page_direct(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf) +{ + /* buf is only used when using shared buffers, so set it to InvalidBuffer */ + *buf = InvalidBuffer; + + /* + * Assign block number for the page. + * This will be used to link to sibling page(s) later and, if this is the + * initial page in the level, saved in the BTPageState + */ + *blockno = wstate->btws_pages_alloced++; + + /* now allocate and set up the new page */ + return palloc(BLCKSZ); +} + +static Page +_bt_bl_alloc_page_shared(BTWriteState *wstate, BlockNumber *blockno, Buffer *buf) +{ + /* + * Find a shared buffer for the page. Pass mode RBM_ZERO_AND_LOCK to get an + * exclusive lock on the buffer content. No lock on the relation as a whole + * is needed (as in LockRelationForExtension()) because the initial index + * build is not yet complete. + */ + *buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW, + RBM_ZERO_AND_LOCK, NULL); + + /* + * bufmgr will assign a block number for the new page. + * This will be used to link to sibling page(s) later and, if this is the + * initial page in the level, saved in the BTPageState + */ + *blockno = BufferGetBlockNumber(*buf); + + return BufferGetPage(*buf); +} + /* * allocate and initialize a new BTPageState. the returned structure * is suitable for immediate use by _bt_buildadd. @@ -692,13 +811,20 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno) static BTPageState * _bt_pagestate(BTWriteState *wstate, uint32 level) { + Buffer buf; + BlockNumber blockno; + BTPageState *state = (BTPageState *) palloc0(sizeof(BTPageState)); - /* create initial page for level */ - state->btps_page = _bt_blnewpage(level); + /* + * Allocate and initialize initial page for the level, and if using shared + * buffers, extend the relation and allocate a shared buffer for the block. 
+ */ + state->btps_page = _bt_blnewpage(level, wstate->_bt_bl_alloc_page(wstate, + &blockno, &buf)); - /* and assign it a page position */ - state->btps_blkno = wstate->btws_pages_alloced++; + state->btps_blkno = blockno; + state->btps_buf = buf; state->btps_lowkey = NULL; /* initialize lastoff so first item goes into P_FIRSTKEY */ @@ -833,6 +959,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup, { Page npage; BlockNumber nblkno; + Buffer nbuf; OffsetNumber last_off; Size last_truncextra; Size pgspc; @@ -847,6 +974,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup, npage = state->btps_page; nblkno = state->btps_blkno; + nbuf = state->btps_buf; last_off = state->btps_lastoff; last_truncextra = state->btps_lastextra; state->btps_lastextra = truncextra; @@ -903,15 +1031,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup, */ Page opage = npage; BlockNumber oblkno = nblkno; + Buffer obuf = nbuf; ItemId ii; ItemId hii; IndexTuple oitup; - /* Create new page of same level */ - npage = _bt_blnewpage(state->btps_level); - - /* and assign it a page position */ - nblkno = wstate->btws_pages_alloced++; + /* Create and initialize a new page of same level */ + npage = _bt_blnewpage(state->btps_level, + wstate->_bt_bl_alloc_page(wstate, &nblkno, &nbuf)); /* * We copy the last item on the page into the new page, and then @@ -1021,9 +1148,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup, /* * Write out the old page. We never need to touch it again, so we can - * free the opage workspace too. + * free the opage workspace too. obuf has been released and is no longer + * valid. */ - _bt_blwritepage(wstate, opage, oblkno); + wstate->_bt_blwritepage(wstate, opage, oblkno, obuf); + obuf = InvalidBuffer; + opage = NULL; /* * Reset last_off to point to new page @@ -1058,6 +1188,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup, state->btps_page = npage; state->btps_blkno = nblkno; + state->btps_buf = nbuf; state->btps_lastoff = last_off; } @@ -1103,12 +1234,12 @@ _bt_sort_dedup_finish_pending(BTWriteState *wstate, BTPageState *state, * Finish writing out the completed btree. */ static void -_bt_uppershutdown(BTWriteState *wstate, BTPageState *state) +_bt_uppershutdown(BTWriteState *wstate, BTPageState *state, Buffer metabuf, + Page metapage) { BTPageState *s; BlockNumber rootblkno = P_NONE; uint32 rootlevel = 0; - Page metapage; /* * Each iteration of this loop completes one more level of the tree. @@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state) * back one slot. Then we can dump out the page. */ _bt_slideleft(s->btps_page); - _bt_blwritepage(wstate, s->btps_page, s->btps_blkno); + wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf); + s->btps_buf = InvalidBuffer; s->btps_page = NULL; /* writepage freed the workspace */ } /* - * As the last step in the process, construct the metapage and make it + * As the last step in the process, initialize the metapage and make it * point to the new root (unless we had no data at all, in which case it's * set to point to "P_NONE"). This changes the index to the "valid" state * by filling in a valid magic number in the metapage. + * After this, metapage will have been freed or invalid and metabuf, if ever + * valid, will have been released. 
*/ - metapage = (Page) palloc(BLCKSZ); _bt_initmetapage(metapage, rootblkno, rootlevel, wstate->inskey->allequalimage); - _bt_blwritepage(wstate, metapage, BTREE_METAPAGE); + wstate->_bt_blwritepage(wstate, metapage, BTREE_METAPAGE, metabuf); + metabuf = InvalidBuffer; + metapage = NULL; } /* @@ -1177,6 +1312,10 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state) static void _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) { + Page metapage; + BlockNumber metablkno; + Buffer metabuf; + BTPageState *state = NULL; bool merge = (btspool2 != NULL); IndexTuple itup, @@ -1189,15 +1328,30 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) int64 tuples_done = 0; bool deduplicate; - /* - * The fsync optimization done by directmgr is only relevant if - * WAL-logging, so pass btws_use_wal for this parameter. - */ - unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal); - deduplicate = wstate->inskey->allequalimage && !btspool->isunique && BTGetDeduplicateItems(wstate->index); + /* + * Reserve block 0 for the metapage up front. + * + * When using the shared buffers API it is easier to allocate the buffer + * for block 0 first instead of trying skip block 0 and allocate it at the + * end of index build. + * + * When not using the shared buffers API, there is no harm in allocating + * the metapage first. When block 1 is written, the direct writepage + * function will zero-fill block 0. When writing out the metapage at the + * end of index build, it will overwrite that block 0. + * + * The metapage will be initialized and written out at the end of the index + * build when all of the information needed to do so is available. + * + * The block number will always be BTREE_METAPAGE, so the metablkno + * variable is unused and only created to avoid a special case in the + * direct alloc function. + */ + metapage = wstate->_bt_bl_alloc_page(wstate, &metablkno, &metabuf); + if (merge) { /* @@ -1419,10 +1573,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) } /* Close down final pages and write the metapage */ - _bt_uppershutdown(wstate, state); - - unbuffered_finish(&wstate->ub_wstate, RelationGetSmgr(wstate->index), - MAIN_FORKNUM); + _bt_uppershutdown(wstate, state, metabuf, metapage); } /* -- 2.30.2
From ac914bf0e227b3cb62918954a7d7567a3f09f3b7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Fri, 4 Mar 2022 14:14:21 -0500
Subject: [PATCH v6 2/4] Avoid immediate fsync for unbuffered IO

Data written to WAL-logged relations is durable once the WAL entries are
on permanent storage; however, the XLOG Redo pointer cannot be moved past
the associated WAL until the page data is safely on permanent storage. If
a crash were to occur before the data is fsync'd, the WAL wouldn't be
replayed during recovery, and the data would be lost.

This is not a problem with pages written in shared buffers because the
checkpointer will block until FlushBuffer() is complete for all buffers
that were dirtied before it began. Therefore it will not move the Redo
pointer past their associated WAL entries until it has fsync'd the data.

A backend writing data outside of shared buffers must ensure that the
data has reached permanent storage itself or that the Redo pointer has
not moved while it was writing the data.

In the common case, the backend should not have to do this fsync itself
and can instead request the checkpointer do it. To ensure this is safe,
the backend can save the XLOG Redo pointer location before doing the
write or extend. Then it can add an fsync request for the page to the
checkpointer's pending-ops table using the existing mechanism. After
doing the write or extend, if the Redo pointer has moved (meaning a
checkpoint has started since it saved it last), then the backend can
simply fsync the page itself. Otherwise, the checkpointer takes care of
fsync'ing the page the next time it processes the pending-ops table.

This commit adds the optimization option to the directmgr API but does
not add any users, so there is no behavior change.
---
 contrib/bloom/blinsert.c               |  2 +-
 src/backend/access/gist/gistbuild.c    |  2 +-
 src/backend/access/heap/rewriteheap.c  |  3 +--
 src/backend/access/nbtree/nbtree.c     |  2 +-
 src/backend/access/nbtree/nbtsort.c    |  2 +-
 src/backend/access/spgist/spginsert.c  |  2 +-
 src/backend/access/transam/xlog.c      | 13 +++++++++++++
 src/backend/catalog/storage.c          |  2 +-
 src/backend/storage/direct/directmgr.c | 14 +++++++++++++-
 src/include/access/xlog.h              |  1 +
 src/include/storage/directmgr.h        | 11 ++++++++++-
 11 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 7954a17e2d..fdde2bd88a 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -173,7 +173,7 @@ blbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index fc09938f80..8de19199a6 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -412,7 +412,7 @@ gist_indexsortbuild(GISTBuildState *state)
 	state->pages_written = 0;
 	state->ready_num_pages = 0;
 
-	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel));
+	unbuffered_prep(&state->ub_wstate, RelationNeedsWAL(state->indexrel), false);
 
 	/*
 	 * Write an empty page as a placeholder for the root page. It will be
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c8fa8bb27c..70b7a0f269 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -267,8 +267,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	state->rs_cxt = rw_cxt;
 
 	unbuffered_prep(&state->rs_unbuffered_wstate,
-			RelationNeedsWAL(state->rs_new_rel));
-
+			RelationNeedsWAL(state->rs_new_rel), false);
 
 	/* Initialize hash tables used to track update chains */
 	hash_ctl.keysize = sizeof(TidHashKey);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c7bf971917..6b78acefbe 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -161,7 +161,7 @@ btbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	metapage = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e280253127..d6d0d4b361 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1189,7 +1189,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	int64		tuples_done = 0;
 	bool		deduplicate;
 
-	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal);
+	unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 318dbee823..e30f3623f5 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -165,7 +165,7 @@ spgbuildempty(Relation index)
 	 * internally. However, were this to be replaced with unbuffered_extend(),
 	 * do_wal must be true to ensure the data is logged and fsync'd.
 	 */
-	unbuffered_prep(&wstate, true);
+	unbuffered_prep(&wstate, true, false);
 
 	/* Construct metapage. */
 	page = (Page) palloc(BLCKSZ);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d2bd7a357..db7b33ec67 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5971,6 +5971,19 @@ GetLastImportantRecPtr(void)
 	return res;
 }
 
+bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
+{
+	XLogRecPtr ptr;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ptr = XLogCtl->RedoRecPtr;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (RedoRecPtr < ptr)
+		RedoRecPtr = ptr;
+
+	return RedoRecPtr != comparator_ptr;
+}
+
 /*
  * Get the time and LSN of the last xlog segment switch
  */
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0b211895c1..2fd5a31ffd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	use_wal = XLogIsNeeded() &&
 		(relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
 
-	unbuffered_prep(&wstate, use_wal);
+	unbuffered_prep(&wstate, use_wal, false);
 
 	nblocks = smgrnblocks(src, forkNum);
 
diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
index 42c37daa7a..4c3a5a2f74 100644
--- a/src/backend/storage/direct/directmgr.c
+++ b/src/backend/storage/direct/directmgr.c
@@ -15,14 +15,23 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "access/xlogdefs.h"
 #include "access/xloginsert.h"
 #include "storage/directmgr.h"
 
 void
-unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
+		do_optimization)
 {
+	/*
+	 * There is no valid fsync optimization if no WAL is being written anyway
+	 */
+	Assert(!do_optimization || (do_optimization && do_wal));
+
 	wstate->do_wal = do_wal;
+	wstate->do_optimization = do_optimization;
+	wstate->redo = do_optimization ? GetRedoRecPtr() : InvalidXLogRecPtr;
 }
 
 void
@@ -94,5 +103,8 @@ unbuffered_finish(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 	if (!wstate->do_wal)
 		return;
 
+	if (wstate->do_optimization && !RedoRecPtrChanged(wstate->redo))
+		return;
+
 	smgrimmedsync(smgrrel, forknum);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4b45ac64db..71fe99a28d 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -241,6 +241,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(TimeLineID *insertTLI);
 extern TimeLineID GetWALInsertionTimeLine(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern bool RedoRecPtrChanged(XLogRecPtr comparator_ptr);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
diff --git a/src/include/storage/directmgr.h b/src/include/storage/directmgr.h
index db5e3b1cac..e5454a3296 100644
--- a/src/include/storage/directmgr.h
+++ b/src/include/storage/directmgr.h
@@ -14,6 +14,7 @@
 #ifndef DIRECTMGR_H
 #define DIRECTMGR_H
 
+#include "access/xlogdefs.h"
 #include "common/relpath.h"
 #include "storage/block.h"
 #include "storage/bufpage.h"
@@ -28,14 +29,22 @@ typedef struct UnBufferedWriteState
 	 * must fsync the pages they have written themselves. This is necessary
 	 * only if the relation is WAL-logged or in special cases such as the init
 	 * fork of an unlogged index.
+	 *
+	 * These callers can optionally use the following optimization: attempt to
+	 * use the sync request queue and fall back to fsync'ing the pages
+	 * themselves if the Redo pointer moves between the start and finish of
+	 * their write. In order to do this, they must set do_optimization to true
+	 * so that the redo pointer is saved before the write begins.
 	 */
 	bool		do_wal;
+	bool		do_optimization;
+	XLogRecPtr	redo;
 } UnBufferedWriteState;
 /*
  * prototypes for functions in directmgr.c
  */
 extern void
-unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);
+unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool do_optimization);
 extern void
 unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
 		ForkNumber forknum, BlockNumber blocknum, Page page, bool empty);
-- 
2.30.2
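Since this patch adds no users of do_optimization, here is roughly how a caller would be expected to opt in once a later patch wires it up. This is a sketch only: the function, the smgrrel/blkno/page arguments, and the loop structure are placeholders, and the queuing of the fsync request by the write path is the intended behavior described in the commit message rather than something implemented in this commit.

#include "postgres.h"

#include "storage/directmgr.h"
#include "storage/smgr.h"

/* Hypothetical caller of the directmgr fsync optimization. */
static void
write_pages_unbuffered(SMgrRelation smgrrel, Page *pages, BlockNumber nblocks)
{
	UnBufferedWriteState wstate;
	BlockNumber blkno;

	/* Save the current Redo pointer and request the optimization. */
	unbuffered_prep(&wstate, true, true);

	/*
	 * Extend the relation page by page. With the optimization, the write
	 * path is meant to register fsync requests in the checkpointer's
	 * pending-ops table instead of relying on an immediate sync at the end.
	 */
	for (blkno = 0; blkno < nblocks; blkno++)
		unbuffered_extend(&wstate, smgrrel, MAIN_FORKNUM, blkno,
						  pages[blkno], false);

	/*
	 * If the Redo pointer has not moved since unbuffered_prep(), the queued
	 * requests are sufficient and smgrimmedsync() is skipped; otherwise a
	 * checkpoint started in the meantime and the backend fsyncs here itself.
	 */
	unbuffered_finish(&wstate, smgrrel, MAIN_FORKNUM);
}

The point of the shape above is that the caller never has to reason about the checkpointer directly: unbuffered_prep() records the Redo pointer and unbuffered_finish() decides, via RedoRecPtrChanged(), whether the backend must fall back to fsync'ing the relation itself.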