At Tue, 24 Dec 2019 16:35:35 +0900 (JST), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in
> I rebased the patch and changed the default value for the GUC variable
> wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
> guc.c. 4096kB was chosen as a nice round number close to 500 pages *
> 8kB = 4000kB.
The value in the doc was not correct.  Fixed only the value, from 3192 to 4096kB.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
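For readers following along, here is a minimal illustration of the code path the
patch targets, assuming a server running with wal_level = minimal (and
max_wal_senders = 0) and the patch applied; the table name and the data file
path below are made up for the example:

    -- wal_level = minimal must already be set (requires a restart);
    -- wal_skip_threshold itself can be changed on the fly.
    ALTER SYSTEM SET wal_skip_threshold = '4096kB';   -- the default proposed here
    SELECT pg_reload_conf();

    BEGIN;
    CREATE TABLE bulk_target (id int, payload text);  -- new relfilenode in this transaction
    COPY bulk_target FROM '/tmp/bulk_data.csv' WITH (FORMAT csv);  -- writes no WAL for the data
    COMMIT;  -- data below wal_skip_threshold is emitted as WAL here,
             -- larger data is fsynced instead

With the patch, the COPY skips WAL regardless of archiving settings because the
table's relfilenode is new in the transaction; at commit, smgrDoPendingSyncs()
decides between log_newpage_range() and an fsync based on the total file size.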
>From 2f184c140ab442ee29103be830b3389b71e8e609 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota....@gmail.com> Date: Thu, 21 Nov 2019 15:28:06 +0900 Subject: [PATCH v29] Rework WAL-skipping optimization While wal_level=minimal we omit WAL-logging for certain some operations on relfilenodes that are created in the current transaction. The files are fsynced at commit. The machinery accelerates bulk-insertion operations but it fails in certain sequence of operations and a crash just after commit may leave broken table files. This patch overhauls the machinery so that WAL-loggings on all operations are omitted for such relfilenodes. This patch also introduces a new feature that small files are emitted as a WAL record instead of syncing. The new GUC variable wal_skip_threshold controls the threshold. --- doc/src/sgml/config.sgml | 43 ++-- doc/src/sgml/perform.sgml | 47 +---- src/backend/access/gist/gistutil.c | 31 ++- src/backend/access/gist/gistxlog.c | 21 ++ src/backend/access/heap/heapam.c | 45 +--- src/backend/access/heap/heapam_handler.c | 22 +- src/backend/access/heap/rewriteheap.c | 21 +- src/backend/access/nbtree/nbtsort.c | 41 +--- src/backend/access/rmgrdesc/gistdesc.c | 5 + src/backend/access/transam/README | 47 ++++- src/backend/access/transam/xact.c | 15 ++ src/backend/access/transam/xloginsert.c | 10 +- src/backend/access/transam/xlogutils.c | 17 +- src/backend/catalog/heap.c | 4 + src/backend/catalog/storage.c | 257 +++++++++++++++++++++-- src/backend/commands/cluster.c | 31 +++ src/backend/commands/copy.c | 58 +---- src/backend/commands/createas.c | 11 +- src/backend/commands/matview.c | 12 +- src/backend/commands/tablecmds.c | 11 +- src/backend/storage/buffer/bufmgr.c | 123 ++++++++++- src/backend/storage/smgr/md.c | 35 ++- src/backend/storage/smgr/smgr.c | 37 ++++ src/backend/utils/cache/relcache.c | 122 ++++++++--- src/backend/utils/misc/guc.c | 13 ++ src/bin/psql/input.c | 1 + src/include/access/gist_private.h | 2 + src/include/access/gistxlog.h | 1 + src/include/access/heapam.h | 3 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 18 +- src/include/catalog/storage.h | 5 + src/include/storage/bufmgr.h | 4 + src/include/storage/smgr.h | 1 + src/include/utils/rel.h | 57 +++-- src/include/utils/relcache.h | 8 +- src/test/regress/pg_regress.c | 2 + 37 files changed, 839 insertions(+), 344 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5d1c90282f..d893864c40 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2481,21 +2481,14 @@ include_dir 'conf.d' levels. This parameter can only be set at server start. </para> <para> - In <literal>minimal</literal> level, WAL-logging of some bulk - operations can be safely skipped, which can make those - operations much faster (see <xref linkend="populate-pitr"/>). - Operations in which this optimization can be applied include: - <simplelist> - <member><command>CREATE TABLE AS</command></member> - <member><command>CREATE INDEX</command></member> - <member><command>CLUSTER</command></member> - <member><command>COPY</command> into tables that were created or truncated in the same - transaction</member> - </simplelist> - But minimal WAL does not contain enough information to reconstruct the - data from a base backup and the WAL logs, so <literal>replica</literal> or - higher must be used to enable WAL archiving - (<xref linkend="guc-archive-mode"/>) and streaming replication. 
+ In <literal>minimal</literal> level, no information is logged for + tables or indexes for the remainder of a transaction that creates or + truncates them. This can make bulk operations much faster (see + <xref linkend="populate-pitr"/>). But minimal WAL does not contain + enough information to reconstruct the data from a base backup and the + WAL logs, so <literal>replica</literal> or higher must be used to + enable WAL archiving (<xref linkend="guc-archive-mode"/>) and + streaming replication. </para> <para> In <literal>logical</literal> level, the same information is logged as @@ -2887,6 +2880,26 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-wal-skip-threshold" xreflabel="wal_skip_threshold"> + <term><varname>wal_skip_threshold</varname> (<type>integer</type>) + <indexterm> + <primary><varname>wal_skip_threshold</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + When <varname>wal_level</varname> is <literal>minimal</literal> and a + transaction commits after creating or rewriting a permanent table, + materialized view, or index, this setting determines how to persist + the new data. If the data is smaller than this setting, write it to + the WAL log; otherwise, use an fsync of the data file. Depending on + the properties of your storage, raising or lowering this value might + help if such commits are slowing concurrent transactions. The default + is 4096 kilobytes (<literal>4096kB</literal>). + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-commit-delay" xreflabel="commit_delay"> <term><varname>commit_delay</varname> (<type>integer</type>) <indexterm> diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml index 0f61b0995d..12fda690fa 100644 --- a/doc/src/sgml/perform.sgml +++ b/doc/src/sgml/perform.sgml @@ -1606,8 +1606,8 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse; needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway. However, this consideration only applies when - <xref linkend="guc-wal-level"/> is <literal>minimal</literal> for - non-partitioned tables as all commands must write WAL otherwise. + <xref linkend="guc-wal-level"/> is <literal>minimal</literal> + as all commands must write WAL otherwise. </para> </sect2> @@ -1707,42 +1707,13 @@ SELECT * FROM x, y, a, b, c WHERE something AND somethingelse; </para> <para> - Aside from avoiding the time for the archiver or WAL sender to - process the WAL data, - doing this will actually make certain commands faster, because they - are designed not to write WAL at all if <varname>wal_level</varname> - is <literal>minimal</literal>. (They can guarantee crash safety more cheaply - by doing an <function>fsync</function> at the end than by writing WAL.) 
- This applies to the following commands: - <itemizedlist> - <listitem> - <para> - <command>CREATE TABLE AS SELECT</command> - </para> - </listitem> - <listitem> - <para> - <command>CREATE INDEX</command> (and variants such as - <command>ALTER TABLE ADD PRIMARY KEY</command>) - </para> - </listitem> - <listitem> - <para> - <command>ALTER TABLE SET TABLESPACE</command> - </para> - </listitem> - <listitem> - <para> - <command>CLUSTER</command> - </para> - </listitem> - <listitem> - <para> - <command>COPY FROM</command>, when the target table has been - created or truncated earlier in the same transaction - </para> - </listitem> - </itemizedlist> + Aside from avoiding the time for the archiver or WAL sender to process the + WAL data, doing this will actually make certain commands faster, because + they do not to write WAL at all if <varname>wal_level</varname> + is <literal>minimal</literal> and the current subtransaction (or top-level + transaction) created or truncated the table or index they change. (They + can guarantee crash safety more cheaply by doing + an <function>fsync</function> at the end than by writing WAL.) </para> </sect2> diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c index 553a6d67b1..8347673c5e 100644 --- a/src/backend/access/gist/gistutil.c +++ b/src/backend/access/gist/gistutil.c @@ -1004,23 +1004,44 @@ gistproperty(Oid index_oid, int attno, } /* - * Temporary and unlogged GiST indexes are not WAL-logged, but we need LSNs - * to detect concurrent page splits anyway. This function provides a fake - * sequence of LSNs for that purpose. + * Temporary, unlogged GiST and WAL-skipped indexes are not WAL-logged, but we + * need LSNs to detect concurrent page splits anyway. This function provides a + * fake sequence of LSNs for that purpose. */ XLogRecPtr gistGetFakeLSN(Relation rel) { - static XLogRecPtr counter = FirstNormalUnloggedLSN; - if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP) { /* * Temporary relations are only accessible in our session, so a simple * backend-local counter will do. */ + static XLogRecPtr counter = FirstNormalUnloggedLSN; + return counter++; } + else if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT) + { + /* + * WAL-logging on this relation will start after commit, so the LSN + * must be distinct numbers smaller than the LSN at the next + * commit. Emit a dummy WAL record if insert-LSN hasn't advanced after + * the last call. + */ + static XLogRecPtr lastlsn = InvalidXLogRecPtr; + XLogRecPtr currlsn = GetXLogInsertRecPtr(); + + /* Shouldn't be called for WAL-logging relations */ + Assert(!RelationNeedsWAL(rel)); + + /* No need for an actual record if we alredy have a distinct LSN */ + if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn) + currlsn = gistXLogAssignLSN(); + + lastlsn = currlsn; + return currlsn; + } else { /* diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c index 3b28f54646..ce17bc9dc3 100644 --- a/src/backend/access/gist/gistxlog.c +++ b/src/backend/access/gist/gistxlog.c @@ -449,6 +449,9 @@ gist_redo(XLogReaderState *record) case XLOG_GIST_PAGE_DELETE: gistRedoPageDelete(record); break; + case XLOG_GIST_ASSIGN_LSN: + /* nop. See gistGetFakeLSN(). */ + break; default: elog(PANIC, "gist_redo: unknown op code %u", info); } @@ -592,6 +595,24 @@ gistXLogPageDelete(Buffer buffer, FullTransactionId xid, return recptr; } +/* + * Write an empty XLOG record to assign a distinct LSN. 
+ */ +XLogRecPtr +gistXLogAssignLSN(void) +{ + int dummy = 0; + + /* + * Records other than SWITCH_WAL must have content. We use an integer 0 to + * follow the restriction. + */ + XLogBeginInsert(); + XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT); + XLogRegisterData((char*) &dummy, sizeof(dummy)); + return XLogInsert(RM_GIST_ID, XLOG_GIST_ASSIGN_LSN); +} + /* * Write XLOG record about reuse of a deleted page. */ diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index e6d2b5f007..cf37e350c9 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -21,7 +21,6 @@ * heap_multi_insert - insert multiple tuples into a relation * heap_delete - delete a tuple from a relation * heap_update - replace a tuple in a relation with another tuple - * heap_sync - sync heap, for when no WAL has been written * * NOTES * This file contains the heap_ routines which implement @@ -1936,7 +1935,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); /* XLOG stuff */ - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + if (RelationNeedsWAL(relation)) { xl_heap_insert xlrec; xl_heap_header xlhdr; @@ -2119,7 +2118,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, /* currently not needed (thus unsupported) for heap_multi_insert() */ AssertArg(!(options & HEAP_INSERT_NO_LOGICAL)); - needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation); + needwal = RelationNeedsWAL(relation); saveFreeSpace = RelationGetTargetPageFreeSpace(relation, HEAP_DEFAULT_FILLFACTOR); @@ -8920,46 +8919,6 @@ heap2_redo(XLogReaderState *record) } } -/* - * heap_sync - sync a heap, for use when no WAL has been written - * - * This forces the heap contents (including TOAST heap if any) down to disk. - * If we skipped using WAL, and WAL is otherwise needed, we must force the - * relation down to disk before it's safe to commit the transaction. This - * requires writing out any dirty buffers and then doing a forced fsync. - * - * Indexes are not touched. (Currently, index operations associated with - * the commands that use this are WAL-logged and so do not need fsync. - * That behavior might change someday, but in any case it's likely that - * any fsync decisions required would be per-index and hence not appropriate - * to be done here.) - */ -void -heap_sync(Relation rel) -{ - /* non-WAL-logged tables never need fsync */ - if (!RelationNeedsWAL(rel)) - return; - - /* main heap */ - FlushRelationBuffers(rel); - /* FlushRelationBuffers will have opened rd_smgr */ - smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM); - - /* FSM is not critical, don't bother syncing it */ - - /* toast heap, if any */ - if (OidIsValid(rel->rd_rel->reltoastrelid)) - { - Relation toastrel; - - toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock); - FlushRelationBuffers(toastrel); - smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM); - table_close(toastrel, AccessShareLock); - } -} - /* * Mask a heap page before performing consistency checks on it. 
*/ diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 72729f744b..b3de2d37bf 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -555,17 +555,6 @@ tuple_lock_retry: return result; } -static void -heapam_finish_bulk_insert(Relation relation, int options) -{ - /* - * If we skipped writing WAL, then we need to sync the heap (but not - * indexes since those use WAL anyway / don't go through tableam) - */ - if (options & HEAP_INSERT_SKIP_WAL) - heap_sync(relation); -} - /* ------------------------------------------------------------------------ * DDL related callbacks for heap AM. @@ -698,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, IndexScanDesc indexScan; TableScanDesc tableScan; HeapScanDesc heapScan; - bool use_wal; bool is_system_catalog; Tuplesortstate *tuplesort; TupleDesc oldTupDesc = RelationGetDescr(OldHeap); @@ -713,12 +701,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, is_system_catalog = IsSystemRelation(OldHeap); /* - * We need to log the copied data in WAL iff WAL archiving/streaming is - * enabled AND it's a WAL-logged rel. + * Valid smgr_targblock implies something already wrote to the relation. + * This may be harmless, but this function hasn't planned for it. */ - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); - - /* use_wal off requires smgr_targblock be initially invalid */ Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); /* Preallocate values/isnull arrays */ @@ -728,7 +713,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, /* Initialize the rewrite operation */ rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff, - *multi_cutoff, use_wal); + *multi_cutoff); /* Set up sorting if wanted */ @@ -2515,7 +2500,6 @@ static const TableAmRoutine heapam_methods = { .tuple_delete = heapam_tuple_delete, .tuple_update = heapam_tuple_update, .tuple_lock = heapam_tuple_lock, - .finish_bulk_insert = heapam_finish_bulk_insert, .tuple_fetch_row_version = heapam_fetch_row_version, .tuple_get_latest_tid = heap_get_latest_tid, diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index d285b1f390..3e564838fa 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -136,7 +136,6 @@ typedef struct RewriteStateData Page rs_buffer; /* page currently being built */ BlockNumber rs_blockno; /* block where page will go */ bool rs_buffer_valid; /* T if any tuples in buffer */ - bool rs_use_wal; /* must we WAL-log inserts? */ bool rs_logical_rewrite; /* do we need to do logical rewriting */ TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine * tuple visibility */ @@ -230,15 +229,13 @@ static void logical_end_heap_rewrite(RewriteState state); * oldest_xmin xid used by the caller to determine which tuples are dead * freeze_xid xid before which tuples will be frozen * cutoff_multi multixact before which multis will be removed - * use_wal should the inserts to the new heap be WAL-logged? * * Returns an opaque RewriteState, allocated in current memory context, * to be used in subsequent calls to the other functions. 
*/ RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin, - TransactionId freeze_xid, MultiXactId cutoff_multi, - bool use_wal) + TransactionId freeze_xid, MultiXactId cutoff_multi) { RewriteState state; MemoryContext rw_cxt; @@ -263,7 +260,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm /* new_heap needn't be empty, just locked */ state->rs_blockno = RelationGetNumberOfBlocks(new_heap); state->rs_buffer_valid = false; - state->rs_use_wal = use_wal; state->rs_oldest_xmin = oldest_xmin; state->rs_freeze_xid = freeze_xid; state->rs_cutoff_multi = cutoff_multi; @@ -322,7 +318,7 @@ end_heap_rewrite(RewriteState state) /* Write the last page, if any */ if (state->rs_buffer_valid) { - if (state->rs_use_wal) + if (RelationNeedsWAL(state->rs_new_rel)) log_newpage(&state->rs_new_rel->rd_node, MAIN_FORKNUM, state->rs_blockno, @@ -337,18 +333,14 @@ end_heap_rewrite(RewriteState state) } /* - * If the rel is WAL-logged, must fsync before commit. We use heap_sync - * to ensure that the toast table gets fsync'd too. - * - * It's obvious that we must do this when not WAL-logging. It's less - * obvious that we have to do it even if we did WAL-log the pages. The + * When we WAL-logged rel pages, we must nonetheless fsync them. The * reason is the same as in storage.c's RelationCopyStorage(): we're * writing data that's not in shared buffers, and so a CHECKPOINT * occurring during the rewriteheap operation won't have fsync'd data we * wrote before the checkpoint. */ if (RelationNeedsWAL(state->rs_new_rel)) - heap_sync(state->rs_new_rel); + smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM); logical_end_heap_rewrite(state); @@ -646,9 +638,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup) { int options = HEAP_INSERT_SKIP_FSM; - if (!state->rs_use_wal) - options |= HEAP_INSERT_SKIP_WAL; - /* * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data * for the TOAST table are not logically decoded. The main heap is @@ -687,7 +676,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup) /* Doesn't fit, so write out the existing page */ /* XLOG stuff */ - if (state->rs_use_wal) + if (RelationNeedsWAL(state->rs_new_rel)) log_newpage(&state->rs_new_rel->rd_node, MAIN_FORKNUM, state->rs_blockno, diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index c8110a130a..f419e92b35 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -31,18 +31,6 @@ * them. They will need to be re-read into shared buffers on first use after * the build finishes. * - * Since the index will never be used unless it is completely built, - * from a crash-recovery point of view there is no need to WAL-log the - * steps of the build. After completing the index build, we can just sync - * the whole file to disk using smgrimmedsync() before exiting this module. - * This can be seen to be sufficient for crash recovery by considering that - * it's effectively equivalent to what would happen if a CHECKPOINT occurred - * just after the index build. However, it is clearly not sufficient if the - * DBA is using the WAL log for PITR or replication purposes, since another - * machine would not be able to reconstruct the index from WAL. Therefore, - * we log the completed index pages to WAL if and only if WAL archiving is - * active. - * * This code isn't concerned about the FSM at all. The caller is responsible * for initializing that. 
* @@ -563,12 +551,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2) wstate.heap = btspool->heap; wstate.index = btspool->index; wstate.inskey = _bt_mkscankey(wstate.index, NULL); - - /* - * We need to log index creation in WAL iff WAL archiving/streaming is - * enabled UNLESS the index isn't WAL-logged anyway. - */ - wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index); + wstate.btws_use_wal = RelationNeedsWAL(wstate.index); /* reserve the metapage */ wstate.btws_pages_alloced = BTREE_METAPAGE + 1; @@ -1265,21 +1248,15 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2) _bt_uppershutdown(wstate, state); /* - * If the index is WAL-logged, we must fsync it down to disk before it's - * safe to commit the transaction. (For a non-WAL-logged index we don't - * care since the index will be uninteresting after a crash anyway.) - * - * It's obvious that we must do this when not WAL-logging the build. It's - * less obvious that we have to do it even if we did WAL-log the index - * pages. The reason is that since we're building outside shared buffers, - * a CHECKPOINT occurring during the build has no way to flush the - * previously written data to disk (indeed it won't know the index even - * exists). A crash later on would replay WAL from the checkpoint, - * therefore it wouldn't replay our earlier WAL entries. If we do not - * fsync those pages here, they might still not be on disk when the crash - * occurs. + * When we WAL-logged index pages, we must nonetheless fsync index files. + * Since we're building outside shared buffers, a CHECKPOINT occurring + * during the build has no way to flush the previously written data to + * disk (indeed it won't know the index even exists). A crash later on + * would replay WAL from the checkpoint, therefore it wouldn't replay our + * earlier WAL entries. If we do not fsync those pages here, they might + * still not be on disk when the crash occurs. */ - if (RelationNeedsWAL(wstate->index)) + if (wstate->btws_use_wal) { RelationOpenSmgr(wstate->index); smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM); diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c index eccb6fd942..48cda40ac0 100644 --- a/src/backend/access/rmgrdesc/gistdesc.c +++ b/src/backend/access/rmgrdesc/gistdesc.c @@ -80,6 +80,9 @@ gist_desc(StringInfo buf, XLogReaderState *record) case XLOG_GIST_PAGE_DELETE: out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec); break; + case XLOG_GIST_ASSIGN_LSN: + /* No details to write out */ + break; } } @@ -104,6 +107,8 @@ gist_identify(uint8 info) break; case XLOG_GIST_PAGE_DELETE: id = "PAGE_DELETE"; + case XLOG_GIST_ASSIGN_LSN: + id = "ASSIGN_LSN"; break; } diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index b5a2cb2de8..641809cfda 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -717,6 +717,40 @@ then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +Skipping WAL for New RelFileNode +-------------------------------- + +Under wal_level=minimal, if a change modifies a relfilenode that +RollbackAndReleaseCurrentSubTransaction() would unlink, in-tree access methods +write no WAL for that change. For any access method, CommitTransaction() +writes and fsyncs affected blocks before recording the commit. 
This skipping +is mandatory; if a WAL-writing change preceded a WAL-skipping change for the +same block, REDO could overwrite the WAL-skipping change. Code that writes +WAL without calling RelationNeedsWAL() must check for this case. + +If skipping were not mandatory, a related problem would arise. Suppose, under +full_page_writes=off, a WAL-writing change follows a WAL-skipping change. +When a WAL record contains no full-page image, REDO expects the page to match +its contents from just before record insertion. A WAL-skipping change may not +reach disk at all, violating REDO's expectation. + +Prefer to do the same in future access methods. However, two other approaches +can work. First, an access method can irreversibly transition a given fork +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and +smgrimmedsync(). Second, an access method can opt to write WAL +unconditionally for permanent relations. When using the second method, do not +call RelationCopyStorage(), which skips WAL. + +This applies only to WAL records whose replay would modify bytes stored in the +new relfilenode. It does not apply to other records about the relfilenode, +such as XLOG_SMGR_CREATE. Because it operates at the level of individual +relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations. +Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which +ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while +the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table +to skip WAL, but that won't affect its indexes. + + Asynchronous Commit ------------------- @@ -820,13 +854,12 @@ Changes to a temp table are not WAL-logged, hence could reach disk in advance of T1's commit, but we don't care since temp table contents don't survive crashes anyway. -Database writes made via any of the paths we have introduced to avoid WAL -overhead for bulk updates are also safe. In these cases it's entirely -possible for the data to reach disk before T1's commit, because T1 will -fsync it down to disk without any sort of interlock, as soon as it finishes -the bulk update. However, all these paths are designed to write data that -no other transaction can see until after T1 commits. The situation is thus -not different from ordinary WAL-logged updates. +Database writes that skip WAL for new relfilenodes are also safe. In these +cases it's entirely possible for the data to reach disk before T1's commit, +because T1 will fsync it down to disk without any sort of interlock. However, +all these paths are designed to write data that no other transaction can see +until after T1 commits. The situation is thus not different from ordinary +WAL-logged updates. Transaction Emulation during Recovery ------------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 5353b6ab0b..526825315c 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2109,6 +2109,13 @@ CommitTransaction(void) */ PreCommit_on_commit_actions(); + /* + * Synchronize files that are created and not WAL-logged during this + * transaction. This must happen before AtEOXact_RelationMap(), so that we + * don't see committed-but-broken files after a crash. 
+ */ + smgrDoPendingSyncs(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2342,6 +2349,13 @@ PrepareTransaction(void) */ PreCommit_on_commit_actions(); + /* + * Synchronize files that are created and not WAL-logged during this + * transaction. This must happen before EndPrepare(), so that we don't see + * committed-but-broken files after a crash and COMMIT PREPARED. + */ + smgrDoPendingSyncs(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2660,6 +2674,7 @@ AbortTransaction(void) */ AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); + smgrDoPendingSyncs(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index aa9dca0036..dda1dea08b 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -1043,8 +1043,13 @@ log_newpage_range(Relation rel, ForkNumber forkNum, BlockNumber startblk, BlockNumber endblk, bool page_std) { + int flags; BlockNumber blkno; + flags = REGBUF_FORCE_IMAGE; + if (page_std) + flags |= REGBUF_STANDARD; + /* * Iterate over all the pages in the range. They are collected into * batches of XLR_MAX_BLOCK_ID pages, and a single WAL-record is written @@ -1066,7 +1071,8 @@ log_newpage_range(Relation rel, ForkNumber forkNum, nbufs = 0; while (nbufs < XLR_MAX_BLOCK_ID && blkno < endblk) { - Buffer buf = ReadBuffer(rel, blkno); + Buffer buf = ReadBufferExtended(rel, forkNum, blkno, + RBM_NORMAL, NULL); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); @@ -1088,7 +1094,7 @@ log_newpage_range(Relation rel, ForkNumber forkNum, START_CRIT_SECTION(); for (i = 0; i < nbufs; i++) { - XLogRegisterBuffer(i, bufpack[i], REGBUF_FORCE_IMAGE | REGBUF_STANDARD); + XLogRegisterBuffer(i, bufpack[i], flags); MarkBufferDirty(bufpack[i]); } diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 14efbf37d6..5889f4004b 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry; * fields related to physical storage, like rd_rel, are initialized, so the * fake entry is only usable in low-level operations like ReadBuffer(). * + * This is also used for syncing WAL-skipped files. + * * Caller must free the returned entry with FreeFakeRelcacheEntry(). */ Relation @@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode) FakeRelCacheEntry fakeentry; Relation rel; - Assert(InRecovery); - /* Allocate the Relation struct and all related space in one block. */ fakeentry = palloc0(sizeof(FakeRelCacheEntryData)); rel = (Relation) fakeentry; rel->rd_rel = &fakeentry->pgc; rel->rd_node = rnode; - /* We will never be working with temp rels during recovery */ + /* + * We will never be working with temp rels during recovery or while + * syncing WAL-skipped files. + */ rel->rd_backend = InvalidBackendId; - /* It must be a permanent table if we're in recovery. */ + /* It must be a permanent table here */ rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT; /* We don't know the name of the relation; use relfilenode instead */ @@ -572,9 +575,9 @@ CreateFakeRelcacheEntry(RelFileNode rnode) /* * We set up the lockRelId in case anything tries to lock the dummy * relation. Note that this is fairly bogus since relNode may be - * different from the relation's OID. 
It shouldn't really matter though, - * since we are presumably running by ourselves and can't have any lock - * conflicts ... + * different from the relation's OID. It shouldn't really matter though. + * In recovery, we are running by ourselves and can't have any lock + * conflicts. While syncing, we already hold AccessExclusiveLock. */ rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode; rel->rd_lockInfo.lockRelId.relId = rnode.relNode; diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index c9b3e17dc1..da05a827d8 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -440,6 +440,10 @@ heap_create(const char *relname, break; } } + else + { + rel->rd_createSubid = InvalidSubTransactionId; + } return rel; } diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 056ea3d5d3..fb34cf602a 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -29,9 +29,13 @@ #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" +#include "utils/hsearch.h" #include "utils/memutils.h" #include "utils/rel.h" +/* GUC variables */ +int wal_skip_threshold = 4096; /* in kilobytes */ + /* * We keep a list of all relations (represented as RelFileNode values) * that have been created or deleted in the current transaction. When @@ -61,7 +65,14 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +typedef struct pendingSync +{ + RelFileNode rnode; + BlockNumber max_truncated; +} pendingSync; + static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +HTAB *pendingSyncHash = NULL; /* * RelationCreateStorage @@ -117,6 +128,36 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending->next = pendingDeletes; pendingDeletes = pending; + /* + * If the relation needs at-commit sync, we also need to track the maximum + * unsynced truncated block used to decide whether we can WAL-logging or we + * must sync the file in smgrDoPendingSyncs. + */ + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) + { + pendingSync *pending; + bool found; + + /* we sync only permanent relations */ + Assert(backend == InvalidBackendId); + + if (!pendingSyncHash) + { + HASHCTL ctl; + + ctl.keysize = sizeof(RelFileNode); + ctl.entrysize = sizeof(pendingSync); + ctl.hcxt = TopTransactionContext; + pendingSyncHash = + hash_create("max truncatd block hash", + 16, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + } + + pending = hash_search(pendingSyncHash, &rnode, HASH_ENTER, &found); + Assert(!found); + pending->max_truncated = InvalidBlockNumber; + } + return srel; } @@ -312,6 +353,22 @@ RelationTruncate(Relation rel, BlockNumber nblocks) if (fsm || vm) XLogFlush(lsn); } + else if (pendingSyncHash) + { + pendingSync *pending; + + /* Record largest maybe-unsynced block of files under tracking */ + pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node), + HASH_FIND, NULL); + if (pending) + { + BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM); + + if (!BlockNumberIsValid(pending->max_truncated) || + pending->max_truncated < nblocks) + pending->max_truncated = nblocks; + } + } /* Do the real work to truncate relation forks */ smgrtruncate(rel->rd_smgr, forks, nforks, blocks); @@ -355,7 +412,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, /* * We need to log the copied data in WAL iff WAL archiving/streaming is - * enabled AND it's a permanent relation. + * enabled AND it's a permanent relation. 
This gives the same answer as + * "RelationNeedsWAL(rel) || copying_initfork", because we know the + * current operation created a new relfilenode. */ use_wal = XLogIsNeeded() && (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); @@ -397,24 +456,42 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst, } /* - * If the rel is WAL-logged, must fsync before commit. We use heap_sync - * to ensure that the toast table gets fsync'd too. (For a temp or - * unlogged rel we don't care since the data will be gone after a crash - * anyway.) - * - * It's obvious that we must do this when not WAL-logging the copy. It's - * less obvious that we have to do it even if we did WAL-log the copied - * pages. The reason is that since we're copying outside shared buffers, a - * CHECKPOINT occurring during the copy has no way to flush the previously - * written data to disk (indeed it won't know the new rel even exists). A - * crash later on would replay WAL from the checkpoint, therefore it - * wouldn't replay our earlier WAL entries. If we do not fsync those pages - * here, they might still not be on disk when the crash occurs. + * When we WAL-logged rel pages, we must nonetheless fsync them. The + * reason is that since we're copying outside shared buffers, a CHECKPOINT + * occurring during the copy has no way to flush the previously written + * data to disk (indeed it won't know the new rel even exists). A crash + * later on would replay WAL from the checkpoint, therefore it wouldn't + * replay our earlier WAL entries. If we do not fsync those pages here, + * they might still not be on disk when the crash occurs. */ - if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork) + if (use_wal || copying_initfork) smgrimmedsync(dst, forkNum); } +/* + * RelFileNodeSkippingWAL - check if a BM_PERMANENT relfilenode is using WAL + * + * Changes of certain relfilenodes must not write WAL; see "Skipping WAL for + * New RelFileNode" in src/backend/access/transam/README. Though it is + * known from Relation efficiently, this function is intended for the code + * paths not having access to Relation. + */ +bool +RelFileNodeSkippingWAL(RelFileNode rnode) +{ + if (XLogIsNeeded()) + return false; /* no permanent relfilenode skips WAL */ + + if (!pendingSyncHash) + return false; /* we don't have a to-be-synced relation */ + + /* the relation is not tracked as to-be-synced */ + if (hash_search(pendingSyncHash, &rnode, HASH_FIND, NULL) == NULL) + return false; + + return true; +} + /* * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact. * @@ -492,6 +569,156 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingSyncs() -- Take care of relation syncs at commit. + * + * This should be called before smgrDoPendingDeletes() at every commit or + * prepare. Also this should be called before emitting WAL record so that sync + * failure prevents commit. 
+ */ +void +smgrDoPendingSyncs(bool isCommit) +{ + PendingRelDelete *pending; + int nrels = 0, + maxrels = 0; + SMgrRelation *srels = NULL; + HASH_SEQ_STATUS scan; + pendingSync *pendingsync; + + if (XLogIsNeeded()) + return; /* no relation can use this */ + + Assert(GetCurrentTransactionNestLevel() == 1); + AssertPendingSyncs_RelationCache(); + + if (!pendingSyncHash) + return; /* no relation needs sync */ + + /* Just throw away all pending syncs if any at rollback */ + if (!isCommit) + { + if (pendingSyncHash) + { + hash_destroy(pendingSyncHash); + pendingSyncHash = NULL; + } + return; + } + + /* + * Pending syncs on the relation that are to be deleted in this + * transaction-end should be ignored. Remove sync hash entries entries for + * relations that will be deleted in the following call to + * smgrDoPendingDeletes(). + */ + for (pending = pendingDeletes; pending != NULL; pending = pending->next) + { + if (!pending->atCommit) + continue; + + (void) hash_search(pendingSyncHash, (void *) &pending->relnode, + HASH_REMOVE, NULL); + } + + hash_seq_init(&scan, pendingSyncHash); + while ((pendingsync = (pendingSync *) hash_seq_search(&scan))) + { + ForkNumber fork; + BlockNumber nblocks[MAX_FORKNUM + 1]; + BlockNumber total_blocks = 0; + SMgrRelation srel; + + srel = smgropen(pendingsync->rnode, InvalidBackendId); + + /* + * We emit newpage WAL records for smaller relations. + * + * Small WAL records have a chance to be emitted along with other + * backends' WAL records. We emit WAL records instead of syncing for + * files that are smaller than a certain threshold, expecting faster + * commit. The threshold is defined by the GUC wal_skip_threshold. + */ + for (fork = 0 ; fork <= MAX_FORKNUM ; fork++) + { + if (smgrexists(srel, fork)) + { + BlockNumber n = smgrnblocks(srel, fork); + + /* we shouldn't come here for unlogged relations */ + Assert(fork != INIT_FORKNUM); + + nblocks[fork] = n; + total_blocks += n; + } + else + nblocks[fork] = InvalidBlockNumber; + } + + /* + * Sync file or emit WAL record for the file according to the total + * size. Do file sync if the size is larger than the threshold or + * truncates may have left blocks beyond the current size. + */ + if (total_blocks * BLCKSZ >= wal_skip_threshold * 1024 || + (BlockNumberIsValid(pendingsync->max_truncated) && + smgrnblocks(srel, MAIN_FORKNUM) < pendingsync->max_truncated)) + { + /* relations to sync are passed to smgrdosyncall at once */ + + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; + } + else + { + /* + * Emit WAL records for all blocks. We don't emit + * XLOG_SMGR_TRUNCATE record because the past truncations haven't + * left unlogged pages here. + */ + for (fork = 0 ; fork <= MAX_FORKNUM ; fork++) + { + int n = nblocks[fork]; + Relation rel; + + if (!BlockNumberIsValid(n)) + continue; + + /* + * Emit WAL for the whole file. Unfortunately we don't know + * what kind of a page this is, so we have to log the full + * page including any unused space. ReadBufferExtended() + * counts some pgstat events; unfortunately, we discard them. 
+ */ + rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node); + log_newpage_range(rel, fork, 0, n, false); + FreeFakeRelcacheEntry(rel); + } + } + } + + Assert (pendingSyncHash); + hash_destroy(pendingSyncHash); + pendingSyncHash = NULL; + + if (nrels > 0) + { + smgrdosyncall(srels, nrels); + pfree(srels); + } +} + /* * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted. * diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index cc35811dc8..de8e5a43d9 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1014,6 +1014,8 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, relfilenode2; Oid swaptemp; char swptmpchr; + Relation rel1; + Relation rel2; /* We need writable copies of both pg_class tuples. */ relRelation = table_open(RelationRelationId, RowExclusiveLock); @@ -1039,6 +1041,7 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, */ Assert(!target_is_pg_class); + /* swap relfilenodes, reltablespaces, relpersistence */ swaptemp = relform1->relfilenode; relform1->relfilenode = relform2->relfilenode; relform2->relfilenode = swaptemp; @@ -1173,6 +1176,34 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, CacheInvalidateRelcacheByTuple(reltup2); } + /* + * Recognize that rel1's relfilenode (swapped from rel2) is new in this + * subtransaction. However the next step for rel2 is deletion, we need to + * turn off the newness of its relfilenode, that allows the relcache to be + * flushed. Requried lock must be held before getting here so we take + * AccessShareLock in case no lock is acquired. Since command counter is + * not advanced the relcache entries has the contens before the above + * updates. We don't bother incrementing it and swap their contents + * directly. + */ + rel1 = relation_open(r1, AccessShareLock); + rel2 = relation_open(r2, AccessShareLock); + + /* swap relfilenodes */ + rel1->rd_node.relNode = relfilenode2; + rel2->rd_node.relNode = relfilenode1; + + /* + * Adjust newness flags. relfilenode2 is already added to EOXact array so + * we don't need to do that again here. We assume the new file is created + * in the current subtransaction. + */ + RelationAssumeNewRelfilenode(rel1); + rel2->rd_createSubid = InvalidSubTransactionId; + + relation_close(rel1, AccessShareLock); + relation_close(rel2, AccessShareLock); + /* * Post alter hook for modified relations. The change to r2 is always * internal, but r1 depends on the invocation context. diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 42a147b67d..607e2558a3 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2711,63 +2711,15 @@ CopyFrom(CopyState cstate) RelationGetRelationName(cstate->rel)))); } - /*---------- - * Check to see if we can avoid writing WAL - * - * If archive logging/streaming is not enabled *and* either - * - table was created in same transaction as this COPY - * - data is being written to relfilenode created in this transaction - * then we can skip writing WAL. It's safe because if the transaction - * doesn't commit, we'll discard the table (or the new relfilenode file). - * If it does commit, we'll have done the table_finish_bulk_insert() at - * the bottom of this routine first. - * - * As mentioned in comments in utils/rel.h, the in-same-transaction test - * is not always set correctly, since in rare cases rd_newRelfilenodeSubid - * can be cleared before the end of the transaction. 
The exact case is - * when a relation sets a new relfilenode twice in same transaction, yet - * the second one fails in an aborted subtransaction, e.g. - * - * BEGIN; - * TRUNCATE t; - * SAVEPOINT save; - * TRUNCATE t; - * ROLLBACK TO save; - * COPY ... - * - * Also, if the target file is new-in-transaction, we assume that checking - * FSM for free space is a waste of time, even if we must use WAL because - * of archiving. This could possibly be wrong, but it's unlikely. - * - * The comments for table_tuple_insert and RelationGetBufferForTuple - * specify that skipping WAL logging is only safe if we ensure that our - * tuples do not go into pages containing tuples from any other - * transactions --- but this must be the case if we have a new table or - * new relfilenode, so we need no additional work to enforce that. - * - * We currently don't support this optimization if the COPY target is a - * partitioned table as we currently only lazily initialize partition - * information when routing the first tuple to the partition. We cannot - * know at this stage if we can perform this optimization. It should be - * possible to improve on this, but it does mean maintaining heap insert - * option flags per partition and setting them when we first open the - * partition. - * - * This optimization is not supported for relation types which do not - * have any physical storage, with foreign tables and views using - * INSTEAD OF triggers entering in this category. Partitioned tables - * are not supported as per the description above. - *---------- + /* + * If the target file is new-in-transaction, we assume that checking FSM + * for free space is a waste of time. This could possibly be wrong, but + * it's unlikely. */ - /* createSubid is creation check, newRelfilenodeSubid is truncation check */ if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) && (cstate->rel->rd_createSubid != InvalidSubTransactionId || - cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)) - { + cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId)) ti_options |= TABLE_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) - ti_options |= TABLE_INSERT_SKIP_WAL; - } /* * Optimize if new relfilenode was created in this subxact or one of its diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index 2bf7083719..20225dc62f 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -552,16 +552,13 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) myState->rel = intoRelationDesc; myState->reladdr = intoRelationAddr; myState->output_cid = GetCurrentCommandId(true); + myState->ti_options = TABLE_INSERT_SKIP_FSM; + myState->bistate = GetBulkInsertState(); /* - * We can skip WAL-logging the insertions, unless PITR or streaming - * replication is in use. We can skip the FSM in any case. + * Valid smgr_targblock implies something already wrote to the relation. + * This may be harmless, but this function hasn't planned for it. */ - myState->ti_options = TABLE_INSERT_SKIP_FSM | - (XLogIsNeeded() ? 
0 : TABLE_INSERT_SKIP_WAL); - myState->bistate = GetBulkInsertState(); - - /* Not using WAL requires smgr_targblock be initially invalid */ Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber); } diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c index 907c71dda0..823d663f52 100644 --- a/src/backend/commands/matview.c +++ b/src/backend/commands/matview.c @@ -457,17 +457,13 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) */ myState->transientrel = transientrel; myState->output_cid = GetCurrentCommandId(true); - - /* - * We can skip WAL-logging the insertions, unless PITR or streaming - * replication is in use. We can skip the FSM in any case. - */ myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN; - if (!XLogIsNeeded()) - myState->ti_options |= TABLE_INSERT_SKIP_WAL; myState->bistate = GetBulkInsertState(); - /* Not using WAL requires smgr_targblock be initially invalid */ + /* + * Valid smgr_targblock implies something already wrote to the relation. + * This may be harmless, but this function hasn't planned for it. + */ Assert(RelationGetTargetBlock(transientrel) == InvalidBlockNumber); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index a776e652f4..c949ce259c 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -4766,19 +4766,14 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) newrel = NULL; /* - * Prepare a BulkInsertState and options for table_tuple_insert. Because - * we're building a new heap, we can skip WAL-logging and fsync it to disk - * at the end instead (unless WAL-logging is required for archiving or - * streaming replication). The FSM is empty too, so don't bother using it. + * Prepare a BulkInsertState and options for table_tuple_insert. The FSM + * is empty, so don't bother using it. */ if (newrel) { mycid = GetCurrentCommandId(true); bistate = GetBulkInsertState(); - ti_options = TABLE_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) - ti_options |= TABLE_INSERT_SKIP_WAL; } else { @@ -12432,6 +12427,8 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) table_close(pg_class, RowExclusiveLock); + RelationAssumeNewRelfilenode(rel); + relation_close(rel, NoLock); /* Make sure the reltablespace change is visible */ diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 1f10a97dc7..1761d733a1 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -105,6 +105,19 @@ typedef struct CkptTsStatus int index; } CkptTsStatus; +/* + * Type for array used to sort SMgrRelations + * + * FlushRelFileNodesAllBuffers shares the same comparator function with + * DropRelFileNodeBuffers. Pointer to this struct and RelFileNode must + * be compatible. + */ +typedef struct SMgrSortArray +{ + RelFileNode rnode; /* This must be the first member */ + SMgrRelation srel; +} SMgrSortArray; + /* GUC variables */ bool zero_damaged_pages = false; int bgwriter_lru_maxpages = 100; @@ -3293,6 +3306,106 @@ FlushRelationBuffers(Relation rel) } } +/* --------------------------------------------------------------------- + * FlushRelFileNodesAllBuffers + * + * This function flushes out the buffer pool all the pages of all + * forks of the specified smgr relations. 
It's equivalent to + * calling FlushRelationBuffers once per fork per relation, but the + * parameter is not Relation but SMgrRelation + * -------------------------------------------------------------------- + */ +void +FlushRelFileNodesAllBuffers(SMgrRelation *smgrs, int nrels) +{ + int i; + SMgrSortArray *srels; + bool use_bsearch; + + if (nrels == 0) + return; + + /* fill-in array for qsort */ + srels = palloc(sizeof(SMgrSortArray) * nrels); + + for (i = 0 ; i < nrels ; i++) + { + Assert (!RelFileNodeBackendIsTemp(smgrs[i]->smgr_rnode)); + + srels[i].rnode = smgrs[i]->smgr_rnode.node; + srels[i].srel = smgrs[i]; + } + + /* + * Save the bsearch overhead for low number of relations to + * sync. See DropRelFileNodesAllBuffers for details. The name DROP_* + * is for historical reasons. + */ + use_bsearch = nrels > DROP_RELS_BSEARCH_THRESHOLD; + + /* sort the list of SMgrRelations if necessary */ + if (use_bsearch) + pg_qsort(srels, nrels, sizeof(SMgrSortArray), rnode_comparator); + + /* Make sure we can handle the pin inside the loop */ + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + SMgrSortArray *srelent = NULL; + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + /* + * As in DropRelFileNodeBuffers, an unlocked precheck should be safe + * and saves some cycles. + */ + + if (!use_bsearch) + { + int j; + + for (j = 0; j < nrels; j++) + { + if (RelFileNodeEquals(bufHdr->tag.rnode, srels[j].rnode)) + { + srelent = &srels[j]; + break; + } + } + + } + else + { + srelent = bsearch((const void *) &(bufHdr->tag.rnode), + srels, nrels, sizeof(SMgrSortArray), + rnode_comparator); + } + + /* buffer doesn't belong to any of the given relfilenodes; skip it */ + if (srelent == NULL) + continue; + + /* Ensure there's a free array slot for PinBuffer_Locked */ + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + if (RelFileNodeEquals(bufHdr->tag.rnode, srelent->rnode) && + (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); + FlushBuffer(bufHdr, srelent->srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + + pfree(srels); +} + /* --------------------------------------------------------------------- * FlushDatabaseBuffers * @@ -3494,13 +3607,15 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std) (pg_atomic_read_u32(&bufHdr->state) & BM_PERMANENT)) { /* - * If we're in recovery we cannot dirty a page because of a hint. - * We can set the hint, just not dirty the page as a result so the - * hint is lost when we evict the page or shutdown. + * If we must not write WAL, due to a relfilenode-specific + * condition or being in recovery, don't dirty the page. We can + * set the hint, just not dirty the page as a result so the hint + * is lost when we evict the page or shutdown. * * See src/backend/storage/page/README for longer discussion. 
*/ - if (RecoveryInProgress()) + if (RecoveryInProgress() || + RelFileNodeSkippingWAL(bufHdr->tag.rnode)) return; /* diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 82442db046..15081660bd 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -248,11 +248,10 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo) * During replay, we would delete the file and then recreate it, which is fine * if the contents of the file were repopulated by subsequent WAL entries. * But if we didn't WAL-log insertions, but instead relied on fsyncing the - * file after populating it (as for instance CLUSTER and CREATE INDEX do), - * the contents of the file would be lost forever. By leaving the empty file - * until after the next checkpoint, we prevent reassignment of the relfilenode - * number until it's safe, because relfilenode assignment skips over any - * existing file. + * file after populating it (as we do at wal_level=minimal), the contents of + * the file would be lost forever. By leaving the empty file until after the + * next checkpoint, we prevent reassignment of the relfilenode number until + * it's safe, because relfilenode assignment skips over any existing file. * * We do not need to go through this dance for temp relations, though, because * we never make WAL entries for temp rels, and so a temp rel poses no threat @@ -891,6 +890,7 @@ void mdimmedsync(SMgrRelation reln, ForkNumber forknum) { int segno; + int min_inactive_seg; /* * NOTE: mdnblocks makes sure we have opened all active segments, so that @@ -898,19 +898,42 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum) */ mdnblocks(reln, forknum); - segno = reln->md_num_open_segs[forknum]; + min_inactive_seg = segno = reln->md_num_open_segs[forknum]; + + /* + * We need to sync all segments including inactive ones here. Temporarily + * open them then close after sync. There may be some inactive segments + * left opened after fsync error but it actually doesn't harm and we don't + * bother clean them up taking a risk of further trouble. + */ + while (_mdfd_openseg(reln, forknum, segno, 0) != NULL) + segno++; while (segno > 0) { MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1]; if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0) + { ereport(data_sync_elevel(ERROR), (errcode_for_file_access(), errmsg("could not fsync file \"%s\": %m", FilePathName(v->mdfd_vfd)))); + } + + /* Close inactive segments immediately */ + if (segno > min_inactive_seg) + { + FileClose(v->mdfd_vfd); + v->mdfd_vfd = -1; + } + segno--; } + + /* shrink fdvec if needed */ + if (min_inactive_seg < reln->md_num_open_segs[forknum]) + _fdvec_resize(reln, forknum, min_inactive_seg); } /* diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index b50c69b438..191b52ab43 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -388,6 +388,43 @@ smgrdounlink(SMgrRelation reln, bool isRedo) smgrsw[which].smgr_unlink(rnode, InvalidForkNumber, isRedo); } +/* + * smgrdosyncall() -- Immediately sync all forks of all given relations + * + * All forks of all given relations are syncd out to the store. + * + * This is equivalent to flusing all buffers FlushRelationBuffers for each + * smgr relation then calling smgrimmedsync for all forks of each smgr + * relation, but it's significantly quicker so should be preferred when + * possible. 
+ */ +void +smgrdosyncall(SMgrRelation *rels, int nrels) +{ + int i = 0; + ForkNumber forknum; + + if (nrels == 0) + return; + + /* We need to flush all buffers for the relations before sync. */ + FlushRelFileNodesAllBuffers(rels, nrels); + + /* + * Sync the physical file(s). + */ + for (i = 0; i < nrels; i++) + { + int which = rels[i]->smgr_which; + + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) + { + if (smgrsw[which].smgr_exists(rels[i], forknum)) + smgrsw[which].smgr_immedsync(rels[i], forknum); + } + } +} + /* * smgrdounlinkall() -- Immediately unlink all forks of all given relations * diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 50f8912c13..e9da83d41e 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -262,6 +262,9 @@ static void RelationReloadIndexInfo(Relation relation); static void RelationReloadNailed(Relation relation); static void RelationFlushRelation(Relation relation); static void RememberToFreeTupleDescAtEOX(TupleDesc td); +#ifdef USE_ASSERT_CHECKING +static void AssertPendingSyncConsistency(Relation relation); +#endif static void AtEOXact_cleanup(Relation relation, bool isCommit); static void AtEOSubXact_cleanup(Relation relation, bool isCommit, SubTransactionId mySubid, SubTransactionId parentSubid); @@ -1095,6 +1098,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt) relation->rd_isnailed = false; relation->rd_createSubid = InvalidSubTransactionId; relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; switch (relation->rd_rel->relpersistence) { case RELPERSISTENCE_UNLOGGED: @@ -1828,6 +1832,7 @@ formrdesc(const char *relationName, Oid relationReltype, relation->rd_isnailed = true; relation->rd_createSubid = InvalidSubTransactionId; relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; relation->rd_backend = InvalidBackendId; relation->rd_islocaltemp = false; @@ -2035,6 +2040,12 @@ RelationIdGetRelation(Oid relationId) rd = RelationBuildDesc(relationId, true); if (RelationIsValid(rd)) RelationIncrementReferenceCount(rd); + +#ifdef USE_ASSERT_CHECKING + if (!XLogIsNeeded() && RelationIsValid(rd)) + AssertPendingSyncConsistency(rd); +#endif + return rd; } @@ -2093,7 +2104,7 @@ RelationClose(Relation relation) #ifdef RELCACHE_FORCE_RELEASE if (RelationHasReferenceCountZero(relation) && relation->rd_createSubid == InvalidSubTransactionId && - relation->rd_newRelfilenodeSubid == InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId) RelationClearRelation(relation, false); #endif } @@ -2509,13 +2520,13 @@ RelationClearRelation(Relation relation, bool rebuild) * problem. * * When rebuilding an open relcache entry, we must preserve ref count, - * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also - * attempt to preserve the pg_class entry (rd_rel), tupledesc, - * rewrite-rule, partition key, and partition descriptor substructures - * in place, because various places assume that these structures won't - * move while they are working with an open relcache entry. (Note: - * the refcount mechanism for tupledescs might someday allow us to - * remove this hack for the tupledesc.) + * rd_*Subid, and rd_toastoid state. 
Also attempt to preserve the + * pg_class entry (rd_rel), tupledesc, rewrite-rule, partition key, + * and partition descriptor substructures in place, because various + * places assume that these structures won't move while they are + * working with an open relcache entry. (Note: the refcount + * mechanism for tupledescs might someday allow us to remove this hack + * for the tupledesc.) * * Note that this process does not touch CurrentResourceOwner; which * is good because whatever ref counts the entry may have do not @@ -2599,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild) /* creation sub-XIDs must be preserved */ SWAPFIELD(SubTransactionId, rd_createSubid); SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid); + SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid); /* un-swap rd_rel pointers, swap contents instead */ SWAPFIELD(Form_pg_class, rd_rel); /* ... but actually, we don't have to update newrel->rd_rel */ @@ -2666,7 +2678,7 @@ static void RelationFlushRelation(Relation relation) { if (relation->rd_createSubid != InvalidSubTransactionId || - relation->rd_newRelfilenodeSubid != InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId) { /* * New relcache entries are always rebuilt, not flushed; else we'd @@ -2751,11 +2763,10 @@ RelationCacheInvalidateEntry(Oid relationId) * relation cache and re-read relation mapping data. * * This is currently used only to recover from SI message buffer overflow, - * so we do not touch new-in-transaction relations; they cannot be targets - * of cross-backend SI updates (and our own updates now go through a - * separate linked list that isn't limited by the SI message buffer size). - * Likewise, we need not discard new-relfilenode-in-transaction hints, - * since any invalidation of those would be a local event. + * so we do not touch relations having new-in-transaction relfilenodes; they + * cannot be targets of cross-backend SI updates (and our own updates now go + * through a separate linked list that isn't limited by the SI message + * buffer size). * * We do this in two phases: the first pass deletes deletable items, and * the second one rebuilds the rebuildable items. This is essential for @@ -2806,7 +2817,7 @@ RelationCacheInvalidate(void) * pending invalidations. */ if (relation->rd_createSubid != InvalidSubTransactionId || - relation->rd_newRelfilenodeSubid != InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId) continue; relcacheInvalsReceived++; @@ -2918,6 +2929,40 @@ RememberToFreeTupleDescAtEOX(TupleDesc td) EOXactTupleDescArray[NextEOXactTupleDescNum++] = td; } +#ifdef USE_ASSERT_CHECKING +static void +AssertPendingSyncConsistency(Relation relation) +{ + bool relcache_verdict = + relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && + ((relation->rd_createSubid != InvalidSubTransactionId && + RELKIND_HAS_STORAGE(relation->rd_rel->relkind)) || + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId); + Assert(relcache_verdict == RelFileNodeSkippingWAL(relation->rd_node)); +} + +/* + * AssertPendingSyncs_RelationCache + * + * Assert that relcache.c and storage.c agree on whether to skip WAL. + * + * This consistently detects relcache.c skipping WAL while storage.c is not + * skipping WAL. It often fails to detect the reverse error, because + * invalidation will have destroyed the relcache entry. It will detect the + * reverse error if something opens the relation after the DDL. 
+ */ +void +AssertPendingSyncs_RelationCache(void) +{ + HASH_SEQ_STATUS status; + RelIdCacheEnt *idhentry; + + hash_seq_init(&status, RelationIdCache); + while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL) + AssertPendingSyncConsistency(idhentry->reldesc); +} +#endif + /* * AtEOXact_RelationCache * @@ -3029,10 +3074,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit) * * During commit, reset the flag to zero, since we are now out of the * creating transaction. During abort, simply delete the relcache entry - * --- it isn't interesting any longer. (NOTE: if we have forgotten the - * new-ness of a new relation due to a forced cache flush, the entry will - * get deleted anyway by shared-cache-inval processing of the aborted - * pg_class insertion.) + * --- it isn't interesting any longer. */ if (relation->rd_createSubid != InvalidSubTransactionId) { @@ -3060,9 +3102,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit) } /* - * Likewise, reset the hint about the relfilenode being new. + * Likewise, reset any record of the relfilenode being new. */ relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; } /* @@ -3154,7 +3197,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit, } /* - * Likewise, update or drop any new-relfilenode-in-subtransaction hint. + * Likewise, update or drop any new-relfilenode-in-subtransaction. */ if (relation->rd_newRelfilenodeSubid == mySubid) { @@ -3163,6 +3206,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit, else relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; } + + if (relation->rd_firstRelfilenodeSubid == mySubid) + { + if (isCommit) + relation->rd_firstRelfilenodeSubid = parentSubid; + else + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; + } } @@ -3252,6 +3303,7 @@ RelationBuildLocalRelation(const char *relname, /* it's being created in this transaction */ rel->rd_createSubid = GetCurrentSubTransactionId(); rel->rd_newRelfilenodeSubid = InvalidSubTransactionId; + rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId; /* * create a new tuple descriptor from the one passed in. We do this @@ -3549,14 +3601,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence) */ CommandCounterIncrement(); - /* - * Mark the rel as having been given a new relfilenode in the current - * (sub) transaction. This is a hint that can be used to optimize later - * operations on the rel in the same transaction. - */ + RelationAssumeNewRelfilenode(relation); +} + +/* + * RelationAssumeNewRelfilenode + * + * Code that modifies pg_class.reltablespace or pg_class.relfilenode must call + * this. The call shall precede any code that might insert WAL records whose + * replay would modify bytes in the new RelFileNode, and the call shall follow + * any WAL modifying bytes in the prior RelFileNode. See struct RelationData. + * Ideally, call this as near as possible to the CommandCounterIncrement() + * that makes the pg_class change visible (before it or after it); that + * minimizes the chance of future development adding a forbidden WAL insertion + * between RelationAssumeNewRelfilenode() and CommandCounterIncrement(). 
+ */
+void
+RelationAssumeNewRelfilenode(Relation relation)
+{
 	relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+	if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+		relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
 
-	/* Flag relation as needing eoxact cleanup (to remove the hint) */
+	/* Flag relation as needing eoxact cleanup (to clear these fields) */
 	EOXactListAdd(relation);
 }
 
@@ -5642,6 +5709,7 @@ load_relcache_init_file(bool shared)
 		rel->rd_fkeylist = NIL;
 		rel->rd_createSubid = InvalidSubTransactionId;
 		rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+		rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 		rel->rd_amcache = NULL;
 		MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b4911d5ec..34b0e6d5fc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -36,6 +36,7 @@
 #include "access/xlog_internal.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
+#include "catalog/storage.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/trigger.h"
@@ -2661,6 +2662,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_skip_threshold", PGC_USERSET, RESOURCES_DISK,
+			gettext_noop("Minimum size, in kilobytes, of a new file to fsync instead of writing WAL when wal_level = minimal."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&wal_skip_threshold,
+		4096,
+		0, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"max_wal_senders", PGC_POSTMASTER, REPLICATION_SENDING,
 			gettext_noop("Sets the maximum number of simultaneously running WAL sender processes."),
diff --git a/src/bin/psql/input.c b/src/bin/psql/input.c
index 5798e6e7d6..5d6878077e 100644
--- a/src/bin/psql/input.c
+++ b/src/bin/psql/input.c
@@ -163,6 +163,7 @@ pg_send_history(PQExpBuffer history_buf)
 			prev_hist = pg_strdup(s);
 			/* And send it to readline */
 			add_history(s);
+			fprintf(stderr, "H(%s)", s);
 			/* Count lines added to history for use later */
 			history_lines_added++;
 		}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index b89107d09e..ce1ddac01d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -455,6 +455,8 @@ extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
 								BlockNumber origrlink, GistNSN oldnsn,
 								Buffer leftchild, bool markfollowright);
 
+extern XLogRecPtr gistXLogAssignLSN(void);
+
 /* gistget.c */
 extern bool gistgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 gistgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index e44922d915..1eae06c0fb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -26,6 +26,7 @@
 /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */ /* not used anymore */
 /* #define XLOG_GIST_CREATE_INDEX		 0x50 */ /* not used anymore */
 #define XLOG_GIST_PAGE_DELETE		0x60
+#define XLOG_GIST_ASSIGN_LSN		0x70	/* nop, assign a new LSN */
 
 /*
  * Backup Blk 0: updated page.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 858bcb6bc9..22916e8e0e 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -29,7 +29,6 @@ /* "options" flag bits for heap_insert */ -#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL #define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM #define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN #define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL @@ -166,8 +165,6 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid); extern void simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup); -extern void heap_sync(Relation relation); - extern TransactionId heap_compute_xid_horizon_for_tuples(Relation rel, ItemPointerData *items, int nitems); diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h index 8056253916..7f9736e294 100644 --- a/src/include/access/rewriteheap.h +++ b/src/include/access/rewriteheap.h @@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState; extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap, TransactionId OldestXmin, TransactionId FreezeXid, - MultiXactId MultiXactCutoff, bool use_wal); + MultiXactId MultiXactCutoff); extern void end_heap_rewrite(RewriteState state); extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple, HeapTuple newTuple); diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 64022917e2..aca88d0620 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -127,7 +127,7 @@ typedef struct TM_FailureData } TM_FailureData; /* "options" flag bits for table_tuple_insert */ -#define TABLE_INSERT_SKIP_WAL 0x0001 +/* TABLE_INSERT_SKIP_WAL was 0x0001; RelationNeedsWAL() now governs */ #define TABLE_INSERT_SKIP_FSM 0x0002 #define TABLE_INSERT_FROZEN 0x0004 #define TABLE_INSERT_NO_LOGICAL 0x0008 @@ -409,9 +409,8 @@ typedef struct TableAmRoutine /* * Perform operations necessary to complete insertions made via - * tuple_insert and multi_insert with a BulkInsertState specified. This - * may for example be used to flush the relation, when the - * TABLE_INSERT_SKIP_WAL option was used. + * tuple_insert and multi_insert with a BulkInsertState specified. In-tree + * access methods ceased to use this. * * Typically callers of tuple_insert and multi_insert will just pass all * the flags that apply to them, and each AM has to decide which of them @@ -1087,10 +1086,6 @@ table_compute_xid_horizon_for_tuples(Relation rel, * The options bitmask allows the caller to specify options that may change the * behaviour of the AM. The AM will ignore options that it does not support. * - * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't - * need to be logged to WAL, even for a non-temp relation. It is the AMs - * choice whether this optimization is supported. - * * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse * free space in the relation. This can save some cycles when we know the * relation is new and doesn't contain useful amounts of free space. @@ -1309,10 +1304,9 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot, } /* - * Perform operations necessary to complete insertions made via - * tuple_insert and multi_insert with a BulkInsertState specified. This - * e.g. may e.g. used to flush the relation when inserting with - * TABLE_INSERT_SKIP_WAL specified. 
+ * Perform operations necessary to complete insertions made via tuple_insert + * and multi_insert with a BulkInsertState specified. In-tree access methods + * ceased to use this. */ static inline void table_finish_bulk_insert(Relation rel, int options) diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 3579d3f3eb..bf076657e7 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -19,18 +19,23 @@ #include "storage/smgr.h" #include "utils/relcache.h" +/* GUC variables */ +extern int wal_skip_threshold; + extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationTruncate(Relation rel, BlockNumber nblocks); extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst, ForkNumber forkNum, char relpersistence); +extern bool RelFileNodeSkippingWAL(RelFileNode rnode); /* * These functions used to be in storage/smgr/smgr.c, which explains the * naming */ extern void smgrDoPendingDeletes(bool isCommit); +extern void smgrDoPendingSyncs(bool isCommit); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 17b97f7e38..3f85e8c6fe 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -49,6 +49,9 @@ typedef enum /* forward declared, to avoid having to expose buf_internals.h here */ struct WritebackContext; +/* forward declared, to avoid including smgr.h here */ +struct SMgrRelationData; + /* in globals.c ... this duplicates miscadmin.h */ extern PGDLLIMPORT int NBuffers; @@ -192,6 +195,7 @@ extern void FlushRelationBuffers(Relation rel); extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void FlushRelFileNodesAllBuffers(struct SMgrRelationData **smgrs, int nrels); extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 1543d8d870..31a5ecd059 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -89,6 +89,7 @@ extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdounlink(SMgrRelation reln, bool isRedo); +extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index 2752eacc9f..48265cc59d 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -63,22 +63,40 @@ typedef struct RelationData * rd_replidindex) */ bool rd_statvalid; /* is rd_statlist valid? */ - /* + /*---------- * rd_createSubid is the ID of the highest subtransaction the rel has - * survived into; or zero if the rel was not created in the current top - * transaction. This can be now be relied on, whereas previously it could - * be "forgotten" in earlier releases. 
Likewise, rd_newRelfilenodeSubid is - * the ID of the highest subtransaction the relfilenode change has - * survived into, or zero if not changed in the current transaction (or we - * have forgotten changing it). rd_newRelfilenodeSubid can be forgotten - * when a relation has multiple new relfilenodes within a single - * transaction, with one of them occurring in a subsequently aborted - * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t; - * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten + * survived into or zero if the rel was not created in the current top + * transaction. rd_firstRelfilenodeSubid is the ID of the highest + * subtransaction an rd_node change has survived into or zero if rd_node + * matches the value it had at the start of the current top transaction. + * (Rolling back the subtransaction that rd_firstRelfilenodeSubid denotes + * would restore rd_node to the value it had at the start of the current + * top transaction. Rolling back any lower subtransaction would not.) + * Their accuracy is critical to RelationNeedsWAL(). + * + * rd_newRelfilenodeSubid is the ID of the highest subtransaction the + * most-recent relfilenode change has survived into or zero if not changed + * in the current transaction (or we have forgotten changing it). This + * field is accurate when non-zero, but it can be zero when a relation has + * multiple new relfilenodes within a single transaction, with one of them + * occurring in a subsequently aborted subtransaction, e.g. + * BEGIN; + * TRUNCATE t; + * SAVEPOINT save; + * TRUNCATE t; + * ROLLBACK TO save; + * -- rd_newRelfilenodeSubid is now forgotten + * + * These fields are read-only outside relcache.c. Other files trigger + * rd_node changes by updating pg_class.reltablespace and/or + * pg_class.relfilenode. They must call RelationAssumeNewRelfilenode() to + * update these fields. */ SubTransactionId rd_createSubid; /* rel was created in current xact */ - SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in - * current xact */ + SubTransactionId rd_newRelfilenodeSubid; /* highest subxact changing + * rd_node to current value */ + SubTransactionId rd_firstRelfilenodeSubid; /* highest subxact changing + * rd_node to any value */ Form_pg_class rd_rel; /* RELATION tuple */ TupleDesc rd_att; /* tuple descriptor */ @@ -520,9 +538,16 @@ typedef struct ViewOptions /* * RelationNeedsWAL * True if relation needs WAL. - */ -#define RelationNeedsWAL(relation) \ - ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT) + * + * Returns false if wal_level = minimal and this relation is created or + * truncated in the current transaction. See "Skipping WAL for New + * RelFileNode" in src/backend/access/transam/README. 
+ */ +#define RelationNeedsWAL(relation) \ + ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \ + (XLogIsNeeded() || \ + (relation->rd_createSubid == InvalidSubTransactionId && \ + relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId))) /* * RelationUsesLocalBuffers diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h index 90487b2b2e..66e247d028 100644 --- a/src/include/utils/relcache.h +++ b/src/include/utils/relcache.h @@ -106,9 +106,10 @@ extern Relation RelationBuildLocalRelation(const char *relname, char relkind); /* - * Routine to manage assignment of new relfilenode to a relation + * Routines to manage assignment of new relfilenode to a relation */ extern void RelationSetNewRelfilenode(Relation relation, char persistence); +extern void RelationAssumeNewRelfilenode(Relation relation); /* * Routines for flushing/rebuilding relcache entries in various scenarios @@ -121,6 +122,11 @@ extern void RelationCacheInvalidate(void); extern void RelationCloseSmgrByOid(Oid relationId); +#ifdef USE_ASSERT_CHECKING +extern void AssertPendingSyncs_RelationCache(void); +#else +#define AssertPendingSyncs_RelationCache() do {} while (0) +#endif extern void AtEOXact_RelationCache(bool isCommit); extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid, SubTransactionId parentSubid); diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c index 297b8fbd6f..1ddde3ecce 100644 --- a/src/test/regress/pg_regress.c +++ b/src/test/regress/pg_regress.c @@ -2354,6 +2354,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc fputs("log_lock_waits = on\n", pg_conf); fputs("log_temp_files = 128kB\n", pg_conf); fputs("max_prepared_transactions = 2\n", pg_conf); + fputs("wal_level = minimal\n", pg_conf); /* XXX before commit remove */ + fputs("max_wal_senders = 0\n", pg_conf); for (sl = temp_configs; sl != NULL; sl = sl->next) { -- 2.23.0
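For reviewers who want to see the new WAL-skipping path in action, here is a minimal usage sketch (an illustration only, not part of the patch; the table name, file path, and data are made up). With wal_level = minimal, the relfilenode created by CREATE TABLE is new in the transaction, so per the new RelationNeedsWAL() test the COPY is not WAL-logged; at commit the relation is either synced to disk or, if it is smaller than wal_skip_threshold, emitted as WAL instead, as described in the commit message.

    -- postgresql.conf: wal_level = minimal, max_wal_senders = 0
    BEGIN;
    CREATE TABLE t (id int, payload text);       -- relfilenode created in this transaction
    COPY t FROM '/tmp/t.csv' WITH (FORMAT csv);  -- not WAL-logged under wal_level = minimal
    COMMIT;                                      -- syncs t, or WAL-logs it if it is small enough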
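Since the GUC is defined as PGC_USERSET above, it can also be changed per session to compare the two commit-time strategies; the values below are arbitrary examples:

    SET wal_skip_threshold = 0;       -- always sync new files at commit
    SET wal_skip_threshold = '64MB';  -- WAL-log files smaller than 64MB instead

Which value pays off presumably depends on the storage and the workload.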