Hello, Noah. At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota....@gmail.com> wrote in <20190827.154932.250364935.horikyota....@gmail.com> > I'm not sure whether the knob shows apparent performance gain and > whether we can offer the criteria to identify the proper > value. But I'll add this feature with a GUC > effective_io_block_size defaults to 64kB as the threshold in the > next version. (The name and default value are arguable, of course.)
This is a new version of the patch based on the discussion. The differences from v19 are the follows. - Removed the new stuff in two-phase.c. The action on PREPARE TRANSACTION is now taken in PrepareTransaction(). Instead of storing pending syncs in two-phase files, the function immediately syncs all files that can survive the transaction end. (twophase.c, xact.c) - Separate pendingSyncs from pendingDeletes. pendingSyncs gets handled differently from pendingDeletes so it is separated. - Let smgrDoPendingSyncs() to avoid performing fsync on to-be-deleted files. In previous versions the function syncs all recorded files even if it is being deleted. Since we use WAL-logging as the alternative of fsync now, performance gets more significance g than before. Thus this version avoids uesless fsyncs. - Use log_newpage instead of fsync for small tables. As in the discussion up-thread, I think I understand how WAL-logging works better than fsync. smgrDoPendingSync issues log_newpage for all blocks in the table smaller than the GUC variable "effective_io_block_size". I found log_newpage_range() that does exact what is needed here but it requires Relation that is no available there. I removed an assertion in CreateFakeRelcacheEntry so that it works while non-recovery mode. - Rebased and fixed some bugs. I'm trying to measure performance difference on WAL/fsync. By the way, smgrDoPendingDelete is called from CommitTransaction and AbortTransaction directlry, and from AbortSubTransaction via AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is called only from AbortSubTransaction. I think these should be unified either way. Any opinions? CommitTransaction() + msgrDoPendingDelete() AbortTransaction() + msgrDoPendingDelete() AbortSubTransactoin() AtSubAbort_smgr() + msgrDoPendingDelete() # Looking around, the prefixes AtEOact/PreCommit/AtAbort don't # seem to be used keeping a principle. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
>From 83deb772808cdd3afdb44a7630656cc827adfe33 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp> Date: Thu, 11 Oct 2018 10:03:21 +0900 Subject: [PATCH 1/4] TAP test for copy-truncation optimization. --- src/test/recovery/t/018_wal_optimize.pl | 312 ++++++++++++++++++++++++++++++++ 1 file changed, 312 insertions(+) create mode 100644 src/test/recovery/t/018_wal_optimize.pl diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl new file mode 100644 index 0000000000..b041121745 --- /dev/null +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -0,0 +1,312 @@ +# Test WAL replay for optimized TRUNCATE and COPY records +# +# WAL truncation is optimized in some cases with TRUNCATE and COPY queries +# which sometimes interact badly with the other optimizations in line with +# several setting values of wal_level, particularly when using "minimal" or +# "replica". The optimization may be enabled or disabled depending on the +# scenarios dealt here, and should never result in any type of failures or +# data loss. +use strict; +use warnings; + +use PostgresNode; +use TestLib; +use Test::More tests => 26; + +sub check_orphan_relfilenodes +{ + my($node, $test_name) = @_; + + my $db_oid = $node->safe_psql('postgres', + "SELECT oid FROM pg_database WHERE datname = 'postgres'"); + my $prefix = "base/$db_oid/"; + my $filepaths_referenced = $node->safe_psql('postgres', " + SELECT pg_relation_filepath(oid) FROM pg_class + WHERE reltablespace = 0 and relpersistence <> 't' and + pg_relation_filepath(oid) IS NOT NULL;"); + is_deeply([sort(map { "$prefix$_" } + grep(/^[0-9]+$/, + slurp_dir($node->data_dir . "/$prefix")))], + [sort split /\n/, $filepaths_referenced], + $test_name); + return; +} + +# Wrapper routine tunable for wal_level. +sub run_wal_optimize +{ + my $wal_level = shift; + + # Primary needs to have wal_level = minimal here + my $node = get_new_node("node_$wal_level"); + $node->init; + $node->append_conf('postgresql.conf', qq( +wal_level = $wal_level +max_prepared_transactions = 1 +)); + $node->start; + + # Setup + my $tablespace_dir = $node->basedir . '/tablespace_other'; + mkdir ($tablespace_dir); + $tablespace_dir = TestLib::perl2host($tablespace_dir); + $node->safe_psql('postgres', + "CREATE TABLESPACE other LOCATION '$tablespace_dir';"); + + # Test direct truncation optimization. No tuples + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test1 (id serial PRIMARY KEY); + TRUNCATE test1; + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;"); + is($result, qq(0), + "wal_level = $wal_level, optimized truncation with empty table"); + + # Test truncation with inserted tuples within the same transaction. + # Tuples inserted after the truncation should be seen. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test2 (id serial PRIMARY KEY); + INSERT INTO test2 VALUES (DEFAULT); + TRUNCATE test2; + INSERT INTO test2 VALUES (DEFAULT); + COMMIT;"); + + $node->stop('immediate'); + $node->start; + + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;"); + is($result, qq(1), + "wal_level = $wal_level, optimized truncation with inserted table"); + + + # Same for prepared transaction + # Tuples inserted after the truncation should be seen. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test2a (id serial PRIMARY KEY); + INSERT INTO test2a VALUES (DEFAULT); + TRUNCATE test2a; + INSERT INTO test2a VALUES (DEFAULT); + PREPARE TRANSACTION 't'; + COMMIT PREPARED 't';"); + + $node->stop('immediate'); + $node->start; + + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;"); + is($result, qq(1), + "wal_level = $wal_level, optimized truncation with prepared transaction"); + + + # Data file for COPY query in follow-up tests. + my $basedir = $node->basedir; + my $copy_file = "$basedir/copy_data.txt"; + TestLib::append_to_file($copy_file, qq(20000,30000 +20001,30001 +20002,30002)); + + # Test truncation with inserted tuples using COPY. Tuples copied after the + # truncation should be seen. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3 (id serial PRIMARY KEY, id2 int); + INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3; + COPY test3 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;"); + is($result, qq(3), + "wal_level = $wal_level, optimized truncation with copied table"); + + # Like previous test, but rollback SET TABLESPACE in a subtransaction. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3a (id serial PRIMARY KEY, id2 int); + INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3a; + SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s; + COPY test3a FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;"); + is($result, qq(3), + "wal_level = $wal_level, SET TABLESPACE in subtransaction"); + + # in different subtransaction patterns + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int); + INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3a2; + SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; RELEASE s; + COPY test3a2 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;"); + is($result, qq(3), + "wal_level = $wal_level, SET TABLESPACE in subtransaction"); + + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int); + INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000)); + TRUNCATE test3a3; + SAVEPOINT s; + ALTER TABLE test3a3 SET TABLESPACE other; + SAVEPOINT s2; + ALTER TABLE test3a3 SET TABLESPACE pg_default; + ROLLBACK TO s2; + SAVEPOINT s2; + ALTER TABLE test3a3 SET TABLESPACE pg_default; + RELEASE s2; + ROLLBACK TO s; + COPY test3a3 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;"); + is($result, qq(3), + "wal_level = $wal_level, SET TABLESPACE in subtransaction"); + + # UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test3b (id serial PRIMARY KEY, id2 int); + INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000)); + COPY test3b FROM '$copy_file' DELIMITER ','; -- set sync_above + UPDATE test3b SET id2 = id2 + 1; + DELETE FROM test3b; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;"); + is($result, qq(0), + "wal_level = $wal_level, UPDATE of logged page extends relation"); + + # Test truncation with inserted tuples using both INSERT and COPY. Tuples + # inserted after the truncation should be seen. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test4 (id serial PRIMARY KEY, id2 int); + INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000)); + TRUNCATE test4; + INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000); + COPY test4 FROM '$copy_file' DELIMITER ','; + INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000); + COMMIT;"); + + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;"); + is($result, qq(5), + "wal_level = $wal_level, optimized truncation with inserted/copied table"); + + # Test consistency of COPY with INSERT for table created in the same + # transaction. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test5 (id serial PRIMARY KEY, id2 int); + INSERT INTO test5 VALUES (DEFAULT, 1); + COPY test5 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;"); + is($result, qq(4), + "wal_level = $wal_level, replay of optimized copy with inserted table"); + + # Test consistency of COPY that inserts more to the same table using + # triggers. If the INSERTS from the trigger go to the same block data + # is copied to, and the INSERTs are WAL-logged, WAL replay will fail when + # it tries to replay the WAL record but the "before" image doesn't match, + # because not all changes were WAL-logged. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test6 (id serial PRIMARY KEY, id2 text); + CREATE FUNCTION test6_before_row_trig() RETURNS trigger + LANGUAGE plpgsql as \$\$ + BEGIN + IF new.id2 NOT LIKE 'triggered%' THEN + INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2); + END IF; + RETURN NEW; + END; \$\$; + CREATE FUNCTION test6_after_row_trig() RETURNS trigger + LANGUAGE plpgsql as \$\$ + BEGIN + IF new.id2 NOT LIKE 'triggered%' THEN + INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2); + END IF; + RETURN NEW; + END; \$\$; + CREATE TRIGGER test6_before_row_insert + BEFORE INSERT ON test6 + FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig(); + CREATE TRIGGER test6_after_row_insert + AFTER INSERT ON test6 + FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig(); + COPY test6 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;"); + is($result, qq(9), + "wal_level = $wal_level, replay of optimized copy with before trigger"); + + # Test consistency of INSERT, COPY and TRUNCATE in same transaction block + # with TRUNCATE triggers. + $node->safe_psql('postgres', " + BEGIN; + CREATE TABLE test7 (id serial PRIMARY KEY, id2 text); + CREATE FUNCTION test7_before_stat_trig() RETURNS trigger + LANGUAGE plpgsql as \$\$ + BEGIN + INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before'); + RETURN NULL; + END; \$\$; + CREATE FUNCTION test7_after_stat_trig() RETURNS trigger + LANGUAGE plpgsql as \$\$ + BEGIN + INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before'); + RETURN NULL; + END; \$\$; + CREATE TRIGGER test7_before_stat_truncate + BEFORE TRUNCATE ON test7 + FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig(); + CREATE TRIGGER test7_after_stat_truncate + AFTER TRUNCATE ON test7 + FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig(); + INSERT INTO test7 VALUES (DEFAULT, 1); + TRUNCATE test7; + COPY test7 FROM '$copy_file' DELIMITER ','; + COMMIT;"); + $node->stop('immediate'); + $node->start; + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;"); + is($result, qq(4), + "wal_level = $wal_level, replay of optimized copy with before trigger"); + + # Test redo of temp table creation. + $node->safe_psql('postgres', " + CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);"); + $node->stop('immediate'); + $node->start; + + check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains"); + + return; +} + +# Run same test suite for multiple wal_level values. +run_wal_optimize("minimal"); +run_wal_optimize("replica"); -- 2.16.3
>From e0650491226a689120d19060ad5da0917f7d3bd6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp> Date: Wed, 21 Aug 2019 13:57:00 +0900 Subject: [PATCH 2/4] Fix WAL skipping feature WAL-skipping operations mixed with WAL-logged operations can lead to database corruption after a crash. This patch changes the WAL-skipping feature so that no data modification is WAL-logged at all then sync such relations at commit. --- src/backend/access/heap/heapam.c | 4 +- src/backend/access/heap/heapam_handler.c | 22 +-- src/backend/access/heap/rewriteheap.c | 13 +- src/backend/access/transam/xact.c | 17 ++ src/backend/access/transam/xlogutils.c | 11 +- src/backend/catalog/storage.c | 295 +++++++++++++++++++++++++++---- src/backend/commands/cluster.c | 24 +++ src/backend/commands/copy.c | 39 +--- src/backend/commands/createas.c | 5 +- src/backend/commands/matview.c | 4 - src/backend/commands/tablecmds.c | 10 +- src/backend/storage/buffer/bufmgr.c | 41 +++-- src/backend/storage/smgr/md.c | 30 ++++ src/backend/utils/cache/relcache.c | 28 ++- src/backend/utils/misc/guc.c | 13 ++ src/include/access/heapam.h | 1 - src/include/access/rewriteheap.h | 2 +- src/include/access/tableam.h | 40 +---- src/include/catalog/storage.h | 12 ++ src/include/storage/bufmgr.h | 1 + src/include/storage/md.h | 1 + src/include/utils/rel.h | 17 +- 22 files changed, 455 insertions(+), 175 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index cb811d345a..ef18b61c55 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -1936,7 +1936,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, MarkBufferDirty(buffer); /* XLOG stuff */ - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation)) + if (RelationNeedsWAL(relation)) { xl_heap_insert xlrec; xl_heap_header xlhdr; @@ -2119,7 +2119,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, /* currently not needed (thus unsupported) for heap_multi_insert() */ AssertArg(!(options & HEAP_INSERT_NO_LOGICAL)); - needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation); + needwal = RelationNeedsWAL(relation); saveFreeSpace = RelationGetTargetPageFreeSpace(relation, HEAP_DEFAULT_FILLFACTOR); diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index f1ff01e8cb..27f414a361 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -558,18 +558,6 @@ tuple_lock_retry: return result; } -static void -heapam_finish_bulk_insert(Relation relation, int options) -{ - /* - * If we skipped writing WAL, then we need to sync the heap (but not - * indexes since those use WAL anyway / don't go through tableam) - */ - if (options & HEAP_INSERT_SKIP_WAL) - heap_sync(relation); -} - - /* ------------------------------------------------------------------------ * DDL related callbacks for heap AM. * ------------------------------------------------------------------------ @@ -701,7 +689,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, IndexScanDesc indexScan; TableScanDesc tableScan; HeapScanDesc heapScan; - bool use_wal; bool is_system_catalog; Tuplesortstate *tuplesort; TupleDesc oldTupDesc = RelationGetDescr(OldHeap); @@ -716,12 +703,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, is_system_catalog = IsSystemRelation(OldHeap); /* - * We need to log the copied data in WAL iff WAL archiving/streaming is - * enabled AND it's a WAL-logged rel. + * smgr_targblock must be initially invalid if we are to skip WAL logging */ - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap); - - /* use_wal off requires smgr_targblock be initially invalid */ Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber); /* Preallocate values/isnull arrays */ @@ -731,7 +714,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, /* Initialize the rewrite operation */ rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff, - *multi_cutoff, use_wal); + *multi_cutoff); /* Set up sorting if wanted */ @@ -2519,7 +2502,6 @@ static const TableAmRoutine heapam_methods = { .tuple_delete = heapam_tuple_delete, .tuple_update = heapam_tuple_update, .tuple_lock = heapam_tuple_lock, - .finish_bulk_insert = heapam_finish_bulk_insert, .tuple_fetch_row_version = heapam_fetch_row_version, .tuple_get_latest_tid = heap_get_latest_tid, diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index a17508a82f..9e0d7295af 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -144,7 +144,6 @@ typedef struct RewriteStateData Page rs_buffer; /* page currently being built */ BlockNumber rs_blockno; /* block where page will go */ bool rs_buffer_valid; /* T if any tuples in buffer */ - bool rs_use_wal; /* must we WAL-log inserts? */ bool rs_logical_rewrite; /* do we need to do logical rewriting */ TransactionId rs_oldest_xmin; /* oldest xmin used by caller to determine * tuple visibility */ @@ -238,15 +237,13 @@ static void logical_end_heap_rewrite(RewriteState state); * oldest_xmin xid used by the caller to determine which tuples are dead * freeze_xid xid before which tuples will be frozen * cutoff_multi multixact before which multis will be removed - * use_wal should the inserts to the new heap be WAL-logged? * * Returns an opaque RewriteState, allocated in current memory context, * to be used in subsequent calls to the other functions. */ RewriteState begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin, - TransactionId freeze_xid, MultiXactId cutoff_multi, - bool use_wal) + TransactionId freeze_xid, MultiXactId cutoff_multi) { RewriteState state; MemoryContext rw_cxt; @@ -271,7 +268,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm /* new_heap needn't be empty, just locked */ state->rs_blockno = RelationGetNumberOfBlocks(new_heap); state->rs_buffer_valid = false; - state->rs_use_wal = use_wal; state->rs_oldest_xmin = oldest_xmin; state->rs_freeze_xid = freeze_xid; state->rs_cutoff_multi = cutoff_multi; @@ -330,7 +326,7 @@ end_heap_rewrite(RewriteState state) /* Write the last page, if any */ if (state->rs_buffer_valid) { - if (state->rs_use_wal) + if (RelationNeedsWAL(state->rs_new_rel)) log_newpage(&state->rs_new_rel->rd_node, MAIN_FORKNUM, state->rs_blockno, @@ -654,9 +650,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup) { int options = HEAP_INSERT_SKIP_FSM; - if (!state->rs_use_wal) - options |= HEAP_INSERT_SKIP_WAL; - /* * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data * for the TOAST table are not logically decoded. The main heap is @@ -695,7 +688,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup) /* Doesn't fit, so write out the existing page */ /* XLOG stuff */ - if (state->rs_use_wal) + if (RelationNeedsWAL(state->rs_new_rel)) log_newpage(&state->rs_new_rel->rd_node, MAIN_FORKNUM, state->rs_blockno, diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index f594d33e7a..1c4b264947 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2107,6 +2107,13 @@ CommitTransaction(void) */ PreCommit_on_commit_actions(); + /* + * Synchronize files that are created and not WAL-logged during this + * transaction. This must happen before emitting commit record so that we + * don't see committed-but-broken files after a crash. + */ + smgrDoPendingSyncs(true, false); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2339,6 +2346,14 @@ PrepareTransaction(void) */ PreCommit_on_commit_actions(); + /* + * Sync all WAL-skipped files now. Some of them may be deleted at + * transaction end but we don't bother store that information in PREPARE + * record or two-phase files. Like commit, we should sync WAL-skipped + * files before emitting PREPARE record. See CommitTransaction(). + */ + smgrDoPendingSyncs(true, true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2657,6 +2672,7 @@ AbortTransaction(void) */ AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); + smgrDoPendingSyncs(false, false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); @@ -4941,6 +4957,7 @@ AbortSubTransaction(void) s->parent->curTransactionOwner); AtEOSubXact_LargeObject(false, s->subTransactionId, s->parent->subTransactionId); + smgrDoPendingSyncs(false, false); AtSubAbort_Notify(); /* Advertise the fact that we aborted in pg_xact. */ diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 1fc39333f1..ff7dba429a 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -544,6 +544,8 @@ typedef FakeRelCacheEntryData *FakeRelCacheEntry; * fields related to physical storage, like rd_rel, are initialized, so the * fake entry is only usable in low-level operations like ReadBuffer(). * + * This is also used for syncing WAL-skipped files. + * * Caller must free the returned entry with FreeFakeRelcacheEntry(). */ Relation @@ -552,18 +554,19 @@ CreateFakeRelcacheEntry(RelFileNode rnode) FakeRelCacheEntry fakeentry; Relation rel; - Assert(InRecovery); - /* Allocate the Relation struct and all related space in one block. */ fakeentry = palloc0(sizeof(FakeRelCacheEntryData)); rel = (Relation) fakeentry; rel->rd_rel = &fakeentry->pgc; rel->rd_node = rnode; - /* We will never be working with temp rels during recovery */ + /* + * We will never be working with temp rels during recovery or syncing + * WAL-skpped files. + */ rel->rd_backend = InvalidBackendId; - /* It must be a permanent table if we're in recovery. */ + /* It must be a permanent table here */ rel->rd_rel->relpersistence = RELPERSISTENCE_PERMANENT; /* We don't know the name of the relation; use relfilenode instead */ diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 3cc886f7fe..43926ecaba 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -30,9 +30,13 @@ #include "catalog/storage_xlog.h" #include "storage/freespace.h" #include "storage/smgr.h" +#include "utils/hsearch.h" #include "utils/memutils.h" #include "utils/rel.h" +/* GUC variables */ +int effective_io_block_size = 64; /* threshold of WAL-skipping in kilobytes */ + /* * We keep a list of all relations (represented as RelFileNode values) * that have been created or deleted in the current transaction. When @@ -53,16 +57,17 @@ * but I'm being paranoid. */ -typedef struct PendingRelDelete +typedef struct PendingRelOp { RelFileNode relnode; /* relation that may need to be deleted */ BackendId backend; /* InvalidBackendId if not a temp rel */ - bool atCommit; /* T=delete at commit; F=delete at abort */ + bool atCommit; /* T=work at commit; F=work at abort */ int nestLevel; /* xact nesting level of request */ - struct PendingRelDelete *next; /* linked-list link */ -} PendingRelDelete; + struct PendingRelOp *next; /* linked-list link */ +} PendingRelOp; -static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingRelOp *pendingDeletes = NULL; /* head of linked list */ +static PendingRelOp *pendingSyncs = NULL; /* head of linked list */ /* * RelationCreateStorage @@ -78,7 +83,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; + PendingRelOp *pending; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -109,8 +114,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) - MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending = (PendingRelOp *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp)); pending->relnode = rnode; pending->backend = backend; pending->atCommit = false; /* delete if abort */ @@ -118,6 +123,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending->next = pendingDeletes; pendingDeletes = pending; + /* + * When wal_level = minimal, we are going to skip WAL-logging for storage + * of persistent relations created in the current transaction. The + * relation needs to be synced at commit. + */ + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) + { + int nestLevel = GetCurrentTransactionNestLevel(); + + pending = (PendingRelOp *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp)); + pending->relnode = rnode; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = nestLevel; + pending->next = pendingSyncs; + pendingSyncs = pending; + } + return srel; } @@ -147,11 +171,11 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) void RelationDropStorage(Relation rel) { - PendingRelDelete *pending; + PendingRelOp *pending; /* Add the relation to the list of stuff to delete at commit */ - pending = (PendingRelDelete *) - MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending = (PendingRelOp *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelOp)); pending->relnode = rel->rd_node; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ @@ -192,9 +216,9 @@ RelationDropStorage(Relation rel) void RelationPreserveStorage(RelFileNode rnode, bool atCommit) { - PendingRelDelete *pending; - PendingRelDelete *prev; - PendingRelDelete *next; + PendingRelOp *pending; + PendingRelOp *prev; + PendingRelOp *next; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) @@ -399,9 +423,9 @@ void smgrDoPendingDeletes(bool isCommit) { int nestLevel = GetCurrentTransactionNestLevel(); - PendingRelDelete *pending; - PendingRelDelete *prev; - PendingRelDelete *next; + PendingRelOp *pending; + PendingRelOp *prev; + PendingRelOp *next; int nrels = 0, i = 0, maxrels = 0; @@ -462,11 +486,195 @@ smgrDoPendingDeletes(bool isCommit) } /* - * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted. + * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. * - * The return value is the number of relations scheduled for termination. - * *ptr is set to point to a freshly-palloc'd array of RelFileNodes. - * If there are no relations to be deleted, *ptr is set to NULL. + * This should be called before smgrDoPendingDeletes() at every subtransaction + * end. Also this should be called before emitting WAL record so that sync + * failure prevents commit. + * + * If sync_all is true, syncs all files including that are scheduled to be + * deleted. + */ +void +smgrDoPendingSyncs(bool isCommit, bool sync_all) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingRelOp *pending; + PendingRelOp *prev; + PendingRelOp *next; + SMgrRelation srel = NULL; + ForkNumber fork; + BlockNumber nblocks[MAX_FORKNUM + 1]; + BlockNumber total_blocks = 0; + HTAB *delhash = NULL; + + /* Return if nothing to be synced in this nestlevel */ + if (!pendingSyncs || pendingSyncs->nestLevel < nestLevel) + return; + + Assert (pendingSyncs->nestLevel <= nestLevel); + Assert (pendingSyncs->backend == InvalidBackendId); + + /* + * If sync_all is false, pending syncs on the relation that are to be + * deleted in this transaction-end should be ignored. Collect pending + * deletes that will happen in the following call to + * smgrDoPendingDeletes(). + */ + if (!sync_all) + { + for (pending = pendingDeletes; pending != NULL; pending = pending->next) + { + bool found PG_USED_FOR_ASSERTS_ONLY; + + if (pending->nestLevel < pendingSyncs->nestLevel || + pending->atCommit != isCommit) + continue; + + /* create the hash if not yet */ + if (delhash == NULL) + { + HASHCTL hash_ctl; + + memset(&hash_ctl, 0, sizeof(hash_ctl)); + hash_ctl.keysize = sizeof(RelFileNode); + hash_ctl.entrysize = sizeof(RelFileNode); + hash_ctl.hcxt = CurrentMemoryContext; + delhash = + hash_create("pending del temporary hash", 8, &hash_ctl, + HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + } + + (void) hash_search(delhash, (void *) &(pending->relnode), + HASH_ENTER, &found); + Assert(!found); + } + } + + /* Loop over pendingSyncs */ + prev = NULL; + for (pending = pendingSyncs; pending != NULL; pending = next) + { + bool to_be_removed = (!isCommit); /* don't sync if aborted */ + + next = pending->next; + + /* outer-level entries should not be processed yet */ + if (pending->nestLevel < nestLevel) + { + prev = pending; + continue; + } + + /* don't sync relnodes that is being deleted */ + if (delhash && !to_be_removed) + hash_search(delhash, (void *) &pending->relnode, + HASH_FIND, &to_be_removed); + + /* remove the entry if no longer useful */ + if (to_be_removed) + { + if (prev) + prev->next = next; + else + pendingSyncs = next; + pfree(pending); + continue; + } + + /* actual sync happens at the end of top transaction */ + if (nestLevel > 1) + { + prev = pending; + continue; + } + + /* Now the time to sync the rnode */ + srel = smgropen(pendingSyncs->relnode, pendingSyncs->backend); + + /* + * We emit newpage WAL records for smaller size of relations. + * + * Small WAL records have a chance to be emitted at once along with + * other backends' WAL records. We emit WAL records instead of syncing + * for files that are smaller than a certain threshold expecting + * faster commit. The threshold is defined by the GUC + * effective_io_block_size. + */ + for (fork = 0 ; fork <= MAX_FORKNUM ; fork++) + { + /* FSM doesn't need WAL nor sync */ + if (fork != FSM_FORKNUM && smgrexists(srel, fork)) + { + BlockNumber n = smgrnblocks(srel, fork); + + /* we shouldn't come here for unlogged relations */ + Assert(fork != INIT_FORKNUM); + + nblocks[fork] = n; + total_blocks += n; + } + else + nblocks[fork] = InvalidBlockNumber; + } + + /* + * Sync file or emit WAL record for the file according to the total + * size. + */ + if (total_blocks * BLCKSZ >= effective_io_block_size * 1024) + { + /* Flush all buffers then sync the file */ + FlushRelationBuffersWithoutRelcache(srel->smgr_rnode.node, false); + + for (fork = 0; fork <= MAX_FORKNUM; fork++) + { + if (smgrexists(srel, fork)) + smgrimmedsync(srel, fork); + } + } + else + { + /* + * Emit WAL records for all blocks. Some of the blocks might have + * been synced or evicted, but We don't bother checking that. The + * file is small enough. + */ + for (fork = 0 ; fork <= MAX_FORKNUM ; fork++) + { + bool page_std = (fork == MAIN_FORKNUM); + int n = nblocks[fork]; + Relation rel; + + if (!BlockNumberIsValid(n)) + continue; + + /* Emit WAL for the whole file */ + rel = CreateFakeRelcacheEntry(srel->smgr_rnode.node); + log_newpage_range(rel, fork, 0, n, page_std); + FreeFakeRelcacheEntry(rel); + } + } + + /* done remove from list */ + if (prev) + prev->next = next; + else + pendingSyncs = next; + pfree(pending); + } + + if (delhash) + hash_destroy(delhash); +} + +/* + * smgrGetPendingOperations() -- Get a list of non-temp relations to be + * deleted or synced. + * + * The return value is the number of relations scheduled in the given + * list. *ptr is set to point to a freshly-palloc'd array of RelFileNodes. If + * there are no matching relations, *ptr is set to NULL. * * Only non-temporary relations are included in the returned list. This is OK * because the list is used only in contexts where temporary relations don't @@ -475,19 +683,19 @@ smgrDoPendingDeletes(bool isCommit) * (and all temporary files will be zapped if we restart anyway, so no need * for redo to do it also). * - * Note that the list does not include anything scheduled for termination - * by upper-level transactions. + * Note that the list does not include anything scheduled by upper-level + * transactions. */ -int -smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) +static inline int +smgrGetPendingOperations(PendingRelOp *list, bool forCommit, RelFileNode **ptr) { int nestLevel = GetCurrentTransactionNestLevel(); int nrels; RelFileNode *rptr; - PendingRelDelete *pending; + PendingRelOp *pending; nrels = 0; - for (pending = pendingDeletes; pending != NULL; pending = pending->next) + for (pending = list; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit && pending->backend == InvalidBackendId) @@ -500,7 +708,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) } rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode)); *ptr = rptr; - for (pending = pendingDeletes; pending != NULL; pending = pending->next) + for (pending = list; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit && pending->backend == InvalidBackendId) @@ -512,6 +720,20 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) return nrels; } +/* Returns list of pending deletes, see smgrGetPendingOperations for details */ +int +smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) +{ + return smgrGetPendingOperations(pendingDeletes, forCommit, ptr); +} + +/* Returns list of pending syncs, see smgrGetPendingOperations for details */ +int +smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr) +{ + return smgrGetPendingOperations(pendingSyncs, forCommit, ptr); +} + /* * PostPrepare_smgr -- Clean up after a successful PREPARE * @@ -522,8 +744,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) void PostPrepare_smgr(void) { - PendingRelDelete *pending; - PendingRelDelete *next; + PendingRelOp *pending; + PendingRelOp *next; for (pending = pendingDeletes; pending != NULL; pending = next) { @@ -532,25 +754,34 @@ PostPrepare_smgr(void) /* must explicitly free the list entry */ pfree(pending); } + + /* We shouldn't have an entry in pendingSyncs */ + Assert(pendingSyncs == NULL); } /* * AtSubCommit_smgr() --- Take care of subtransaction commit. * - * Reassign all items in the pending-deletes list to the parent transaction. + * Reassign all items in the pending-operations list to the parent transaction. */ void AtSubCommit_smgr(void) { int nestLevel = GetCurrentTransactionNestLevel(); - PendingRelDelete *pending; + PendingRelOp *pending; for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel) pending->nestLevel = nestLevel - 1; } + + for (pending = pendingSyncs; pending != NULL; pending = pending->next) + { + if (pending->nestLevel >= nestLevel) + pending->nestLevel = nestLevel - 1; + } } /* diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 28985a07ec..f665ee8358 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1034,12 +1034,36 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class, if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2)) { + Relation rel1; + Relation rel2; + /* * Normal non-mapped relations: swap relfilenodes, reltablespaces, * relpersistence */ Assert(!target_is_pg_class); + /* Update creation subid hints of relcache */ + rel1 = relation_open(r1, ExclusiveLock); + rel2 = relation_open(r2, ExclusiveLock); + + /* + * New relation's relfilenode is created in the current transaction + * and used as old ralation's new relfilenode. So its + * newRelfilenodeSubid as new relation's createSubid. We don't fix + * rel2 since it would be deleted soon. + */ + Assert(rel2->rd_createSubid != InvalidSubTransactionId); + rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid; + + /* record the first relfilenode change in the current transaction */ + if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId) + rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId(); + + relation_close(rel1, ExclusiveLock); + relation_close(rel2, ExclusiveLock); + + /* swap relfilenodes, reltablespaces, relpersistence */ swaptemp = relform1->relfilenode; relform1->relfilenode = relform2->relfilenode; relform2->relfilenode = swaptemp; diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 3aeef30b28..3ce04f7efc 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2534,9 +2534,6 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo, for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++) ExecDropSingleTupleTableSlot(buffer->slots[i]); - table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc, - miinfo->ti_options); - pfree(buffer); } @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate) * If it does commit, we'll have done the table_finish_bulk_insert() at * the bottom of this routine first. * - * As mentioned in comments in utils/rel.h, the in-same-transaction test - * is not always set correctly, since in rare cases rd_newRelfilenodeSubid - * can be cleared before the end of the transaction. The exact case is - * when a relation sets a new relfilenode twice in same transaction, yet - * the second one fails in an aborted subtransaction, e.g. - * - * BEGIN; - * TRUNCATE t; - * SAVEPOINT save; - * TRUNCATE t; - * ROLLBACK TO save; - * COPY ... - * - * Also, if the target file is new-in-transaction, we assume that checking - * FSM for free space is a waste of time, even if we must use WAL because - * of archiving. This could possibly be wrong, but it's unlikely. - * - * The comments for table_tuple_insert and RelationGetBufferForTuple - * specify that skipping WAL logging is only safe if we ensure that our - * tuples do not go into pages containing tuples from any other - * transactions --- but this must be the case if we have a new table or - * new relfilenode, so we need no additional work to enforce that. + * If the target file is new-in-transaction, we assume that checking FSM + * for free space is a waste of time, even if we must use WAL because of + * archiving. This could possibly be wrong, but it's unlikely. * * We currently don't support this optimization if the COPY target is a * partitioned table as we currently only lazily initialize partition @@ -2762,15 +2740,14 @@ CopyFrom(CopyState cstate) * are not supported as per the description above. *---------- */ - /* createSubid is creation check, newRelfilenodeSubid is truncation check */ + /* + * createSubid is creation check, firstRelfilenodeSubid is truncation and + * cluster check. Partitioned table doesn't have storage. + */ if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) && (cstate->rel->rd_createSubid != InvalidSubTransactionId || - cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)) - { + cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId)) ti_options |= TABLE_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) - ti_options |= TABLE_INSERT_SKIP_WAL; - } /* * Optimize if new relfilenode was created in this subxact or one of its diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index b7d220699f..8a91d946e3 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) * We can skip WAL-logging the insertions, unless PITR or streaming * replication is in use. We can skip the FSM in any case. */ - myState->ti_options = TABLE_INSERT_SKIP_FSM | - (XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL); + myState->ti_options = TABLE_INSERT_SKIP_FSM; myState->bistate = GetBulkInsertState(); /* Not using WAL requires smgr_targblock be initially invalid */ @@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self) FreeBulkInsertState(myState->bistate); - table_finish_bulk_insert(myState->rel, myState->ti_options); - /* close rel, but keep lock until commit */ table_close(myState->rel, NoLock); myState->rel = NULL; diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c index 537d0e8cef..1c854dcebf 100644 --- a/src/backend/commands/matview.c +++ b/src/backend/commands/matview.c @@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo) * replication is in use. We can skip the FSM in any case. */ myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN; - if (!XLogIsNeeded()) - myState->ti_options |= TABLE_INSERT_SKIP_WAL; myState->bistate = GetBulkInsertState(); /* Not using WAL requires smgr_targblock be initially invalid */ @@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self) FreeBulkInsertState(myState->bistate); - table_finish_bulk_insert(myState->transientrel, myState->ti_options); - /* close transientrel, but keep lock until commit */ table_close(myState->transientrel, NoLock); myState->transientrel = NULL; diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index cceefbdd49..2468b178cb 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -4762,9 +4762,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) /* * Prepare a BulkInsertState and options for table_tuple_insert. Because - * we're building a new heap, we can skip WAL-logging and fsync it to disk - * at the end instead (unless WAL-logging is required for archiving or - * streaming replication). The FSM is empty too, so don't bother using it. + * we're building a new heap, the underlying table AM can skip WAL-logging + * and smgr will sync the relation to disk at the end of the current + * transaction instead. The FSM is empty too, so don't bother using it. */ if (newrel) { @@ -4772,8 +4772,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) bistate = GetBulkInsertState(); ti_options = TABLE_INSERT_SKIP_FSM; - if (!XLogIsNeeded()) - ti_options |= TABLE_INSERT_SKIP_WAL; } else { @@ -5058,8 +5056,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode) { FreeBulkInsertState(bistate); - table_finish_bulk_insert(newrel, ti_options); - table_close(newrel, NoLock); } } diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 6f3a402854..55c122b3a7 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -171,6 +171,7 @@ static HTAB *PrivateRefCountHash = NULL; static int32 PrivateRefCountOverflowed = 0; static uint32 PrivateRefCountClock = 0; static PrivateRefCountEntry *ReservedRefCountEntry = NULL; +static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal); static void ReservePrivateRefCountEntry(void); static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer); @@ -675,10 +676,10 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum, * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require * a relcache entry for the relation. * - * NB: At present, this function may only be used on permanent relations, which - * is OK, because we only use it during XLOG replay. If in the future we - * want to use it on temporary or unlogged relations, we could pass additional - * parameters. + * NB: At present, this function may only be used on permanent relations, + * which is OK, because we only use it during XLOG replay and processing + * pending syncs. If in the future we want to use it on temporary or unlogged + * relations, we could pass additional parameters. */ Buffer ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum, @@ -3191,20 +3192,32 @@ PrintPinnedBufs(void) void FlushRelationBuffers(Relation rel) { - int i; - BufferDesc *bufHdr; - - /* Open rel at the smgr level if not already done */ RelationOpenSmgr(rel); - if (RelationUsesLocalBuffers(rel)) + FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel)); +} + +void +FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal) +{ + FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal); +} + +static void +FlushRelationBuffers_common(SMgrRelation smgr, bool islocal) +{ + RelFileNode rnode = smgr->smgr_rnode.node; + int i; + BufferDesc *bufHdr; + + if (islocal) { for (i = 0; i < NLocBuffer; i++) { uint32 buf_state; bufHdr = GetLocalBufferDescriptor(i); - if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) && + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && ((buf_state = pg_atomic_read_u32(&bufHdr->state)) & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { @@ -3221,7 +3234,7 @@ FlushRelationBuffers(Relation rel) PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); - smgrwrite(rel->rd_smgr, + smgrwrite(smgr, bufHdr->tag.forkNum, bufHdr->tag.blockNum, localpage, @@ -3251,18 +3264,18 @@ FlushRelationBuffers(Relation rel) * As in DropRelFileNodeBuffers, an unlocked precheck should be safe * and saves some cycles. */ - if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node)) + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode)) continue; ReservePrivateRefCountEntry(); buf_state = LockBufHdr(bufHdr); - if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) && + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) { PinBuffer_Locked(bufHdr); LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED); - FlushBuffer(bufHdr, rel->rd_smgr); + FlushBuffer(bufHdr, smgr); LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); UnpinBuffer(bufHdr, true); } diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 07f3c93d3f..514c6098e6 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -994,6 +994,36 @@ ForgetDatabaseSyncRequests(Oid dbid) RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ ); } +/* + * SyncRelationFiles -- sync files of all given relations + * + * This function is assumed to be called only when skipping WAL-logging and + * emits no xlog records. + */ +void +SyncRelationFiles(RelFileNode *syncrels, int nsyncrels) +{ + int i; + + for (i = 0; i < nsyncrels; i++) + { + SMgrRelation srel; + ForkNumber fork; + + /* sync all existing forks of the relation */ + FlushRelationBuffersWithoutRelcache(syncrels[i], false); + srel = smgropen(syncrels[i], InvalidBackendId); + + for (fork = 0; fork <= MAX_FORKNUM; fork++) + { + if (smgrexists(srel, fork)) + smgrimmedsync(srel, fork); + } + + smgrclose(srel); + } +} + /* * DropRelationFiles -- drop files of all given relations */ diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 248860758c..147babb6b5 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -1096,6 +1096,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt) relation->rd_isnailed = false; relation->rd_createSubid = InvalidSubTransactionId; relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; switch (relation->rd_rel->relpersistence) { case RELPERSISTENCE_UNLOGGED: @@ -1829,6 +1830,7 @@ formrdesc(const char *relationName, Oid relationReltype, relation->rd_isnailed = true; relation->rd_createSubid = InvalidSubTransactionId; relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; relation->rd_backend = InvalidBackendId; relation->rd_islocaltemp = false; @@ -2094,7 +2096,7 @@ RelationClose(Relation relation) #ifdef RELCACHE_FORCE_RELEASE if (RelationHasReferenceCountZero(relation) && relation->rd_createSubid == InvalidSubTransactionId && - relation->rd_newRelfilenodeSubid == InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId) RelationClearRelation(relation, false); #endif } @@ -2510,8 +2512,8 @@ RelationClearRelation(Relation relation, bool rebuild) * problem. * * When rebuilding an open relcache entry, we must preserve ref count, - * rd_createSubid/rd_newRelfilenodeSubid, and rd_toastoid state. Also - * attempt to preserve the pg_class entry (rd_rel), tupledesc, + * rd_createSubid/rd_new/firstRelfilenodeSubid, and rd_toastoid state. + * Also attempt to preserve the pg_class entry (rd_rel), tupledesc, * rewrite-rule, partition key, and partition descriptor substructures * in place, because various places assume that these structures won't * move while they are working with an open relcache entry. (Note: @@ -2600,6 +2602,7 @@ RelationClearRelation(Relation relation, bool rebuild) /* creation sub-XIDs must be preserved */ SWAPFIELD(SubTransactionId, rd_createSubid); SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid); + SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid); /* un-swap rd_rel pointers, swap contents instead */ SWAPFIELD(Form_pg_class, rd_rel); /* ... but actually, we don't have to update newrel->rd_rel */ @@ -2667,7 +2670,7 @@ static void RelationFlushRelation(Relation relation) { if (relation->rd_createSubid != InvalidSubTransactionId || - relation->rd_newRelfilenodeSubid != InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId) { /* * New relcache entries are always rebuilt, not flushed; else we'd @@ -2807,7 +2810,7 @@ RelationCacheInvalidate(void) * pending invalidations. */ if (relation->rd_createSubid != InvalidSubTransactionId || - relation->rd_newRelfilenodeSubid != InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId) continue; relcacheInvalsReceived++; @@ -3064,6 +3067,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit) * Likewise, reset the hint about the relfilenode being new. */ relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; } /* @@ -3155,7 +3159,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit, } /* - * Likewise, update or drop any new-relfilenode-in-subtransaction hint. + * Likewise, update or drop any new-relfilenode-in-subtransaction hints. */ if (relation->rd_newRelfilenodeSubid == mySubid) { @@ -3164,6 +3168,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit, else relation->rd_newRelfilenodeSubid = InvalidSubTransactionId; } + + if (relation->rd_firstRelfilenodeSubid == mySubid) + { + if (isCommit) + relation->rd_firstRelfilenodeSubid = parentSubid; + else + relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId; + } } @@ -3253,6 +3265,7 @@ RelationBuildLocalRelation(const char *relname, /* it's being created in this transaction */ rel->rd_createSubid = GetCurrentSubTransactionId(); rel->rd_newRelfilenodeSubid = InvalidSubTransactionId; + rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId; /* * create a new tuple descriptor from the one passed in. We do this @@ -3556,6 +3569,8 @@ RelationSetNewRelfilenode(Relation relation, char persistence) * operations on the rel in the same transaction. */ relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId(); + if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId) + relation->rd_firstRelfilenodeSubid = relation->rd_newRelfilenodeSubid; /* Flag relation as needing eoxact cleanup (to remove the hint) */ EOXactListAdd(relation); @@ -5592,6 +5607,7 @@ load_relcache_init_file(bool shared) rel->rd_fkeylist = NIL; rel->rd_createSubid = InvalidSubTransactionId; rel->rd_newRelfilenodeSubid = InvalidSubTransactionId; + rel->rd_firstRelfilenodeSubid = InvalidSubTransactionId; rel->rd_amcache = NULL; MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info)); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 90ffd89339..1e4fc256fc 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -36,6 +36,7 @@ #include "access/xlog_internal.h" #include "catalog/namespace.h" #include "catalog/pg_authid.h" +#include "catalog/storage.h" #include "commands/async.h" #include "commands/prepare.h" #include "commands/user.h" @@ -2774,6 +2775,18 @@ static struct config_int ConfigureNamesInt[] = check_effective_io_concurrency, assign_effective_io_concurrency, NULL }, + { + {"effective_io_block_size", PGC_USERSET, RESOURCES_DISK, + gettext_noop("Size of file that can be fsync'ed in the minimum required duration."), + gettext_noop("For rotating magnetic disks, it is around the size of a track or sylinder."), + GUC_UNIT_KB + }, + &effective_io_block_size, + 64, + 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + { {"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS, gettext_noop("Number of pages after which previously performed writes are flushed to disk."), diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 858bcb6bc9..80c2e1bafc 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -29,7 +29,6 @@ /* "options" flag bits for heap_insert */ -#define HEAP_INSERT_SKIP_WAL TABLE_INSERT_SKIP_WAL #define HEAP_INSERT_SKIP_FSM TABLE_INSERT_SKIP_FSM #define HEAP_INSERT_FROZEN TABLE_INSERT_FROZEN #define HEAP_INSERT_NO_LOGICAL TABLE_INSERT_NO_LOGICAL diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h index 8056253916..7f9736e294 100644 --- a/src/include/access/rewriteheap.h +++ b/src/include/access/rewriteheap.h @@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState; extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap, TransactionId OldestXmin, TransactionId FreezeXid, - MultiXactId MultiXactCutoff, bool use_wal); + MultiXactId MultiXactCutoff); extern void end_heap_rewrite(RewriteState state); extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple, HeapTuple newTuple); diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 7f81703b78..b652cd6cef 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -407,22 +407,6 @@ typedef struct TableAmRoutine uint8 flags, TM_FailureData *tmfd); - /* - * Perform operations necessary to complete insertions made via - * tuple_insert and multi_insert with a BulkInsertState specified. This - * may for example be used to flush the relation, when the - * TABLE_INSERT_SKIP_WAL option was used. - * - * Typically callers of tuple_insert and multi_insert will just pass all - * the flags that apply to them, and each AM has to decide which of them - * make sense for it, and then only take actions in finish_bulk_insert for - * those flags, and ignore others. - * - * Optional callback. - */ - void (*finish_bulk_insert) (Relation rel, int options); - - /* ------------------------------------------------------------------------ * DDL related functionality. * ------------------------------------------------------------------------ @@ -1087,10 +1071,6 @@ table_compute_xid_horizon_for_tuples(Relation rel, * The options bitmask allows the caller to specify options that may change the * behaviour of the AM. The AM will ignore options that it does not support. * - * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't - * need to be logged to WAL, even for a non-temp relation. It is the AMs - * choice whether this optimization is supported. - * * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse * free space in the relation. This can save some cycles when we know the * relation is new and doesn't contain useful amounts of free space. @@ -1112,8 +1092,7 @@ table_compute_xid_horizon_for_tuples(Relation rel, * heap's TOAST table, too, if the tuple requires any out-of-line data. * * The BulkInsertState object (if any; bistate can be NULL for default - * behavior) is also just passed through to RelationGetBufferForTuple. If - * `bistate` is provided, table_finish_bulk_insert() needs to be called. + * behavior) is also just passed through to RelationGetBufferForTuple. * * On return the slot's tts_tid and tts_tableOid are updated to reflect the * insertion. But note that any toasting of fields within the slot is NOT @@ -1248,6 +1227,8 @@ table_tuple_delete(Relation rel, ItemPointer tid, CommandId cid, * update was done. However, any TOAST changes in the new tuple's * data are not reflected into *newtup. * + * See table_insert about skipping WAL-logging feature. + * * In the failure cases, the routine fills *tmfd with the tuple's t_ctid, * t_xmax, and, if possible, t_cmax. See comments for struct TM_FailureData * for additional info. @@ -1308,21 +1289,6 @@ table_tuple_lock(Relation rel, ItemPointer tid, Snapshot snapshot, flags, tmfd); } -/* - * Perform operations necessary to complete insertions made via - * tuple_insert and multi_insert with a BulkInsertState specified. This - * e.g. may e.g. used to flush the relation when inserting with - * TABLE_INSERT_SKIP_WAL specified. - */ -static inline void -table_finish_bulk_insert(Relation rel, int options) -{ - /* optional callback */ - if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert) - rel->rd_tableam->finish_bulk_insert(rel, options); -} - - /* ------------------------------------------------------------------------ * DDL related functionality. * ------------------------------------------------------------------------ diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 3579d3f3eb..1c1cf5d252 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -19,6 +19,16 @@ #include "storage/smgr.h" #include "utils/relcache.h" +/* enum for operation type of PendingDelete entries */ +typedef enum PendingOpType +{ + PENDING_DELETE, + PENDING_SYNC +} PendingOpType; + +/* GUC variables */ +extern int effective_io_block_size; /* threshold for WAL-skipping */ + extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); @@ -31,7 +41,9 @@ extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst, * naming */ extern void smgrDoPendingDeletes(bool isCommit); +extern void smgrDoPendingSyncs(bool isCommit, bool sync_all); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern int smgrGetPendingSyncs(bool forCommit, RelFileNode **ptr); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 509f4b7ef1..ace5f5a2ae 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -189,6 +189,7 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation, ForkNumber forkNum); extern void FlushOneBuffer(Buffer buffer); extern void FlushRelationBuffers(Relation rel); +extern void FlushRelationBuffersWithoutRelcache(RelFileNode rnode, bool islocal); extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum, BlockNumber firstDelBlock); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index c0f05e23ff..2bb2947bdb 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -42,6 +42,7 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); +extern void SyncRelationFiles(RelFileNode *syncrels, int nsyncrels); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h index c5d36680a2..f372dc2086 100644 --- a/src/include/utils/rel.h +++ b/src/include/utils/rel.h @@ -75,10 +75,17 @@ typedef struct RelationData * transaction, with one of them occurring in a subsequently aborted * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t; * ROLLBACK TO save; -- rd_newRelfilenodeSubid is now forgotten + * rd_firstRelfilenodeSubid is the ID of the first subtransaction the + * relfilenode change has took place in the current transaction. Unlike + * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID + * means that the currently active relfilenode is transaction-local and we + * sync the relation at commit instead of WAL-logging. */ SubTransactionId rd_createSubid; /* rel was created in current xact */ SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in * current xact */ + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned + * first in current xact */ Form_pg_class rd_rel; /* RELATION tuple */ TupleDesc rd_att; /* tuple descriptor */ @@ -514,9 +521,15 @@ typedef struct ViewOptions /* * RelationNeedsWAL * True if relation needs WAL. + * + * Returns false if wal_level = minimal and this relation is created or + * truncated in the current transaction. */ -#define RelationNeedsWAL(relation) \ - ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT) +#define RelationNeedsWAL(relation) \ + ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \ + (XLogIsNeeded() || \ + (relation->rd_createSubid == InvalidSubTransactionId && \ + relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId))) /* * RelationUsesLocalBuffers -- 2.16.3
>From cce02653f263211b1c777c3aac4d25423035a68d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp> Date: Wed, 28 Aug 2019 14:05:30 +0900 Subject: [PATCH 3/4] Documentation for effective_io_block_size --- doc/src/sgml/config.sgml | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 89284dc5c0..2d38d897ca 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1832,6 +1832,27 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-effective-io-block-size" xreflabel="effective_io_block_size"> + <term><varname>effective_io_block_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>effective_io_block_size</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the expected maximum size of a file for which <function>fsync</function> returns in the minimum required duration. It is approximately the size of a track or sylinder for magnetic disks. + The value is specified in kilobytes and the default is <literal>64</literal> kilobytes. + </para> + <para> + When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>, + WAL-logging is skipped for tables created in-trasaction. If a table + is smaller than that size at commit, it is WAL-logged instead of + issueing <function>fsync</function> on it. + + </para> + </listitem> + </varlistentry> + </variablelist> </sect2> -- 2.16.3
>From b31533b895a3b239339aeb466d6f1abc0a1a4669 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp> Date: Wed, 28 Aug 2019 14:12:18 +0900 Subject: [PATCH 4/4] Additional test for new GUC setting. This patchset adds new GUC variable effective_io_block_size that controls wheter WAL-skipped tables are finally WAL-logged or fcync'ed. All of the TAP test performs WAL-logging so this adds an item that performs file sync. --- src/test/recovery/t/018_wal_optimize.pl | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl index b041121745..95063ab131 100644 --- a/src/test/recovery/t/018_wal_optimize.pl +++ b/src/test/recovery/t/018_wal_optimize.pl @@ -11,7 +11,7 @@ use warnings; use PostgresNode; use TestLib; -use Test::More tests => 26; +use Test::More tests => 28; sub check_orphan_relfilenodes { @@ -102,7 +102,23 @@ max_prepared_transactions = 1 $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2a;"); is($result, qq(1), "wal_level = $wal_level, optimized truncation with prepared transaction"); + # Same for file sync mode + # Tuples inserted after the truncation should be seen. + $node->safe_psql('postgres', " + SET effective_io_block_size to 0; + BEGIN; + CREATE TABLE test2b (id serial PRIMARY KEY); + INSERT INTO test2b VALUES (DEFAULT); + TRUNCATE test2b; + INSERT INTO test2b VALUES (DEFAULT); + COMMIT;"); + $node->stop('immediate'); + $node->start; + + $result = $node->safe_psql('postgres', "SELECT count(*) FROM test2b;"); + is($result, qq(1), + "wal_level = $wal_level, optimized truncation with file-sync"); # Data file for COPY query in follow-up tests. my $basedir = $node->basedir; -- 2.16.3