While reviewing the patches in http://postgr.es/m/20191126.213752.2132434859202124793.horikyota....@gmail.com I noticed three related problems. The defects motivating that other thread are specific to wal_level=minimal. The related problems are not specific to any wal_level, hence my starting a separate thread.
1. We don't sync unlogged tables after skipFsync=true operations.

smgrwrite() explains the contract for skipFsync=true.  Concretely, it causes
md.c to skip enqueueing the relation to be synced during the next checkpoint.
Three places in the tree call smgrextend(..., skipFsync=true) and/or
smgrwrite(..., skipFsync=true):

  rewriteheap.c (CLUSTER, VACUUM FULL)
  _bt_blwritepage() (CREATE INDEX)
  RelationCopyStorage() (ALTER TABLE SET TABLESPACE)

They use logic like this to decide when to sync:

    /*
     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
     * to ensure that the toast table gets fsync'd too.  (For a temp or
     * unlogged rel we don't care since the data will be gone after a crash
     * anyway.)
     *
     * It's obvious that we must do this when not WAL-logging the copy.  It's
     * less obvious that we have to do it even if we did WAL-log the copied
     * pages.  The reason is that since we're copying outside shared buffers, a
     * CHECKPOINT occurring during the copy has no way to flush the previously
     * written data to disk (indeed it won't know the new rel even exists).  A
     * crash later on would replay WAL from the checkpoint, therefore it
     * wouldn't replay our earlier WAL entries.  If we do not fsync those pages
     * here, they might still not be on disk when the crash occurs.
     */
    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
        smgrimmedsync(dst, forkNum);

However, the reasoning about unlogged rels is incomplete.  Normally, we sync
unlogged rels during each checkpoint after we wrote out a dirty buffer, just
as we do for permanent rels.  (It would be enough to sync them during the
shutdown checkpoint.  That would require a data structure to track unlogged
relation files dirtied since StartupXLOG().)  However, due to the code above,
we miss syncing them if one of these DDL operations is the last operation
changing the unlogged rel before a shutdown checkpoint.  I've attached a
test-only patch demonstrating the problem.  This can cause unlogged tables to
have invalid contents if an OS crash happens with PostgreSQL stopped.  I
think the fix is simple: test "relpersistence != RELPERSISTENCE_TEMP"
instead.


2. RelationTruncate() needs defense against concurrent checkpoints.

In RelationTruncate(), nothing stops a checkpoint from starting and
completing after XLogInsert(... XLOG_SMGR_TRUNCATE) and before
smgrtruncate().  That checkpoint could move the redo pointer past the
XLOG_SMGR_TRUNCATE record, making the outcome equivalent to having never
written the WAL record.  If the OS crashes before the first
post-smgrtruncate() checkpoint, the filesystem may forget some or all
ftruncate() calls.  The symptom would be unexpected tuples in the relation.
Example test procedure:

  BEGIN;
  CREATE TABLE t (c) AS SELECT 1;
  -- set breakpoint at XLogFlush()
  TRUNCATE t;
  -- hit breakpoint
  -- while stopped at breakpoint, issue CHECKPOINT, then release debugger
  COMMIT;
  -- hard crash forgets some ftruncate()
  pg_ctl -w start  # REDO does not reissue ftruncate()

The fix is to set delayChkpt before writing WAL and clear it after
smgrtruncate(), like we do in EndPrepare().
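To be concrete, the bracket I'm imagining looks roughly like this.  It's an
untested sketch, not a patch: it assumes MyPgXact->delayChkpt is the right
flag to use (as EndPrepare() does), and the other identifiers merely stand in
for code RelationTruncate() already has.

    /* sketch only: hold off checkpoints across WAL write + ftruncate */
    MyPgXact->delayChkpt = true;

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  lsn;

        /* existing code: build and write the XLOG_SMGR_TRUNCATE record */
        lsn = XLogInsert(RM_SMGR_ID,
                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);

        /* existing code: flush so the truncation can't beat its WAL to disk */
        XLogFlush(lsn);
    }

    /* existing code: the physical truncation */
    smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

    MyPgXact->delayChkpt = false;   /* checkpoints may proceed again */

If I'm reading CreateCheckPoint() right, that makes the checkpointer wait out
the window, the same way it already does for prepared-transaction records.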
3. smgrtruncate()/mdtruncate() is not qualified to be called in recovery.

smgr_redo() calls smgrtruncate() to replay XLOG_SMGR_TRUNCATE records.
However, mdtruncate() relies on the following md.c invariant:

 *  On disk, a relation must consist of consecutively numbered segment
 *  files in the pattern
 *      -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *      -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *      -- Optionally, any number of inactive segments of size 0 blocks.

That invariant does not hold before reachedConsistency.  Suppose the unclean
shutdown happened after the original (non-recovery) mdtruncate() and before
anything synced the truncated segments.  The OS might remember the
ftruncate() of segment zero and lose the ftruncate() of segment one,
upsetting the invariant.  The symptom would again be unexpected tuples in the
relation.  My preferred fix is to make mdtruncate() truncate each inactive
segment, stopping when opening a segment fails with ENOENT, like
mdunlinkfork() does.  An alternative would be to add a
last_segment_truncated field to xl_smgr_truncate, then truncate up to that
segment.
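For the first option, the loop I have in mind would look roughly like the
following.  This is illustrative only, written against plain fd.c/POSIX calls
rather than md.c's segment machinery; "path" and "lastactiveseg" are
stand-ins for values mdtruncate() already has in hand.

    int         segno;

    /*
     * Sketch only.  After truncating the last active segment, keep walking
     * forward and zero any stale inactive segments, stopping at ENOENT the
     * way mdunlinkfork() stops unlinking.
     */
    for (segno = lastactiveseg + 1;; segno++)
    {
        char        segpath[MAXPGPATH];
        int         fd;

        snprintf(segpath, sizeof(segpath), "%s.%d", path, segno);

        fd = OpenTransientFile(segpath, O_RDWR | PG_BINARY);
        if (fd < 0)
        {
            if (errno == ENOENT)
                break;          /* no more segments; invariant restored */
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not open file \"%s\": %m", segpath)));
        }

        if (ftruncate(fd, 0) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not truncate file \"%s\": %m", segpath)));

        CloseTransientFile(fd);
    }

A real patch would presumably also need to queue fsync requests for each
segment it shortens, the way the existing mdtruncate() path does.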
Does anyone see a way to improve on those proposed fixes?

Thanks,
nm

diff --git a/src/test/recovery/t/019_ddl.pl b/src/test/recovery/t/019_ddl.pl
new file mode 100644
index 0000000..657febe
--- /dev/null
+++ b/src/test/recovery/t/019_ddl.pl
@@ -0,0 +1,83 @@
+#
+# Test that DDL arranges for sufficient syncs.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use Test::More;
+use TestLib;
+
+plan tests => 6;
+
+my $node = PostgresNode->get_new_node('solo');
+$node->init();
+# Avoid time-based checkpoints, so checkpoints happen when we choose.
+$node->append_conf('postgresql.conf', 'checkpoint_timeout = 60min');
+# If fsync were disabled, the server would skip relevant code.
+$node->append_conf('postgresql.conf', 'fsync = on');
+# Log the "checkpoint sync:" messages.
+$node->append_conf('postgresql.conf', 'log_checkpoints = on');
+$node->append_conf('postgresql.conf', 'log_min_messages = debug1');
+# Test at the default wal_level.
+$node->append_conf('postgresql.conf', 'wal_level = replica');
+
+# UNLOGGED relations.  The variants with an INSERT after the DDL are the easy
+# case, because the INSERT arranges for a sync.  The others rely on DDL making
+# correct arrangements.
+$node->start;
+my $tablespace_dir = $node->basedir . '/tablespace_other';
+mkdir($tablespace_dir);
+$tablespace_dir = TestLib::perl2host($tablespace_dir);
+$node->safe_psql('postgres',
+	"CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+my $tablespace_insert_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE spc_ins (c int);
+INSERT INTO spc_ins VALUES (1);
+ALTER TABLE spc_ins SET TABLESPACE other;
+INSERT INTO spc_ins VALUES (2);
+SELECT pg_relation_filepath('spc_ins');
+EOSQL
+my $tablespace_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE spc (c int);
+INSERT INTO spc VALUES (1);
+ALTER TABLE spc SET TABLESPACE other;
+SELECT pg_relation_filepath('spc');
+EOSQL
+my $cluster_insert_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE clu_ins (c int PRIMARY KEY);
+INSERT INTO clu_ins VALUES (1);
+CLUSTER clu_ins USING clu_ins_pkey;
+INSERT INTO clu_ins VALUES (2);
+SELECT pg_relation_filepath('clu_ins');
+EOSQL
+my $cluster_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE clu (c int PRIMARY KEY);
+INSERT INTO clu VALUES (1);
+CLUSTER clu USING clu_pkey;
+SELECT pg_relation_filepath('clu');
+EOSQL
+my $index_insert_filepath = $node->safe_psql('postgres', <<EOSQL);
+SELECT pg_relation_filepath('clu_ins_pkey');
+EOSQL
+my $index_filepath = $node->safe_psql('postgres', <<EOSQL);
+SELECT pg_relation_filepath('clu_pkey');
+EOSQL
+$node->stop;
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$tablespace_insert_filepath/,
+	'sync unlogged table after SET TABLESPACE+INSERT');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$tablespace_filepath/,
+	'sync unlogged table after SET TABLESPACE');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$cluster_insert_filepath/,
+	'sync unlogged table after CLUSTER+INSERT');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$cluster_filepath/,
+	'sync unlogged table after CLUSTER');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$index_insert_filepath/,
+	'sync unlogged index after CREATE+INSERT');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$index_filepath/,
+	'sync unlogged index after CREATE');