While reviewing the patches in http://postgr.es/m/20191126.213752.2132434859202124793.horikyota....@gmail.com I noticed three related problems. The defects motivating that other thread are specific to wal_level=minimal. The related problems are not specific to any wal_level, hence my starting a separate thread.
1. We don't sync unlogged tables after skipFsync=true operations.

smgrwrite() explains the contract for skipFsync=true.  Concretely, it causes
md.c to skip enqueueing the relation to be synced during the next checkpoint.
Three places in the tree call smgrextend(..., skipFsync=true) and/or
smgrwrite(..., skipFsync=true):

  rewriteheap.c (CLUSTER, VACUUM FULL)
  _bt_blwritepage() (CREATE INDEX)
  RelationCopyStorage() (ALTER TABLE SET TABLESPACE)

They use logic like this to decide when to sync:

    /*
     * If the rel is WAL-logged, must fsync before commit.  We use heap_sync
     * to ensure that the toast table gets fsync'd too.  (For a temp or
     * unlogged rel we don't care since the data will be gone after a crash
     * anyway.)
     *
     * It's obvious that we must do this when not WAL-logging the copy.  It's
     * less obvious that we have to do it even if we did WAL-log the copied
     * pages.  The reason is that since we're copying outside shared buffers, a
     * CHECKPOINT occurring during the copy has no way to flush the previously
     * written data to disk (indeed it won't know the new rel even exists).  A
     * crash later on would replay WAL from the checkpoint, therefore it
     * wouldn't replay our earlier WAL entries.  If we do not fsync those pages
     * here, they might still not be on disk when the crash occurs.
     */
    if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
        smgrimmedsync(dst, forkNum);

However, the reasoning about unlogged rels is incomplete.  Normally, we sync
unlogged rels during each checkpoint after we wrote out a dirty buffer, just
as we do for permanent rels.  (It would be enough to sync them during the
shutdown checkpoint.  That would require a data structure to track unlogged
relation files dirtied since StartupXLOG().)  However, due to the code above,
we miss syncing them if one of these DDL operations is the last operation
changing the unlogged rel before a shutdown checkpoint.  I've attached a
test-only patch demonstrating the problem.  This can cause unlogged tables to
have invalid contents if an OS crash happens with PostgreSQL stopped.  I
think the fix is simple: test "relpersistence != RELPERSISTENCE_TEMP"
instead.


2. RelationTruncate() needs defense against concurrent checkpoints.

In RelationTruncate(), nothing stops a checkpoint from starting and
completing after XLogInsert(... XLOG_SMGR_TRUNCATE) and before
smgrtruncate().  That checkpoint could move the redo pointer past the
XLOG_SMGR_TRUNCATE record, making the outcome equivalent to having never
written the WAL record.  If the OS crashes before the first
post-smgrtruncate() checkpoint, the filesystem may forget some or all
ftruncate() calls.  The symptom would be unexpected tuples in the relation.
Example test procedure:

  BEGIN;
  CREATE TABLE t (c) AS SELECT 1;
  -- set breakpoint at XLogFlush()
  TRUNCATE t;
  -- hit breakpoint
  -- while stopped at breakpoint, issue CHECKPOINT, then release debugger
  COMMIT;
  -- hard crash forgets some ftruncate()
  pg_ctl -w start  # REDO does not reissue ftruncate()

The fix is to set delayChkpt before writing WAL and clear it after
smgrtruncate(), like we do in EndPrepare().
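To be concrete, the bracket I'm imagining looks roughly like this.  It's an
untested sketch, not a patch: it assumes MyPgXact->delayChkpt is the right
flag to use (as EndPrepare() does), and the other identifiers merely stand in
for code RelationTruncate() already has.

    /* sketch only: hold off checkpoints across WAL write + ftruncate */
    MyPgXact->delayChkpt = true;

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  lsn;

        /* existing code: build and write the XLOG_SMGR_TRUNCATE record */
        lsn = XLogInsert(RM_SMGR_ID,
                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);

        /* existing code: flush so the truncation can't beat its WAL to disk */
        XLogFlush(lsn);
    }

    /* existing code: the physical truncation */
    smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

    MyPgXact->delayChkpt = false;   /* checkpoints may proceed again */

If I'm reading CreateCheckPoint() right, that makes the checkpointer wait out
the window, the same way it already does for prepared-transaction records.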
3. smgrtruncate()/mdtruncate() is not qualified to be called in recovery.

smgr_redo() calls smgrtruncate() to replay XLOG_SMGR_TRUNCATE records.
However, mdtruncate() relies on the following md.c invariant:

 *  On disk, a relation must consist of consecutively numbered segment
 *  files in the pattern
 *      -- Zero or more full segments of exactly RELSEG_SIZE blocks each
 *      -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
 *      -- Optionally, any number of inactive segments of size 0 blocks.

That invariant does not hold before reachedConsistency.  Suppose the unclean
shutdown happened after the original (non-recovery) mdtruncate() and before
anything synced the truncated segments.  The OS might remember the
ftruncate() of segment zero and lose the ftruncate() of segment one,
upsetting the invariant.  The symptom would again be unexpected tuples in the
relation.  My preferred fix is to make mdtruncate() truncate each inactive
segment, stopping when opening a segment fails with ENOENT, like
mdunlinkfork() does.  An alternative would be to add a
last_segment_truncated field to xl_smgr_truncate, then truncate up to that
segment.
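For the first option, the loop I have in mind would look roughly like the
following.  This is illustrative only, written against plain fd.c/POSIX calls
rather than md.c's segment machinery; "path" and "lastactiveseg" are
stand-ins for values mdtruncate() already has in hand.

    int         segno;

    /*
     * Sketch only.  After truncating the last active segment, keep walking
     * forward and zero any stale inactive segments, stopping at ENOENT the
     * way mdunlinkfork() stops unlinking.
     */
    for (segno = lastactiveseg + 1;; segno++)
    {
        char        segpath[MAXPGPATH];
        int         fd;

        snprintf(segpath, sizeof(segpath), "%s.%d", path, segno);

        fd = OpenTransientFile(segpath, O_RDWR | PG_BINARY);
        if (fd < 0)
        {
            if (errno == ENOENT)
                break;          /* no more segments; invariant restored */
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not open file \"%s\": %m", segpath)));
        }

        if (ftruncate(fd, 0) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not truncate file \"%s\": %m", segpath)));

        CloseTransientFile(fd);
    }

A real patch would presumably also need to queue fsync requests for each
segment it shortens, the way the existing mdtruncate() path does.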
Does anyone see a way to improve on those proposed fixes?

Thanks,
nm

diff --git a/src/test/recovery/t/019_ddl.pl b/src/test/recovery/t/019_ddl.pl
new file mode 100644
index 0000000..657febe
--- /dev/null
+++ b/src/test/recovery/t/019_ddl.pl
@@ -0,0 +1,83 @@
+#
+# Test that DDL arranges for sufficient syncs.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use Test::More;
+use TestLib;
+
+plan tests => 6;
+
+my $node = PostgresNode->get_new_node('solo');
+$node->init();
+# Avoid time-based checkpoints, so checkpoints happen when we choose.
+$node->append_conf('postgresql.conf', 'checkpoint_timeout = 60min');
+# If fsync were disabled, the server would skip relevant code.
+$node->append_conf('postgresql.conf', 'fsync = on');
+# Log the "checkpoint sync:" messages.
+$node->append_conf('postgresql.conf', 'log_checkpoints = on');
+$node->append_conf('postgresql.conf', 'log_min_messages = debug1');
+# Test at the default wal_level.
+$node->append_conf('postgresql.conf', 'wal_level = replica');
+
+# UNLOGGED relations.  The variants with an INSERT after the DDL are the easy
+# case, because the INSERT arranges for a sync.  The others rely on DDL making
+# correct arrangements.
+$node->start;
+my $tablespace_dir = $node->basedir . '/tablespace_other';
+mkdir($tablespace_dir);
+$tablespace_dir = TestLib::perl2host($tablespace_dir);
+$node->safe_psql('postgres',
+	"CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+my $tablespace_insert_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE spc_ins (c int);
+INSERT INTO spc_ins VALUES (1);
+ALTER TABLE spc_ins SET TABLESPACE other;
+INSERT INTO spc_ins VALUES (2);
+SELECT pg_relation_filepath('spc_ins');
+EOSQL
+my $tablespace_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE spc (c int);
+INSERT INTO spc VALUES (1);
+ALTER TABLE spc SET TABLESPACE other;
+SELECT pg_relation_filepath('spc');
+EOSQL
+my $cluster_insert_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE clu_ins (c int PRIMARY KEY);
+INSERT INTO clu_ins VALUES (1);
+CLUSTER clu_ins USING clu_ins_pkey;
+INSERT INTO clu_ins VALUES (2);
+SELECT pg_relation_filepath('clu_ins');
+EOSQL
+my $cluster_filepath = $node->safe_psql('postgres', <<EOSQL);
+CREATE UNLOGGED TABLE clu (c int PRIMARY KEY);
+INSERT INTO clu VALUES (1);
+CLUSTER clu USING clu_pkey;
+SELECT pg_relation_filepath('clu');
+EOSQL
+my $index_insert_filepath = $node->safe_psql('postgres', <<EOSQL);
+SELECT pg_relation_filepath('clu_ins_pkey');
+EOSQL
+my $index_filepath = $node->safe_psql('postgres', <<EOSQL);
+SELECT pg_relation_filepath('clu_pkey');
+EOSQL
+$node->stop;
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$tablespace_insert_filepath/,
+	'sync unlogged table after SET TABLESPACE+INSERT');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$tablespace_filepath/,
+	'sync unlogged table after SET TABLESPACE');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$cluster_insert_filepath/,
+	'sync unlogged table after CLUSTER+INSERT');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$cluster_filepath/,
+	'sync unlogged table after CLUSTER');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$index_insert_filepath/,
+	'sync unlogged index after CREATE+INSERT');
+like(slurp_file($node->logfile),
+	qr/checkpoint sync: number=[0-9]* file=$index_filepath/,
+	'sync unlogged index after CREATE');