On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote: > On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote: > > On 05/09/2023 21:20, Robert Haas wrote: > > > Thinking about this some more, I think this is still not 100% correct, even > > with the patch I posted earlier: > > > > > /* > > > * When we WAL-logged rel pages, we must nonetheless fsync them. The > > > * reason is that since we're copying outside shared buffers, a > > > CHECKPOINT > > > * occurring during the copy has no way to flush the previously written > > > * data to disk (indeed it won't know the new rel even exists). A crash > > > * later on would replay WAL from the checkpoint, therefore it wouldn't > > > * replay our earlier WAL entries. If we do not fsync those pages here, > > > * they might still not be on disk when the crash occurs. > > > */ > > > if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED) > > > smgrimmedsync(dst, forkNum); > > > > Let's walk through each case one by one: > > > > 1. Temporary table. No fsync() needed. This is correct. > > > > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and > > also because we bypassed the buffer manager. Correct. > > Agreed. > > > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged > > it, because we bypassed the buffer manager. Like the comment explains. This > > is now correct with the patch. > > This has two subtypes: > > 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what > you wrote. > > 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be > fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call > AddPendingSync(). (RelationCreateStorage() could start calling > AddPendingSync() for this case. I think we chose not to do that because there > will never be additional writes to the init fork, and smgrDoPendingSyncs() > would send the fork to FlushRelationsAllBuffers() even though the fork will > never appear in shared buffers. On the other hand, grouping the sync with the > other end-of-xact syncs could improve efficiency for some filesystems. Also, > the number of distinguishable cases is unpleasantly high.) > > > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we > > WAL-logged it, because we bypassed the buffer manager. Like the comment > > explains. Correct. > > > > 5. Permanent rel, use_wal == false. We skip fsync, because it means that > > it's a new relation, so we have a sync request pending for it. (An assertion > > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(), > > we will either fsync it or we will WAL-log all the pages if it was a small > > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to > > WAL-log the pages, we have the same race condition that's explained in the > > comment, because we bypassed the buffer manager: > > > > 1. RelationCopyStorage() copies some of the pages. > > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a > > dirty segment when the relation was created) > > 3. RelationCopyStorage() copies the rest of the pages. > > 4. smgrDoPendingSyncs() WAL-logs all the pages. > > smgrDoPendingSyncs() delegates to log_newpage_range(). log_newpage_range() > loads each page into the buffer manager and calls MarkBufferDirty() on each. > Hence, ... > > > 5. Another checkpoint happens. It does *not* fsync the relation. > > ... I think this will fsync the relation. No? > > > 6. Crash.
While we're cataloging gaps, I think the middle sentence is incorrect in the following heapam_relation_set_new_filelocator() comment: /* * If required, set up an init fork for an unlogged table so that it can * be correctly reinitialized on restart. Recovery may remove it while * replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE * record. Therefore, logging is necessary even if wal_level=minimal. */ if (persistence == RELPERSISTENCE_UNLOGGED) { Assert(rel->rd_rel->relkind == RELKIND_RELATION || rel->rd_rel->relkind == RELKIND_MATVIEW || rel->rd_rel->relkind == RELKIND_TOASTVALUE); smgrcreate(srel, INIT_FORKNUM, false); log_smgrcreate(newrlocator, INIT_FORKNUM); } XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06) made it issue a checkpoint. XLOG_DBASE_CREATE_WAL_LOG never had that problem. XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08). In general, if we reintroduced such a bug, it would affect all new rels under wal_level=minimal, not just init forks. Having said all that, log_smgrcreate() calls are never conditional on wal_level=minimal; the above code is correct.