Re: Unlogged relation copy is not fsync'd

Noah Misch Sat, 07 Oct 2023 19:22:27 -0700

On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
> On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> > On 05/09/2023 21:20, Robert Haas wrote:
> 
> > Thinking about this some more, I think this is still not 100% correct, even
> > with the patch I posted earlier:
> > 
> > >   /*
> > >    * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> > >    * reason is that since we're copying outside shared buffers, a 
> > > CHECKPOINT
> > >    * occurring during the copy has no way to flush the previously written
> > >    * data to disk (indeed it won't know the new rel even exists).  A crash
> > >    * later on would replay WAL from the checkpoint, therefore it wouldn't
> > >    * replay our earlier WAL entries. If we do not fsync those pages here,
> > >    * they might still not be on disk when the crash occurs.
> > >    */
> > >   if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > >           smgrimmedsync(dst, forkNum);
> > 
> > Let's walk through each case one by one:
> > 
> > 1. Temporary table. No fsync() needed. This is correct.
> > 
> > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> > also because we bypassed the buffer manager. Correct.
> 
> Agreed.
> 
> > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> > it, because we bypassed the buffer manager. Like the comment explains. This
> > is now correct with the patch.
> 
> This has two subtypes:
> 
> 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
> you wrote.
> 
> 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
> fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
> AddPendingSync().  (RelationCreateStorage() could start calling
> AddPendingSync() for this case.  I think we chose not to do that because there
> will never be additional writes to the init fork, and smgrDoPendingSyncs()
> would send the fork to FlushRelationsAllBuffers() even though the fork will
> never appear in shared buffers.  On the other hand, grouping the sync with the
> other end-of-xact syncs could improve efficiency for some filesystems.  Also,
> the number of distinguishable cases is unpleasantly high.)
> 
> > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> > WAL-logged it, because we bypassed the buffer manager. Like the comment
> > explains. Correct.
> > 
> > 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> > it's a new relation, so we have a sync request pending for it. (An assertion
> > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> > we will either fsync it or we will WAL-log all the pages if it was a small
> > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> > WAL-log the pages, we have the same race condition that's explained in the
> > comment, because we bypassed the buffer manager:
> > 
> > 1. RelationCopyStorage() copies some of the pages.
> > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> > dirty segment when the relation was created)
> > 3. RelationCopyStorage() copies the rest of the pages.
> > 4. smgrDoPendingSyncs() WAL-logs all the pages.
> 
> smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
> loads each page into the buffer manager and calls MarkBufferDirty() on each.
> Hence, ...
> 
> > 5. Another checkpoint happens. It does *not* fsync the relation.
> 
> ... I think this will fsync the relation.  No?
> 
> > 6. Crash.


While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:

        /*
         * If required, set up an init fork for an unlogged table so that it can
         * be correctly reinitialized on restart.  Recovery may remove it while
         * replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
         * record.  Therefore, logging is necessary even if wal_level=minimal.
         */
        if (persistence == RELPERSISTENCE_UNLOGGED)
        {
                Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
                           rel->rd_rel->relkind == RELKIND_MATVIEW ||
                           rel->rd_rel->relkind == RELKIND_TOASTVALUE);
                smgrcreate(srel, INIT_FORKNUM, false);
                log_smgrcreate(newrlocator, INIT_FORKNUM);
        }

XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint.  XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08).  In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks.  Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.

Re: Unlogged relation copy is not fsync'd

Reply via email to