Hi,

On 2025-01-22 11:43:20 -0600, Nathan Bossart wrote:
> On Wed, Jan 22, 2025 at 11:21:03AM -0500, Andres Freund wrote:
> > fsync(open(oldname1));
> > fsync(open(oldname2));
> > ..
> > fsync(open(oldnameN));
> >
> > rename(oldname1, newname1);
> > rename(oldname2, newname2);
> > ..
> > rename(oldnameN, newnameN);
> >
> > fsync(open(newname1));
> > fsync(open(newname2));
> > ..
> > fsync(open(newnameN));
> >
> > fsync(open("pg_wal"));
>
> What is the purpose of syncing the file before the rename?

It's from the general durable_rename() code. It's there because it's required
for the "atomically replace a file" use case. Imagine the following:

create_and_fill("somefile.tmp");
rename("somefile.tmp", "somefile");
fsync("somefile");
fsync(".");

If you crash (OS/HW level) at the wrong moment (between rename() taking effect
in-memory and the fsyncs), you might end up with "somefile" pointing to the
*new* file, because the rename took effect, but with the new file's contents
not having reached disk yet. I.e. "somefile" will be empty.  Whether that's
possible depends on filesystem semantics (e.g. on ext4 it's possible with
data=writeback, and I think it's always possible on xfs).

In contrast to that, if you fsync("somefile.tmp") before the rename, a crash
between rename() and the later fsyncs will have "somefile" either pointing to
the *old and valid contents* or the *new and valid contents*, without a chance
for an empty file.
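
To make the safe ordering concrete, here is a minimal sketch in C, assuming
plain POSIX calls. replace_file_durably() and sync_path() are hypothetical
names for illustration, not the actual durable_rename() implementation, and
tolerating EBADF/EINVAL on directory fsyncs is an assumption about platforms
that refuse to sync directories.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open a path, fsync it, close it. Returns 0 on success, -1 on failure. */
static int
sync_path(const char *path, int is_dir)
{
    int fd = open(path, is_dir ? O_RDONLY : O_RDWR);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        int err = errno;

        close(fd);
        /* Assumption: some systems cannot fsync directories; tolerate that. */
        if (is_dir && (err == EBADF || err == EINVAL))
            return 0;
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical helper (not PostgreSQL code): replace "path" so that after a
 * crash it holds either the old or the new contents, never an empty file.
 */
static int
replace_file_durably(const char *dir, const char *tmppath, const char *path,
                     const void *data, size_t len)
{
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    /* Flush the new contents BEFORE the rename. A short write is an error here. */
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    if (rename(tmppath, path) != 0)
        return -1;
    /* Sync the file under its new name, then the directory entry. */
    if (sync_path(path, 0) != 0)
        return -1;
    return sync_path(dir, 1);
}
```

The important part is only the position of the first fsync(): the data is on
disk before the name can point at it.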


However, for the case of WAL recycling, we shouldn't need fsync() before the
rename, because we ought to already have done so when creating the segment
(cf. XLogFileInitInternal()) or when recycling it last time.
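
As a rough sketch of what the batched sequence would look like with the
pre-rename fsyncs dropped (assuming the old files' contents are already
durable), something like the following. rename_batch() and sync_one() are
hypothetical names, this is not existing PostgreSQL code, and ignoring
EBADF/EINVAL for the directory is an assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open a path, fsync it, close it. Returns 0 on success, -1 on failure. */
static int
sync_one(const char *path, int is_dir)
{
    int fd = open(path, is_dir ? O_RDONLY : O_RDWR);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        int err = errno;

        close(fd);
        /* Assumption: some systems cannot fsync directories; tolerate that. */
        if (is_dir && (err == EBADF || err == EINVAL))
            return 0;
        return -1;
    }
    return close(fd);
}

/*
 * Hypothetical helper: rename n files inside "dir" as one batch. There is no
 * fsync before the renames, on the assumption that the old files' contents
 * were already flushed earlier.
 */
static int
rename_batch(const char *dir, const char *const *oldnames,
             const char *const *newnames, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (rename(oldnames[i], newnames[i]) != 0)
            return -1;

    for (i = 0; i < n; i++)
        if (sync_one(newnames[i], 0) != 0)
            return -1;

    /* One directory fsync covers all of the renames. */
    return sync_one(dir, 1);
}
```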


I suspect the theoretically superfluous fsync() won't have a meaningful
performance impact most of the time though, because

a) There shouldn't be any dirty data for the file, since we obviously need to
   have flushed the WAL past the recycled segment

b) Except for the first to-be-recycled segment, we just fsynced after the last
   rename, so there won't be any filesystem journal data that needs to be
   flushed

I'm not entirely sure about a) though - depending on mount options it's
possible that the fsync() will flush file modification times when using
wal_sync_method=fdatasync.  But even if that's possibly reachable, I doubt
it'll be common, due to a checkpoint having to complete between the WAL flush
and recycling. Could be worth experimenting with.


Greetings,

Andres

