Hi,

On 2025-01-22 11:43:20 -0600, Nathan Bossart wrote:
> On Wed, Jan 22, 2025 at 11:21:03AM -0500, Andres Freund wrote:
> > fsync(open(oldname1));
> > fsync(open(oldname2));
> > ..
> > fsync(open(oldnameN));
> >
> > rename(oldname1, newname1);
> > rename(oldname2, newname2);
> > ..
> > rename(oldnameN, newnameN);
> >
> > fsync(open(newname1));
> > fsync(open(newname2));
> > ..
> > fsync(open(newnameN));
> >
> > fsync(open("pg_wal"));
>
> What is the purpose of syncing the file before the rename?

It's from the general durable_rename() code. The reason it's there is that
it's required for the "atomically replace a file" use case. Imagine the
following:

    create_and_fill("somefile.tmp");
    rename("somefile.tmp", "somefile");
    fsync("somefile");
    fsync(".");

If you crash (OS/HW level) at the wrong moment (between rename() taking effect
in memory and the fsyncs), you might end up with "somefile" pointing to the
*new* file, because the rename took effect, but with the new file's contents
not having reached disk yet. I.e. "somefile" will be empty. Whether that's
possible depends on filesystem semantics (e.g. on ext4 it's possible with
data=writeback, I think it's always possible on xfs).

In contrast to that, if you fsync("somefile.tmp") before the rename, a crash
between rename() and the later fsyncs will leave "somefile" pointing to either
the *old and valid contents* or the *new and valid contents*, without a chance
of an empty file.

However, for the case of WAL recycling, we shouldn't need the fsync() before
the rename, because we ought to have already done so when creating the segment
(c.f. XLogFileInitInternal()) or when recycling it last time.

I suspect the theoretically superfluous fsync() won't have a meaningful
performance impact most of the time though, because

a) There shouldn't be any dirty data for the file, as we obviously need to
   have flushed the WAL past the recycled segment

b) Except for the first to-be-recycled segment, we just fsynced after the last
   rename, so there won't be any filesystem journal data that needs to be
   flushed

I'm not entirely sure about a) though - depending on mount options it's
possible that the fsync() will flush file modification times when using
wal_sync_method=fdatasync. But even if that's reachable, I doubt it'll be
common, because a checkpoint has to complete between the WAL flush and the
recycling. Could be worth experimenting with.

Greetings,

Andres
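
For reference, the safe ordering described above (fsync the temporary file,
rename, fsync the new name, fsync the directory) can be sketched in plain
POSIX C. This is only a minimal sketch under stated assumptions, not
PostgreSQL's durable_rename(): the helper names fsync_path() and
replace_file_durably() are hypothetical, error handling is reduced to return
codes, and it assumes a platform where a directory can be opened read-only
and fsynced (as on Linux).

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync a file or directory given its path. */
int
fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Hypothetical helper: replace dir/name with the given contents so that,
 * after a crash, dir/name holds either the old or the new contents.
 * Returns 0 on success, -1 on failure.
 */
int
replace_file_durably(const char *dir, const char *name,
                     const char *data, size_t len)
{
    char path[1024];
    char tmppath[1024];
    int  fd;

    snprintf(path, sizeof(path), "%s/%s", dir, name);
    snprintf(tmppath, sizeof(tmppath), "%s/%s.tmp", dir, name);

    /* 1. Write the new contents to a temporary file and flush them. */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* 2. Only now make the new contents visible under the final name. */
    if (rename(tmppath, path) != 0)
        return -1;

    /* 3. Flush the renamed file, then the directory entry itself. */
    if (fsync_path(path) != 0)
        return -1;
    return fsync_path(dir);
}
```

Skipping step 1's fsync() is what opens the window for an empty "somefile"
after a crash; the fsyncs in step 3 only make the rename itself durable.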