Hi,

On 2025-01-22 01:14:22 +0000, Andy Fan wrote:
> Andres Freund <and...@anarazel.de> writes:
> > FWIW, I've seen the fsyncs around recycling being a rather substantial
> > bottleneck. To the point of the main benefit of larger segments being the
> > reduction in number of fsyncs at the end of a checkpoint. I think we should
> > be able to make the fsyncs a lot more efficient by batching them: first
> > rename a bunch of files, then fsync them and the directory. The current
> > pattern basically requires a separate filesystem journal flush for each
> > WAL segment.
>
> For education purposes, how does one fsync files in a batch? 'man fsync'
> tells me the user can only fsync one file at a time.
>
>     int fsync(int fd);
>
> The fsync manual doesn't seem to say that fsync on a directory would fsync
> all the files under that directory.
Right now we do something that essentially boils down to:

  // recycle WAL file oldname1
  fsync(open(oldname1));
  rename(oldname1, newname1);
  fsync(open(newname1));
  fsync(open("pg_wal"));

  // recycle WAL file oldname2
  fsync(open(oldname2));
  rename(oldname2, newname2);
  fsync(open(newname2));
  fsync(open("pg_wal"));

  ...

  // recycle WAL file oldnameN
  fsync(open(oldnameN));
  rename(oldnameN, newnameN);
  fsync(open(newnameN));
  fsync(open("pg_wal"));

Most of the time the fsync on oldname won't have to do any IO (because
presumably we'll have flushed it before), but the rename obviously requires a
metadata update and thus the fsync will have work to do (whether it's the
fsync on newname or on the directory differs between filesystems). This
pattern basically forces the filesystem to do at least one journal flush for
every single WAL segment, i.e. each recycled segment incurs at least the
latency of one synchronous durable write IO.

But if we instead change it to something like this:

  fsync(open(oldname1));
  fsync(open(oldname2));
  ...
  fsync(open(oldnameN));

  rename(oldname1, newname1);
  rename(oldname2, newname2);
  ...
  rename(oldnameN, newnameN);

  fsync(open(newname1));
  fsync(open(newname2));
  ...
  fsync(open(newnameN));

  fsync(open("pg_wal"));

most filesystems will be able to combine many of the journal flushes
triggered by the renames into much bigger journal flushes. That makes the
overall time for recycling much lower than with the current pattern, since
there are far fewer synchronous durable writes.
Here's a rough approximation of the effect using shell commands:

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done; sync; for i in $(seq 1 $N); do mv test.$i.old test.$i.new; sync; done;)

real    0m7.218s
user    0m0.431s
sys     0m4.892s

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done; sync; for i in $(seq 1 $N); do mv test.$i.old test.$i.new; done; sync)

real    0m2.678s
user    0m0.282s
sys     0m2.402s

The only difference between the two versions is that the latter can combine
the journal flushes, due to the sync happening outside of the loop.

This is a somewhat poor approximation of how this would work in postgres,
including likely exaggerating the gain (I think sync flushes the filesystem
superblock too), but it does show the principle.

Greetings,

Andres Freund