Hi,

On 2025-01-22 01:14:22 +0000, Andy Fan wrote:
> Andres Freund <and...@anarazel.de> writes:
> > FWIW, I've seen the fsyncs around recycling being a rather substantial
> > bottleneck. To the point of the main benefit of larger segments being the
> > reduction in number of fsyncs at the end of a checkpoint. I think we should
> > be able to make the fsyncs a lot more efficient by batching them: first
> > rename a bunch of files, then fsync them and the directory. The current
> > pattern basically requires a separate filesystem journal flush for each
> > WAL segment.
>
> For education purposes, how does one fsync files in a batch? 'man fsync'
> tells me the user can only fsync one file at a time.
>
>     int fsync(int fd);
>
> The fsync manual doesn't seem to say that fsync on a directory would fsync
> all the files under that directory.
Right now we do something that essentially boils down to:

  // recycle WAL file oldname1
  fsync(open(oldname1));
  rename(oldname1, newname1);
  fsync(open(newname1));
  fsync(open("pg_wal"));

  // recycle WAL file oldname2
  fsync(open(oldname2));
  rename(oldname2, newname2);
  fsync(open(newname2));
  fsync(open("pg_wal"));

  ...

  // recycle WAL file oldnameN
  fsync(open(oldnameN));
  rename(oldnameN, newnameN);
  fsync(open(newnameN));
  fsync(open("pg_wal"));

Most of the time the fsync on oldname won't have to do any IO (because
presumably we'll have flushed it before), but the rename obviously requires a
metadata update and thus the fsync will have work to do (whether it's the
fsync on newname or on the directory differs between filesystems). This
pattern basically forces the filesystem to do at least one journal flush for
every single WAL segment, i.e. each recycled segment incurs at least the
latency of one synchronous durable write IO.

But if we instead change it to something like this:

  fsync(open(oldname1));
  fsync(open(oldname2));
  ...
  fsync(open(oldnameN));

  rename(oldname1, newname1);
  rename(oldname2, newname2);
  ...
  rename(oldnameN, newnameN);

  fsync(open(newname1));
  fsync(open(newname2));
  ...
  fsync(open(newnameN));

  fsync(open("pg_wal"));

most filesystems will be able to combine many of the journal flushes
triggered by the renames into much bigger journal flushes. That makes the
overall time for recycling much lower than with the current pattern, since
there are far fewer synchronous durable writes.
Here's a rough approximation of the effect using shell commands:

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done; sync; for i in $(seq 1 $N); do mv test.$i.old test.$i.new; sync; done;)

real    0m7.218s
user    0m0.431s
sys     0m4.892s

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done; sync; for i in $(seq 1 $N); do mv test.$i.old test.$i.new; done; sync)

real    0m2.678s
user    0m0.282s
sys     0m2.402s

The only difference between the two versions is that the latter can combine
the journal flushes, due to the sync happening outside of the loop.

This is a somewhat poor approximation of how this would work in postgres,
including likely exaggerating the gain (I think sync flushes the filesystem
superblock too), but it does show the principle.

Greetings,

Andres Freund