Hi,

On 2025-01-16 14:50:57 +0530, Ritu Bhandari wrote:
> Adding to Andy Fan's point above:
>
> If we increase the WAL segment size from 16MB to 64MB, initializing the
> 64MB WAL segment inline can cause several seconds of freeze on all write
> transactions when it happens. Writing out a newly zero-filled 64MB WAL
> segment takes several seconds for smaller disk sizes.
>
> Disk size (GB)   Throughput per GiB (MiBps)   Throughput (MiBps)   Time to write 64MB (seconds)
>     10           0.48                            5                 13.33
>     32           0.48                           15                  4.17
>     64           0.48                           31                  2.08
>    128           0.48                           61                  1.04
>    256           0.48                          123                  0.52
>    500           0.48                          240                  0.27
>    834           0.48                          400                  0.16
>  1,000           0.48                          480                  0.13
>
> Writing a full 64MB of zeroes at every WAL file switch will not just cause
> general performance degradation, but, more concerningly, also makes the
> workload more "jittery" by stopping all WAL writes - and therefore all
> write workloads - at every WAL switch for the time it takes to zero-fill.
I agree. But I don't think a ~2x reduction in common cases is an acceptable
price to pay for disabling WAL init by default.

I think what we instead ought to do is to more aggressively initialize WAL
files ahead of time, so it doesn't happen while holding crucial locks. We
know the recent rate of WAL generation, and we could easily track up to
which LSN we have recycled WAL segments. Armed with that information,
walwriter (or something else) should try to ensure that there's always a
fair amount of pre-allocated WAL (the pacing sketch in the PS below is
roughly what I mean).

If your disk only has a sequential write speed of 4.8MB/s, I don't think
any nontrivial database workload is going to work well. And it obviously
makes no sense whatsoever to increase the WAL segment size on such systems.
I don't think we can really make the smallest disks in your list work well
- there's only so much we can do given the low limits, and we can probably
invest our time much more fruitfully by focusing on systems with disk
speeds that aren't slower than spinning rust from the 1990's.

That's not to say it's not worth working on preallocating WAL files. But
that's not going to help much if initializing a single WAL segment is going
to eat the entire bandwidth budget for 10+ seconds.


> Also about WAL recycling: during our performance benchmarking, we noticed
> that a high volume of updates or inserts will tend to generate WAL faster
> than standard checkpoint processing can keep up with, resulting in
> increased WAL file creation (instead of rotation) and zero-filling, which
> significantly degrades performance.

I'm not sure I understand the specifics here - did the high WAL generation
rate result in the recycling taking too long? Or did checkpointer take too
long to write out data, and because of that recycling didn't happen
frequently enough?


> I see, PG once had fallocate [1] (which was reverted by [2] due to a
> performance regression concern). The original OSS discussion was in [3].
> The perf regression was reported in [4]. It looks like this was due to how
> ext4 handled extents and uninitialized data [5], and that seems to be
> fixed in [6]. I'll check with Theodore Ts'o to confirm on [6].
>
> Could we consider adding back fallocate?

Fallocate doesn't really help, unfortunately. On common filesystems (like
ext4/xfs) it just allocates file space without zeroing out the underlying
blocks. To make that correct, those filesystems keep a bitmap indicating
which blocks in the range have not yet been written. Unfortunately,
updating that bitmap when the blocks are eventually written is a metadata
operation and thus requires journaling.

I've seen some mild speedups by first using fallocate and then zeroing out
the file, particularly with larger segment sizes - I think mainly due to
avoiding delayed allocation in the filesystem, rather than actually
reducing fragmentation. But it really isn't a whole lot. (The fallocate
sketch in the PS below shows the shape of what I mean.)

I have in the past tried to get the Linux filesystem developers to add an
fallocate mode that doesn't use the "unwritten extents" "optimization", but
didn't have luck with that. The block layer in Linux actually does have
support for zeroing out regions of blocks without having to actually write
the data, but it's only used in some narrow cases (I don't remember the
details).

Greetings,

Andres Freund
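
PS: To make the pre-allocation idea a bit more concrete, here is a rough
standalone sketch of the kind of pacing calculation I have in mind. This is
not existing backend code - the function name, the 5 second horizon, and
the interface are all made up for illustration. It just computes how many
extra segments would need to be pre-created so that roughly N seconds worth
of WAL, at the recently observed rate, is ready ahead of the insert
position. Something like walwriter could run such a check whenever it wakes
up and do the actual zero-filling outside the critical WAL insertion paths.

/*
 * Hypothetical sketch, not PostgreSQL code: decide how many WAL segments
 * to pre-create, given the recently observed WAL generation rate and how
 * far ahead of the insert position ready (recycled or pre-created)
 * segments already exist.  All names are invented for illustration.
 */
#include <stdint.h>
#include <stdio.h>

#define SEG_SIZE ((uint64_t) 64 * 1024 * 1024)  /* 64MB segments */
#define PREALLOC_HORIZON_SECS 5                 /* keep ~5s of WAL ready */

static uint64_t
segments_to_preallocate(uint64_t insert_lsn,      /* current insert position */
                        uint64_t ready_upto_lsn,  /* end of last ready segment */
                        double wal_bytes_per_sec) /* recent WAL generation rate */
{
    uint64_t    want = (uint64_t) (wal_bytes_per_sec * PREALLOC_HORIZON_SECS);
    uint64_t    have = ready_upto_lsn > insert_lsn ?
        ready_upto_lsn - insert_lsn : 0;

    if (have >= want)
        return 0;
    /* round up to whole segments */
    return (want - have + SEG_SIZE - 1) / SEG_SIZE;
}

int
main(void)
{
    /* e.g. ~50MB/s of WAL, with one 64MB segment already ready ahead */
    printf("segments to pre-create: %llu\n",
           (unsigned long long) segments_to_preallocate(0, SEG_SIZE,
                                                        50.0 * 1024 * 1024));
    return 0;
}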
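
And the "first fallocate, then zero out" combination mentioned earlier, as
a self-contained, Linux-specific illustration - this is not how xlog.c does
it, and the file name and buffer size are arbitrary. The point is simply to
reserve the blocks with fallocate() so the filesystem doesn't have to do
delayed allocation, and then overwrite the whole file with zeroes so there
are no unwritten extents left to convert later:

/* Standalone illustration only - not backend code.  Linux-specific. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SEG_SIZE  (64 * 1024 * 1024)    /* 64MB segment, as discussed above */
#define BUF_SIZE  (128 * 1024)

int
main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "walseg.tmp";
    static char zbuf[BUF_SIZE];         /* static, so zero-initialized */
    int         fd;

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* reserve the blocks up front, avoiding delayed allocation */
    if (fallocate(fd, 0, 0, SEG_SIZE) != 0)
    {
        perror("fallocate");
        return 1;
    }

    /*
     * Now actually write zeroes, so the "unwritten extent" metadata doesn't
     * have to be updated later, when the segment is filled with real WAL.
     */
    for (off_t off = 0; off < SEG_SIZE; off += BUF_SIZE)
    {
        if (pwrite(fd, zbuf, BUF_SIZE, off) != BUF_SIZE)
        {
            perror("pwrite");
            return 1;
        }
    }

    if (fsync(fd) != 0)
    {
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}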