Hi,

On 2025-01-15 09:12:17 +0000, Andy Fan wrote:
> It is unclear to me why do we need wal_init_zero. Per comments:
>
> /*
>  * Zero-fill the file. With this setting, we do this the hard way to
>  * ensure that all the file space has really been allocated. On
>  * platforms that allow "holes" in files, just seeking to the end
>  * doesn't allocate intermediate space. This way, we know that we
>  * have all the space and (after the fsync below) that all the
>  * indirect blocks are down on disk. Therefore, fdatasync(2) or
>  * O_DSYNC will be sufficient to sync future writes to the log file.
>  */
>
> I can understand that "the file space has really been allocated", but
> why do we care about this?
Performance.

If you create an empty segment by lseek'ing to the end (typically resulting
in a large "hole" in the file that's not yet backed by storage), or you
allocate it by calling fallocate() (which allocates space but doesn't write
it), durable writes need to do more work. The reason for the additional work
is that you don't just need to write the new WAL contents and then flush the
write cache, you will also (on most, but not all, filesystems) incur a
filesystem metadata write. In case of the file-with-hole approach, the
filesystem has to first allocate space to the file, journal the relevant
metadata change, probably flush that change, then write the data, and then
another cache flush is needed for the fdatasync() at COMMIT.

If your workload doesn't commit very often compared to the rate of WAL
generation, that will often be fine, e.g. if you do a bulk data load - the
added number of flushes isn't big. However, if your workload includes a lot
of small WAL writes & flushes, the increased number of flushes can hurt
rather badly.

If you have wal_recycle=true, this overhead will only be paid the first time
a WAL segment is used, of course, not after recycling.

Here's an example:

Storage is an older client-oriented NVMe SSD (SAMSUNG MZVLB1T0HBLR-000L7). To
make it easier to ensure that the new-WAL-file case is tested, I turned
wal_recycle off. To make the pattern of WAL easily repeatable, I'm using
pg_logical_emit_message() to emit a WAL record that then needs to be flushed
to disk, because I pass true for the transactional argument.

c=1 && \
  psql -c checkpoint -c 'select pg_switch_wal()' && \
  pgbench -n -M prepared -c$c -j$c \
    -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));") \
    -P1 -t 10000

wal_init_zero = 1: 885 TPS
wal_init_zero = 0: 286 TPS

Of course I chose this case to be intentionally extreme - each transaction
fills a bit more than one page of WAL and immediately flushes it. That
guarantees that each commit needs a separate filesystem metadata flush and a
flush of the data for the fdatasync() at commit.

If I instead emit a huge WAL record and flush the WAL rarely, e.g. by passing
16*1024*1024 to repeat() in the command above, the difference completely
vanishes:

wal_init_zero = 1: 6.25 TPS
wal_init_zero = 0: 6.27 TPS

If anything the init_zero path is now slower, because it has to do more work
(writing out all the zeroes up front). The reason it doesn't hurt to have
wal_init_zero disabled in this case is that the workload leads to huge WAL
writes, which means the additional number of metadata flushes is very small.

Similarly, if the WAL writes/flushes are very small (say a single '0' in the
test from above), there also won't be a benefit from wal_init_zero=1, because
now most of the time we're just writing to the same WAL page as the previous
transaction, which won't require filesystem metadata changes.

Note that not all filesystems can benefit from wal_init_zero=1. E.g. ZFS or
BTRFS won't benefit, because they always allocate new disk space for each
write, with the associated overheads.

Greetings,

Andres Freund
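
PS: In case somebody wants to play with the file-initialization side of this
outside of postgres, here's a minimal standalone C sketch contrasting the two
strategies discussed above. It is not the actual xlog.c code; the file name,
segment size and error handling are made up purely for illustration:

/*
 * Standalone sketch (not PostgreSQL's xlog.c): initialize a fake 16 MB
 * "segment" either by writing zeroes through the whole file or by seeking
 * to the end and writing a single byte, leaving a hole.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEG_SIZE	(16 * 1024 * 1024)
#define BLCKSZ		8192

/*
 * wal_init_zero=1 style: write zeroes through the whole file, so every
 * block is allocated (and its metadata journaled) up front; later data
 * writes + fdatasync() don't require further metadata changes.
 */
static void
init_zero_fill(int fd)
{
	static const char buf[BLCKSZ];	/* static storage => all zeroes */

	for (int i = 0; i < SEG_SIZE / BLCKSZ; i++)
	{
		if (write(fd, buf, BLCKSZ) != BLCKSZ)
		{
			perror("write");
			exit(1);
		}
	}
}

/*
 * wal_init_zero=0 style: seek to the end and write a single byte. The file
 * has the right size, but everything before the last byte is a hole that
 * the filesystem allocates (and journals) lazily as data is written later.
 */
static void
init_sparse(int fd)
{
	char		zero = 0;

	if (lseek(fd, SEG_SIZE - 1, SEEK_SET) < 0 ||
		write(fd, &zero, 1) != 1)
	{
		perror("lseek/write");
		exit(1);
	}
}

int
main(int argc, char **argv)
{
	int			fd = open("fake-wal-segment", O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0)
	{
		perror("open");
		exit(1);
	}

	if (argc > 1)				/* any argument: use the sparse variant */
		init_sparse(fd);
	else
		init_zero_fill(fd);

	if (fsync(fd) != 0)			/* make the initialization itself durable */
	{
		perror("fsync");
		exit(1);
	}
	close(fd);
	return 0;
}

On a filesystem that supports holes, comparing the two resulting files with
"ls -ls" (allocated blocks vs. apparent size), or strace'ing subsequent
writes + fdatasync() against them, makes the difference visible: the lseek
variant starts out with almost no allocated blocks and only acquires them
(with the corresponding metadata journaling) as the file is actually written
to.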