Hi,

On 2025-01-15 09:12:17 +0000, Andy Fan wrote:
> It is unclear to me why do we need wal_init_zero. Per comments:
>
> /*
>  * Zero-fill the file. With this setting, we do this the hard way to
>  * ensure that all the file space has really been allocated. On
>  * platforms that allow "holes" in files, just seeking to the end
>  * doesn't allocate intermediate space. This way, we know that we
>  * have all the space and (after the fsync below) that all the
>  * indirect blocks are down on disk. Therefore, fdatasync(2) or
>  * O_DSYNC will be sufficient to sync future writes to the log file.
>  */
>
> I can understand that "the file space has really been allocated", but
> why do we care about this?
Performance.

If you create an empty segment by lseek'ing to the end (typically resulting
in a large "hole" in the file that's not yet backed by storage), or you
allocate it by calling fallocate() (which allocates space but doesn't write
it), durable writes need to do more work. The reason for the additional work
is that you don't just need to write the new WAL contents and then flush the
write cache, you will also (on most, but not all, filesystems) incur a
filesystem metadata write. In case of the file-with-hole approach, the
filesystem has to first allocate space to the file, journal the relevant
metadata change, probably flush that change, then write the data, and then
another cache flush is needed for the fdatasync() at COMMIT.

If your workload doesn't commit very often compared to the rate of WAL
generation, that will often be fine, e.g. if you do a bulk data load - the
added number of flushes isn't big. However, if your workload includes a lot
of small WAL writes & flushes, the increased number of flushes can hurt
rather badly.

If you have wal_recycle=true, this overhead will only be paid the first time
a WAL segment is used, of course, not after recycling.

Here's an example:

Storage is an older client-oriented NVMe SSD (SAMSUNG MZVLB1T0HBLR-000L7). To
make it easier to ensure that the new-WAL-file case is tested, I turned
wal_recycle off. To make the pattern of WAL easily repeatable, I'm using
pg_logical_emit_message() to emit a WAL record that then needs to be flushed
to disk, because I pass true for the transactional argument.

c=1 && \
  psql -c checkpoint -c 'select pg_switch_wal()' && \
  pgbench -n -M prepared -c$c -j$c \
    -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));") \
    -P1 -t 10000

wal_init_zero = 1: 885 TPS
wal_init_zero = 0: 286 TPS

Of course I chose this case to be intentionally extreme - each transaction
fills a bit more than one page of WAL and immediately flushes it. That
guarantees that each commit needs a separate filesystem metadata flush and a
flush of the data for the fdatasync() at commit.

If I instead emit a huge WAL record and flush the WAL rarely, e.g. by passing
16*1024*1024 to repeat() in the command above, the difference completely
vanishes:

wal_init_zero = 1: 6.25 TPS
wal_init_zero = 0: 6.27 TPS

If anything the init_zero path is now slower, because it has to do more work
(writing out all the zeroes up front). The reason it doesn't hurt to have
wal_init_zero disabled in this case is that the workload leads to huge WAL
writes, which means the additional number of metadata flushes is very small.

Similarly, if the WAL writes/flushes are very small (say a single '0' in the
test from above), there also won't be a benefit from wal_init_zero=1, because
now most of the time we're just writing to the same WAL page as the previous
transaction, which won't require filesystem metadata changes.

Note that not all filesystems can benefit from wal_init_zero=1. E.g. ZFS or
BTRFS won't benefit, because they always allocate new disk space for each
write, with the associated overheads.

Greetings,

Andres Freund
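
PS: In case somebody wants to play with the file-initialization side of this
outside of postgres, here's a minimal standalone C sketch contrasting the two
strategies discussed above. It is not the actual xlog.c code; the file name,
segment size and error handling are made up purely for illustration:

/*
 * Standalone sketch (not PostgreSQL's xlog.c): initialize a fake 16 MB
 * "segment" either by writing zeroes through the whole file or by seeking
 * to the end and writing a single byte, leaving a hole.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEG_SIZE	(16 * 1024 * 1024)
#define BLCKSZ		8192

/*
 * wal_init_zero=1 style: write zeroes through the whole file, so every
 * block is allocated (and its metadata journaled) up front; later data
 * writes + fdatasync() don't require further metadata changes.
 */
static void
init_zero_fill(int fd)
{
	static const char buf[BLCKSZ];	/* static storage => all zeroes */

	for (int i = 0; i < SEG_SIZE / BLCKSZ; i++)
	{
		if (write(fd, buf, BLCKSZ) != BLCKSZ)
		{
			perror("write");
			exit(1);
		}
	}
}

/*
 * wal_init_zero=0 style: seek to the end and write a single byte. The file
 * has the right size, but everything before the last byte is a hole that
 * the filesystem allocates (and journals) lazily as data is written later.
 */
static void
init_sparse(int fd)
{
	char		zero = 0;

	if (lseek(fd, SEG_SIZE - 1, SEEK_SET) < 0 ||
		write(fd, &zero, 1) != 1)
	{
		perror("lseek/write");
		exit(1);
	}
}

int
main(int argc, char **argv)
{
	int			fd = open("fake-wal-segment", O_CREAT | O_WRONLY | O_TRUNC, 0600);

	if (fd < 0)
	{
		perror("open");
		exit(1);
	}

	if (argc > 1)				/* any argument: use the sparse variant */
		init_sparse(fd);
	else
		init_zero_fill(fd);

	if (fsync(fd) != 0)			/* make the initialization itself durable */
	{
		perror("fsync");
		exit(1);
	}
	close(fd);
	return 0;
}

On a filesystem that supports holes, comparing the two resulting files with
"ls -ls" (allocated blocks vs. apparent size), or strace'ing subsequent
writes + fdatasync() against them, makes the difference visible: the lseek
variant starts out with almost no allocated blocks and only acquires them
(with the corresponding metadata journaling) as the file is actually written
to.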