Re: Purpose of wal_init_zero

2025-01-21 Thread Ritu Bhandari
Hi @Andres Freund 

> I'm not sure I understand the specifics here - did the high WAL generation
> rate result in the recycling taking too long?  Or did checkpointer take
> too long to write out data, and because of that recycling didn't happen
> frequently enough?

If the volume of WAL generated within a checkpoint interval greatly exceeds
max_wal_size, there aren't enough recycled WAL files available, and the
system has to create a large number of new ones. This can significantly
increase the time spent on initialization, especially if we've increased the
WAL segment size to 64 MB (4x the default 16 MB). Conversely, setting a very
high max_wal_size to retain more recycled WAL files can lead to longer
recovery times, as the total amount of WAL to replay can become very large.
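
As a rough illustration (these numbers are made up for the example, not from
our benchmark): at a sustained 100 MiB/s of WAL with a 5-minute checkpoint
interval, one cycle writes about 100 MiB/s * 300 s = ~29 GiB of WAL. With
max_wal_size = 4 GB and 64 MB segments, at most 4 GB / 64 MB = 64 recycled
segments are on hand, so roughly 400 of the ~470 segments needed per cycle
must be newly created and zero-filled.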

I'll talk to Theodore and confirm the fallocate part.


> I think what we instead ought to do is to more aggressively initialize WAL
> files ahead of time, so it doesn't happen while holding crucial locks.  We
> know the recent rate of WAL generation, and we could easily track up to
> which LSN we have recycled WAL segments. Armed with that information
> walwriter (or something else) should try to ensure that there's always a
> fair amount of pre-allocated WAL.

I agree. Having WAL files preallocated ahead of time would be the ideal
scenario.
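
To make the idea concrete, here is a rough sketch of what a walwriter-side
cushion could look like. This is illustrative only, not actual PostgreSQL
code: wal_rate_bytes_per_sec(), segments_preallocated() and
preallocate_segment() are hypothetical helpers standing in for whatever a
real patch would use.

    /* Hypothetical sketch, not actual PostgreSQL code. */
    #include <stdbool.h>

    #define SEGMENT_SIZE (64 * 1024 * 1024) /* assuming wal_segment_size = 64MB */

    extern double wal_rate_bytes_per_sec(void); /* recent WAL generation rate */
    extern int    segments_preallocated(void);  /* unused recycled + zeroed segments */
    extern bool   preallocate_segment(void);    /* initialize one segment */

    /* Called periodically from walwriter's main loop. */
    static void
    maintain_wal_cushion(double target_seconds)
    {
        /* Enough segments to absorb target_seconds of WAL at the recent rate. */
        int target = (int) (wal_rate_bytes_per_sec() * target_seconds
                            / SEGMENT_SIZE) + 1;

        while (segments_preallocated() < target)
        {
            /*
             * Initialization happens here, in walwriter, instead of in a
             * user backend that is holding everyone else up.
             */
            if (!preallocate_segment())
                break;  /* e.g. out of disk space; retry on the next cycle */
        }
    }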


> I put some patches together for this a few years ago [0], but ended up
> abandoning them due to lack of interest.  I'm happy to revisit that effort
> if folks do become interested.

Great to know about this, and it aligns with our thinking. We can continue
the discussion on the other thread. I can also help wherever needed.



On Tue, 21 Jan 2025 at 06:39, Andy Fan  wrote:

>
> Hi,
>
> > On Fri, Jan 17, 2025 at 04:29:14PM -0500, Andres Freund wrote:
> >> I think what we instead ought to do is to more aggressively initialize
> >> WAL files ahead of time, so it doesn't happen while holding crucial
> >> locks.  We know the recent rate of WAL generation, and we could easily
> >> track up to which LSN we have recycled WAL segments. Armed with that
> >> information walwriter (or something else) should try to ensure that
> >> there's always a fair amount of pre-allocated WAL.
> >
> > I put some patches together for this a few years ago [0], but ended up
> > abandoning them due to lack of interest.  I'm happy to revisit that
> > effort if folks do become interested.
>
> Great to know this. I went through that thread and found that the main
> considerations are pretty similar to what I had in mind when working out
> the PoC. I will go to [0] for further discussion on this topic.
>
> > [0] https://postgr.es/m/20220408203003.GA1630183%40nathanxps13
> --
> Best Regards
> Andy Fan
>
>


Re: Purpose of wal_init_zero

2025-01-16 Thread Ritu Bhandari
Hi,

Adding to Andy Fan's point above:

If we increase the WAL segment size from 16MB to 64MB, initializing a 64MB
WAL segment inline can freeze all write transactions for several seconds
each time it happens: writing out a newly zero-filled 64MB segment takes
that long on smaller disk sizes.

Disk size (GB)   Throughput per GiB (MiBps)   Throughput (MiBps)   Time to write 64MB (s)
        10                 0.48                        5                  13.33
        32                 0.48                       15                   4.17
        64                 0.48                       31                   2.08
       128                 0.48                       61                   1.04
       256                 0.48                      123                   0.52
       500                 0.48                      240                   0.27
       834                 0.48                      400                   0.16
     1,000                 0.48                      480                   0.13
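
(The last column is just 64 MiB divided by the disk's throughput; e.g. for
the 10 GB disk: 64 / (10 * 0.48) = 64 / 4.8 ≈ 13.33 seconds.)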


Writing a full 64MB of zeroes at every WAL file switch not only causes
general performance degradation but, more concerningly, also makes the
workload more "jittery": all WAL writes, and therefore all write workloads,
stall at every WAL switch for the time it takes to zero-fill.

Also, regarding WAL recycling: during our performance benchmarking we
noticed that a high volume of updates or inserts tends to generate WAL
faster than the standard checkpoint process can keep up with, resulting in
more WAL files being created (instead of recycled) and zero-filled, which
significantly degrades performance.

I see that PG once used fallocate for WAL files [1], which was reverted by
[2] due to a performance regression concern. The original discussion was in
[3], and the regression was reported in [4]. It looks like the regression
was due to how ext4 handled extents and uninitialized data [5], which seems
to have been fixed in [6]. I'll check with Theodore Ts'o to confirm [6].

Could we consider adding back fallocate?
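
For context, here is a minimal sketch of the two initialization strategies
being compared. This is illustrative only, not the actual xlog.c code, and
error handling is reduced to a bare minimum:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SEG_SIZE    (64 * 1024 * 1024)  /* wal_segment_size = 64MB */
    #define XLOG_BLCKSZ 8192                /* WAL block size */

    /* wal_init_zero = on: physically write zeroes through the whole file. */
    static int
    init_by_zero_fill(int fd)
    {
        char  buf[XLOG_BLCKSZ] = {0};
        off_t off;

        for (off = 0; off < SEG_SIZE; off += XLOG_BLCKSZ)
        {
            if (pwrite(fd, buf, XLOG_BLCKSZ, off) != XLOG_BLCKSZ)
                return -1;
        }
        return fsync(fd);
    }

    /*
     * fallocate approach (as in the reverted commit [1]): reserve the
     * blocks without writing them.  Far less I/O up front; the reported
     * ext4 regression came from converting unwritten extents later.
     */
    static int
    init_by_fallocate(int fd)
    {
        if (posix_fallocate(fd, 0, SEG_SIZE) != 0)
            return -1;
        return fsync(fd);
    }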

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=269e780
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b571bb
[3]
https://www.postgresql.org/message-id/flat/CAKuK5J0raLwOiKfSh5d8SxtCY2snJAMsfo6RGTBMfcQYB%2B-faQ%40mail.gmail.com
[4]
https://www.postgresql.org/message-id/flat/CAA-aLv7tYHDzMGg4HtDZh0RQZjJc2v2weJ-Obm4yvkw6ePe9Qw%40mail.gmail.com
[5]
https://www.postgresql.org/message-id/CAKuK5J3R-oBh%2B9f23Ko-0-gt5Zi1REgg7ng-awQuUsgiY2B7GQ%40mail.gmail.com
[6]
https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

Thanks,
-Ritu

On Thu, 16 Jan 2025 at 12:01, Andy Fan  wrote:

>
> Hi,
>
> >
> > c=1 && \
> >   psql -c checkpoint -c 'select pg_switch_wal()' && \
> >   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT
> > pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
> >
> > wal_init_zero = 1: 885 TPS
> > wal_init_zero = 0: 286 TPS.
>
> Your theory looks clear and the result is promising. I can reproduce a
> similar result in my setup.
>
> on: tps = 1588.538378 (without initial connection time)
> off: tps = 857.755343 (without initial connection time)
>
> > Of course I chose this case to be intentionally extreme - each
> > transaction fills a bit more than one page of WAL and immediately
> > flushes it. That guarantees that each commit needs a separate
> > filesystem metadata flush and a flush of the data for the fdatasync()
> > at commit.
>
> However, if I increase the number of clients from 1 to 64 (which may break
> this extreme case because of group commit), we can see that wal_init_zero
> causes a noticeable regression.
>
> c=64 && \
>   psql -c checkpoint -c 'select pg_switch_wal()' && \
>   pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT
>   pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 1
>
> off:
> tps = 12135.110730 (without initial connection time)
> tps = 11964.016277 (without initial connection time)
> tps = 12078.458724 (without initial connection time)
>
> on:
> tps = 9392.374563 (without initial connection time)
> tps = 9391.916410 (without initial connection time)
> tps = 9390.503777 (without initial connection time)
>
> Now the wal_init_zero work happens in the user backend, and other backends
> also need to wait for it; this doesn't look good to me. I find that
> walwriter doesn't do much, so I'd like to try offloading wal_init_zero to
> the walwriter.
>
> About wal_recycle: IIUC, a WAL file can only be recycled during a
> checkpoint, but checkpoints don't happen often.
>
> --
> Best Regards
> Andy Fan
>
>