On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to an 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>    pct        not aligned    aligned
>    -------------------------------------
>     1%               689M        19M
>    10%              3172M        22M
>    20%             13797M        27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add an alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.
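Just to make the trade-off explicit (this is only a sketch of the idea as
I understand it, not the actual 0001 code): rounding each per-file header
up to the next block boundary wastes at most BLCKSZ - 1 bytes per file,
which is nothing next to the numbers above.

    #include <stddef.h>

    /*
     * Sketch only (not the patch's code): round a header size up to the
     * next block boundary (BLCKSZ in PostgreSQL, 8192 by default) so that
     * the block data that follows starts at an aligned file offset.
     */
    static size_t
    align_up(size_t header_bytes, size_t block_size)
    {
        return ((header_bytes + block_size - 1) / block_size) * block_size;
    }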
> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior
> I've observed on ZFS. I assume it's because it doesn't use the page
> cache, where the regular readahead is detected, etc. And there's no
> prefetching in pg_combinebackup, so I decided to do an experiment and
> added a trivial explicit prefetch when reconstructing the file - every
> time we'd read data from a file, we do posix_fadvise for up to 128
> blocks ahead (similar to what the bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about
> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
> pretty massive.

Interesting. ZFS certainly has its own prefetching heuristics with lots
of logic and settings, but it could be that it's using strict-next-block
detection of the access pattern (ie what I called
effective_io_readahead_window=0 in the streaming I/O thread) instead of
a window (ie like the Linux block device level readahead where, AFAIK,
it is triggered if you access anything in that sliding window), and
perhaps your test has a lot of non-contiguous but close-enough blocks?
(This also reminds me of the similar discussion on the BHS thread about
distinguishing sequential access from
mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine
line between them.)

You could double-check this and related settings (for example, I think
it might disable itself automatically if you're on a VM with a small RAM
size):

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
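For anyone following along without the patches handy, I assume the
explicit prefetch in 0002 boils down to something like this (a hand-wavy
sketch with made-up names, not the patch's actual code):

    #include <sys/types.h>
    #include <fcntl.h>

    #define PREFETCH_BLOCKS 128     /* up to 128 blocks ahead, ~1MB at 8K */

    /*
     * Sketch only: before reading block data from a source file at
     * read_offset, hint to the kernel that we'll soon want the next chunk,
     * so readahead can happen even if the filesystem's own heuristics
     * don't detect the access pattern.
     */
    static void
    prefetch_ahead(int fd, off_t read_offset, size_t block_size)
    {
    #ifdef POSIX_FADV_WILLNEED
        (void) posix_fadvise(fd, read_offset,
                             (off_t) (block_size * PREFETCH_BLOCKS),
                             POSIX_FADV_WILLNEED);
    #endif
    }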
> There are a couple more interesting ZFS details - the prefetching seems
> to be necessary even when using copy_file_range() and we don't need to
> read the data (to calculate checksums). This is why the "manifest=off"
> chart has the strange group of high bars at the end - the copy cases
> are fast because prefetch happens, but if we switch to copy_file_range()
> there are no prefetches and it gets slow.

Hmm, at a guess, it might be due to prefetching the dnode (root object
for a file) and block pointers, ie the structure but not the data
itself.

> This is a bit bizarre, especially because the manifest=on cases are
> still fast, exactly because the pread + prefetching still happens. I'm
> sure users would find this puzzling.
>
> Unfortunately, the prefetching is not beneficial for all filesystems.
> For XFS it does not seem to make any difference, but on BTRFS it seems
> to cause a regression.
>
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Seems reasonable if you can't fix it by tuning ZFS. (Might also be an
interesting research topic for a potential ZFS patch:
prefetch_swiss_cheese_window_size. I will not be nerd-sniped into
reading the relevant source today, but I'll figure it out soonish...)

> So I took a stab at this in 0007, which detects runs of blocks coming
> from the same source file (limited to 128 blocks, i.e. 1MB). I only did
> this for the copy_file_range() calls in 0007, and the results for XFS
> look like this (complete results are in the PDF):
>
>          old (block-by-block)    new (batches)
>    ------------------------------------------------------
>     1%                   150s               4s
>    10%               150-200s              46s
>    20%               150-200s              65s
>
> Yes, once again, those results are real, the backups are valid etc. So
> not only does it take much less space (thanks to block alignment), it
> also takes much less time (thanks to bulk operations).

Again, fantastic.
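Just to check that I've understood the batching: the shape I imagine 0007
has is roughly the following (made-up names and representation, not the
patch's; error handling and short-copy retry loops omitted):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/types.h>

    #define MAX_RUN_BLOCKS 128      /* cap runs at 128 blocks, ie 1MB at 8K */

    /*
     * Sketch only: source_fd[i] and source_off[i] say where output block i
     * should be copied from.  Find runs of consecutive output blocks that
     * are also contiguous in the same source file, and issue one
     * copy_file_range() per run instead of one per block.
     */
    static void
    copy_runs(int out_fd, const int *source_fd, const off_t *source_off,
              unsigned nblocks, size_t block_size)
    {
        unsigned    i = 0;

        while (i < nblocks)
        {
            unsigned    run = 1;
            off_t       in_off;
            off_t       out_off;

            /* extend the run while blocks stay adjacent in the same source */
            while (i + run < nblocks &&
                   run < MAX_RUN_BLOCKS &&
                   source_fd[i + run] == source_fd[i] &&
                   source_off[i + run] ==
                       source_off[i] + (off_t) (run * block_size))
                run++;

            /* copy the whole run with one call */
            in_off = source_off[i];
            out_off = (off_t) i * block_size;
            (void) copy_file_range(source_fd[i], &in_off, out_fd, &out_off,
                                   run * block_size, 0);

            i += run;
        }
    }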