On Tue, Apr 2, 2024 at 8:43 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
> And I think he's right, and my tests confirm this. I did a trivial patch
> to align the blocks to an 8K boundary, by forcing the header to be a
> multiple of 8K (I think 4K alignment would be enough). See the 0001
> patch that does this.
>
> And if I measure the disk space used by pg_combinebackup, and compare
> the results with results without the patch ([1] from a couple days
> back), I see this:
>
>    pct        not aligned    aligned
>    -------------------------------------
>     1%               689M        19M
>    10%              3172M        22M
>    20%             13797M        27M
>
> Yes, those numbers are correct. I didn't believe this at first, but the
> backups are valid/verified, checksums are OK, etc. BTRFS has similar
> numbers (e.g. drop from 20GB to 600MB).

Fantastic.

> I think we absolutely need to align the blocks in the incremental files,
> and I think we should do that now. I think 8K would work, but maybe we
> should add an alignment parameter to basebackup & manifest?
>
> The reason why I think maybe this should be a basebackup parameter is
> the recent discussion about large fs blocks - it seems to be in the
> works, so maybe better to be ready and not assume all fs have 4K.
>
> And I think we probably want to do this now, because this affects all
> tools dealing with incremental backups - even if someone writes a custom
> version of pg_combinebackup, it will have to deal with misaligned data.
> Perhaps there might be something like pg_basebackup that "transforms"
> the data received from the server (and also the backup manifest), but
> that does not seem like a great direction.

+1, and I think BLCKSZ is the right choice.
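Just to make the trade-off explicit (this is only a sketch of the idea as
I understand it, not the actual 0001 code): rounding each per-file header
up to the next block boundary wastes at most BLCKSZ - 1 bytes per file,
which is nothing next to the numbers above.

    #include <stddef.h>

    /*
     * Sketch only (not the patch's code): round a header size up to the
     * next block boundary (BLCKSZ in PostgreSQL, 8192 by default) so that
     * the block data that follows starts at an aligned file offset.
     */
    static size_t
    align_up(size_t header_bytes, size_t block_size)
    {
        return ((header_bytes + block_size - 1) / block_size) * block_size;
    }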
> I was very puzzled by the awful performance on ZFS. When every other fs
> (EXT4/XFS/BTRFS) took 150-200 seconds to run pg_combinebackup, it took
> 900-1000 seconds on ZFS, no matter what I did. I tried all the tuning
> advice I could think of, with almost no effect.
>
> Ultimately I decided that it probably is the "no readahead" behavior
> I've observed on ZFS. I assume it's because it doesn't use the page
> cache, where the regular readahead is detected, etc. And there's no
> prefetching in pg_combinebackup, so I decided to do an experiment and
> added a trivial explicit prefetch when reconstructing the file - every
> time we'd read data from a file, we do posix_fadvise for up to 128
> blocks ahead (similar to what the bitmap heap scan code does). See 0002.
>
> And tadaaa - the duration dropped from 900-1000 seconds to only about
> 250-300 seconds, so an improvement of a factor of 3-4x. I think this is
> pretty massive.

Interesting. ZFS certainly has its own prefetching heuristics with lots
of logic and settings, but it could be that it's using strict-next-block
detection of the access pattern (ie what I called
effective_io_readahead_window=0 in the streaming I/O thread) instead of
a window (ie like the Linux block device level readahead where, AFAIK,
it is triggered if you access anything in that sliding window), and
perhaps your test has a lot of non-contiguous but close-enough blocks?
(This also reminds me of the similar discussion on the BHS thread about
distinguishing sequential access from
mostly-sequential-but-with-lots-of-holes-like-Swiss-cheese, and the fine
line between them.)

You could double-check this and related settings (for example, I think
it might disable itself automatically if you're on a VM with a small RAM
size):

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#zfs-prefetch-disable
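For anyone following along without the patches handy, I assume the
explicit prefetch in 0002 boils down to something like this (a hand-wavy
sketch with made-up names, not the patch's actual code):

    #include <sys/types.h>
    #include <fcntl.h>

    #define PREFETCH_BLOCKS 128     /* up to 128 blocks ahead, ~1MB at 8K */

    /*
     * Sketch only: before reading block data from a source file at
     * read_offset, hint to the kernel that we'll soon want the next chunk,
     * so readahead can happen even if the filesystem's own heuristics
     * don't detect the access pattern.
     */
    static void
    prefetch_ahead(int fd, off_t read_offset, size_t block_size)
    {
    #ifdef POSIX_FADV_WILLNEED
        (void) posix_fadvise(fd, read_offset,
                             (off_t) (block_size * PREFETCH_BLOCKS),
                             POSIX_FADV_WILLNEED);
    #endif
    }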
> There are a couple more interesting ZFS details - the prefetching seems
> to be necessary even when using copy_file_range() and we don't need to
> read the data (to calculate checksums). This is why the "manifest=off"
> chart has the strange group of high bars at the end - the copy cases
> are fast because prefetch happens, but if we switch to copy_file_range()
> there are no prefetches and it gets slow.

Hmm, at a guess, it might be due to prefetching the dnode (root object
for a file) and block pointers, ie the structure but not the data
itself.

> This is a bit bizarre, especially because the manifest=on cases are
> still fast, exactly because the pread + prefetching still happens. I'm
> sure users would find this puzzling.
>
> Unfortunately, the prefetching is not beneficial for all filesystems.
> For XFS it does not seem to make any difference, but on BTRFS it seems
> to cause a regression.
>
> I think this means we may need a "--prefetch" option, that'd force
> prefetching, probably both before pread and copy_file_range. Otherwise
> people on ZFS are doomed and will have poor performance.

Seems reasonable if you can't fix it by tuning ZFS. (Might also be an
interesting research topic for a potential ZFS patch:
prefetch_swiss_cheese_window_size. I will not be nerd-sniped into
reading the relevant source today, but I'll figure it out soonish...)

> So I took a stab at this in 0007, which detects runs of blocks coming
> from the same source file (limited to 128 blocks, i.e. 1MB). I only did
> this for the copy_file_range() calls in 0007, and the results for XFS
> look like this (complete results are in the PDF):
>
>          old (block-by-block)    new (batches)
>    ------------------------------------------------------
>     1%                   150s               4s
>    10%               150-200s              46s
>    20%               150-200s              65s
>
> Yes, once again, those results are real, the backups are valid etc. So
> not only does it take much less space (thanks to block alignment), it
> also takes much less time (thanks to bulk operations).

Again, fantastic.
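Just to check that I've understood the batching: the shape I imagine 0007
has is roughly the following (made-up names and representation, not the
patch's; error handling and short-copy retry loops omitted):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/types.h>

    #define MAX_RUN_BLOCKS 128      /* cap runs at 128 blocks, ie 1MB at 8K */

    /*
     * Sketch only: source_fd[i] and source_off[i] say where output block i
     * should be copied from.  Find runs of consecutive output blocks that
     * are also contiguous in the same source file, and issue one
     * copy_file_range() per run instead of one per block.
     */
    static void
    copy_runs(int out_fd, const int *source_fd, const off_t *source_off,
              unsigned nblocks, size_t block_size)
    {
        unsigned    i = 0;

        while (i < nblocks)
        {
            unsigned    run = 1;
            off_t       in_off;
            off_t       out_off;

            /* extend the run while blocks stay adjacent in the same source */
            while (i + run < nblocks &&
                   run < MAX_RUN_BLOCKS &&
                   source_fd[i + run] == source_fd[i] &&
                   source_off[i + run] ==
                       source_off[i] + (off_t) (run * block_size))
                run++;

            /* copy the whole run with one call */
            in_off = source_off[i];
            out_off = (off_t) i * block_size;
            (void) copy_file_range(source_fd[i], &in_off, out_fd, &out_off,
                                   run * block_size, 0);

            i += run;
        }
    }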