On Tue, Mar 26, 2024 at 7:03 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:

[..]

> > That's really strange.
Hi Tomas, but it looks like it's fixed now :)

> > --manifest-checksums=NONE --copy-file-range without v20240323-2-0002:
> > 27m23.887s
> > --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
> > loop-fix: 5m1.986s but it creates corruption as it stands
>
> Thanks. I plan to do more similar tests, once my machines get done with
> some other stuff.

Please do so, as I do not trust my fingers :-)

> > Issues:
> >
> > 1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
> > complains about win32/mingw:
> > [..]
>
> Yup, missing break.

Now it's
https://cirrus-ci.com/task/4997185324974080?logs=headers_headerscheck#L10 ,
reproducible with "make -s headerscheck EXTRAFLAGS='-fmax-errors=10'":

/tmp/postgres/src/bin/pg_combinebackup/reconstruct.h:34:91: error: unknown type name ‘CopyMode’
    CopyMode copy_mode);

To me it looks like reconstruct.h needs the definition of CopyMode,
which is in copy_file.h, so it should get an #include "copy_file.h".

> > 2. I do not know what's the consensus between --clone and
> > --copy-file-range, but if we have #ifdef FICLONE clone_works() #elif
> > HAVE_COPY_FILE_RANGE copy_file_range_only_works() then we should also
> > apply the same logic to the --help so that --clone is not visible
> > there (for consistency?). Also the "error: file cloning not supported
> > on this platform" is technically incorrect, Linux does support
> > ioctl(FICLONE) and copy_file_range(), but we are just not choosing one
> > over another (so technically it is "available"). Nitpicking I know.
>
> That's a good question, I'm not sure. But whatever we do, we should do
> the same thing in pg_upgrade. Maybe there's some sort of precedent?

Sigh, you are right... It's consistency hell.

> > 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned
> > ENOSPACE spiral-of-death-bug symptoms are like that: [..]
>
> Yeah, that retry logic is wrong. I ended up copying the check from the
> "regular" copy branch, which simply bails out if copy_file_range returns
> anything but the expected 8192.
>
> I wonder if this should deal with partial writes, though. I mean, it's
> allowed copy_file_range() copies only some of the bytes - I don't know
> how often / in what situations that happens, though ... And if we want
> to handle that for copy_file_range(), pwrite() needs the same treatment.

Maybe that helps?
https://github.com/coreutils/coreutils/blob/606f54d157c3d9d558bdbe41da8d108993d86aeb/src/copy.c#L1427
It's harder than I anticipated (we can ignore the sparse logic though, I
think).
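Something along these lines is what I have in mind - just a rough sketch
(made-up helper name, not the actual patch; error handling simplified),
mirroring the coreutils loop minus the sparse-file handling:

#include "postgres_fe.h"

#include <unistd.h>

#include "common/logging.h"

/*
 * Sketch only: copy "nbytes" starting at "offset" from src_fd to dst_fd,
 * retrying on short copies instead of bailing out.
 */
static void
copy_range_looped(int src_fd, int dst_fd, off_t offset, size_t nbytes)
{
    off_t       src_off = offset;
    off_t       dst_off = offset;

    while (nbytes > 0)
    {
        ssize_t     nc = copy_file_range(src_fd, &src_off,
                                         dst_fd, &dst_off,
                                         nbytes, 0);

        if (nc < 0)
            pg_fatal("error while copying file range: %m");
        if (nc == 0)
            pg_fatal("unexpected end of source file");

        /* copy_file_range() already advanced src_off and dst_off by nc */
        nbytes -= nc;
    }
}

(And if we go that way, the pwrite() path would need a similar loop, as
you said.)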
> > 3. .. but on startup I've got this after trying psql login: invalid
> > page in block 0 of relation base/5/1259 . [..]
>
> Can you see if you can still reproduce this with the attached version?

Looks like it's fixed now and it works great (~3min, multiple times)!

BTW: I've also tried it over NFSv4 over loopback with XFS (as
copy_file_range() probably does server-side optimization there), but
somehow it was so slow that it is close to being unusable (~9GB out of
104GB reconstructed after 45 mins) - maybe it's due to the NFS mount
opts; I don't think we should worry too much. I think it's related to
the missing optimization discussed below, if that matters. I think it's
too early to warn users about NFS (I've spent just 10 minutes on it),
but on the other hand people might complain it's broken...

> > Apparently there's no merging of adjacent IO/s, so pg_combinebackup
> > wastes lots of time on issuing instead small syscalls but it could
> > let's say do single pread/write (or even copy_file_range()). I think
> > it was not evident in my earlier testing (200GB; 39min vs ~40s) as I
> > had much smaller modifications in my incremental (think of 99% of
> > static data).
>
> Yes, I've been thinking about exactly this optimization, but I think
> we're way past proposing this for PG17. The changes that would require
> in reconstruct_from_incremental_file are way too significant. Has to
> wait for PG18 ;-)

Sure thing!

> I do think there's more on the table, as mentioned by Thomas a couple
> days ago - maybe we shouldn't approach clone/copy_file_range merely as
> an optimization to save time, it might be entirely reasonable to do this
> simply to allow the filesystem to do CoW magic and save space (even if
> we need to read the data and recalculate the checksum, which now
> disables these copy methods).

Sure! I think time will still be a priority though, as pg_combinebackup
duration impacts RTO, while disk space is relatively cheap. One could
argue that reconstructing 50TB will be a challenge, though. Now, my
tests indicate that the space saving already happens with the 0002
patch - the stats for a 100GB DB / full backup look like this (so I
think we are good when using CoW - not so without CoW) -- or did I
misunderstand something?:

root@jw-test-1:/backups# du -sm /backups/
214612  /backups/
root@jw-test-1:/backups# du -sm *
106831  full
2823    incr.1
165     incr.2
104794  outtest
root@jw-test-1:/backups# df -h .   # this double-confirms that just 114GB is used (XFS), great!
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  114G  387G  23% /backups
root@jw-test-1:/backups# # https://github.com/pwaller/sharedextents
root@jw-test-1:/backups# ./sharedextents-linux-amd64 full/base/5/16427.68 outtest/base/5/16427.68
1056915456 / 1073741824 bytes (98.43%)   # extents reused

Now I was wondering a little bit whether the large number of XFS extents
won't hurt read performance (they were probably created due to many
independent copy_file_range() calls):

root@jw-test-1:/backups# filefrag full/base/5/16427.68
full/base/5/16427.68: 1 extent found
root@jw-test-1:/backups# filefrag outtest/base/5/16427.68
outtest/base/5/16427.68: 3979 extents found

However, at first glance, sequential reads of such a CoW file are still
fine (I'm assuming such a backup would, after reconstruction, be copied
from this backup server back to the proper DB server):

root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches
root@jw-test-1:/backups# time cat outtest/base/5/16427.68 > /dev/null
real    0m4.286s
root@jw-test-1:/backups# echo 3 > /proc/sys/vm/drop_caches
root@jw-test-1:/backups# time cat full/base/5/16427.68 > /dev/null
real    0m4.325s

Now, Thomas wrote there "then it might make sense to do that even if you
*also* have to read the data to compute the checksums, I think?" ...
that sounds like read(), checksum(), and still do copy_file_range()
instead of pwrite()? PG 18? :D

-J.
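PS. Just to make sure I understand the "read for checksums, CoW for the
data" idea, a rough sketch of what I mean (names are illustrative only;
error handling and the short-copy loop from above are omitted):

#include "postgres_fe.h"

#include <unistd.h>

#include "common/checksum_helper.h"
#include "common/logging.h"

/*
 * Illustrative only: read the block just to feed the checksum, but still
 * move the bytes with copy_file_range() so the filesystem can share
 * extents instead of writing a fresh physical copy.
 */
static void
copy_block_checksummed(int src_fd, int dst_fd, off_t offset,
                       pg_checksum_context *checksum_ctx)
{
    uint8       buf[BLCKSZ];
    off_t       src_off = offset;
    off_t       dst_off = offset;

    if (pread(src_fd, buf, BLCKSZ, offset) != BLCKSZ)
        pg_fatal("could not read block at offset %lld", (long long) offset);

    if (pg_checksum_update(checksum_ctx, buf, BLCKSZ) < 0)
        pg_fatal("could not update checksum");

    if (copy_file_range(src_fd, &src_off, dst_fd, &dst_off, BLCKSZ, 0) != BLCKSZ)
        pg_fatal("could not copy file range: %m");
}

If that's the idea, it would let us keep recalculating the checksums
while still getting the extent sharing shown above.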