On Sat, Mar 23, 2024 at 6:57 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> On 3/23/24 14:47, Tomas Vondra wrote:
> > On 3/23/24 13:38, Robert Haas wrote:
> >> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.mu...@gmail.com>
> >> wrote:
[..]
> > Yeah, that's in write_reconstructed_file() and the patch does not touch
> > that at all. I agree it would be nice to use copy_file_range() in this
> > part too, and it doesn't seem it'd be that hard to do, I think.
> >
> > It seems we'd just need a "fork" that either calls pread/pwrite or
> > copy_file_range, depending on checksums and what was requested.
>
> Here's a patch to use copy_file_range in write_reconstructed_file too,
> when requested/possible. One thing that I'm not sure about is whether to
> do pg_fatal() if --copy-file-range but the platform does not support it.
[..]

Hi Tomas,

I gave the below patches a go today:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch
- v20240323-2-0002-write_reconstructed_file.patch

My assessment:
- v20240323-2-0001-pg_combinebackup-allow-using-clone-copy_.patch looks
  more or less good to go.
- v20240323-2-0002-write_reconstructed_file.patch needs work; without it
  the clone/copy_file_range() benefits are unlikely to show up.

Setup: Debian 12, ~100GB DB (pgbench -i -s 7000, plus some additional
tables with GiST and GIN indexes, just to see more WAL record types),
with backup sizes in MB like this:

    106831  full
      2823  incr.1  # captured after some time with pgbench -R 100
       165  incr.2  # captured after some time with pgbench -R 100

Test cmd:

    rm -rf outtest; sync; sync; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
    time /usr/pgsql17/bin/pg_combinebackup -o outtest full incr.1 incr.2

Test results of the various copy methods on a small, I/O-constrained XFS
device:

- normal copy: 31m47.407s
- --clone: error: file cloning not supported on this platform (it's due
  to the #ifdef, with COPY_FILE_RANGE being the one available)
- --copy-file-range: aborted, as it was taking too long. I expected it to
  accelerate things, but it did not... obviously this is the transparent
  fallback to normal copy when checksums have to be calculated.
- --manifest-checksums=NONE --copy-file-range: BUG, it keeps appending to
  just one file, e.g. outtest/base/5/16427.29, which grew to 200GB+ (?!)
  and ended with ENOSPC [more on this later]
- --manifest-checksums=NONE --copy-file-range without v20240323-2-0002:
  27m23.887s
- --manifest-checksums=NONE --copy-file-range with v20240323-2-0002 and
  the loop fix: 5m1.986s, but it creates corruption as it stands

Issues:

1. https://cirrus-ci.com/task/5937513653600256?logs=mingw_cross_warning#L327
complains about win32/mingw:

[15:47:27.184] In file included from copy_file.c:22:
[15:47:27.184] copy_file.c: In function ‘copy_file’:
[15:47:27.184] ../../../src/include/common/logging.h:134:6: error: this statement may fall through [-Werror=implicit-fallthrough=]
[15:47:27.184]   134 |  if (unlikely(__pg_log_level <= PG_LOG_DEBUG)) \
[15:47:27.184]       |      ^
[15:47:27.184] copy_file.c:96:5: note: in expansion of macro ‘pg_log_debug’
[15:47:27.184]    96 |     pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
[15:47:27.184]       |     ^~~~~~~~~~~~
[15:47:27.184] copy_file.c:99:4: note: here
[15:47:27.184]    99 |    case COPY_MODE_COPYFILE:
[15:47:27.184]       |    ^~~~
[15:47:27.184] cc1: all warnings being treated as errors

2. I do not know what the consensus is between --clone and
--copy-file-range, but if we have

    #ifdef FICLONE
        clone_works()
    #elif HAVE_COPY_FILE_RANGE
        copy_file_range_only_works()

then we should apply the same logic to the --help output, so that --clone
is not visible there (for consistency?); see the sketch below.

Also, the "error: file cloning not supported on this platform" message is
technically incorrect: Linux does support both ioctl(FICLONE) and
copy_file_range(), we are just not choosing one over the other (so
technically it is "available"). Nitpicking, I know.
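To make the --help point concrete, here is a rough sketch (hypothetical,
not the patch's actual code; FICLONE and HAVE_COPY_FILE_RANGE are stand-ins
for whatever macros the patch really tests) of gating the help lines on the
same conditions as the implementation:

/* help_sketch.c -- illustration only, not the real pg_combinebackup code */
#include <stdio.h>

#ifdef __linux__
#include <linux/fs.h>			/* may define FICLONE */
#endif

static void
usage(void)
{
	printf("pg_combinebackup [OPTION]... DIRECTORY...\n");
	printf("  -o, --output=DIRECTORY    output directory\n");
#ifdef FICLONE
	/* only advertise --clone where the clone path is actually compiled in */
	printf("      --clone               clone (reflink) files instead of copying\n");
#endif
#ifdef HAVE_COPY_FILE_RANGE
	printf("      --copy-file-range     copy files using copy_file_range()\n");
#endif
}

int
main(void)
{
	usage();
	return 0;
}

That way the advertised options could never get out of sync with what the
binary can actually do.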
Also the "error: file cloning not supported on this platform " is technically incorrect, Linux does support ioctl(FICLONE) and copy_file_range(), but we are just not choosing one over another (so technically it is "available"). Nitpicking I know. 3. [v20240323-2-0002-write_reconstructed_file.patch]: The mentioned ENOSPACE spiral-of-death-bug symptoms are like that: strace: copy_file_range(8, [697671680], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697679872], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697688064], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697696256], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697704448], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697712640], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697720832], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697729024], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697737216], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697745408], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697753600], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697761792], 9, NULL, 8192, 0) = 8192 copy_file_range(8, [697769984], 9, NULL, 8192, 0) = 8192 Notice that dest_off_t (poutoff) is NULL. (gdb) where #0 0x00007f2cd56f6733 in copy_file_range (infd=8, pinoff=pinoff@entry=0x7f2cd53f54e8, outfd=outfd@entry=9, poutoff=poutoff@entry=0x0, length=length@entry=8192, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/copy_file_range.c:28 #1 0x00005555ecd077f4 in write_reconstructed_file (copy_mode=COPY_MODE_COPY_FILE_RANGE, dry_run=false, debug=true, checksum_ctx=0x7ffc4cdb7700, offsetmap=<optimized out>, sourcemap=0x7f2cd54f6010, block_length=<optimized out>, output_filename=0x7ffc4cdba910 "outtest/base/5/16427.29", input_filename=0x7ffc4cdba510 "incr.2/base/5/INCREMENTAL.16427.29") at reconstruct.c:648 #2 reconstruct_from_incremental_file (input_filename=input_filename@entry=0x7ffc4cdba510 "incr.2/base/5/INCREMENTAL.16427.29", output_filename=output_filename@entry=0x7ffc4cdba910 "outtest/base/5/16427.29", relative_path=relative_path@entry=0x7ffc4cdbc670 "base/5", bare_file_name=bare_file_name@entry=0x5555ee2056ef "16427.29", n_prior_backups=n_prior_backups@entry=2, prior_backup_dirs=prior_backup_dirs@entry=0x7ffc4cdbf248, manifests=0x5555ee137a10, manifest_path=0x7ffc4cdbad10 "base/5/16427.29", checksum_type=CHECKSUM_TYPE_NONE, checksum_length=0x7ffc4cdb9864, checksum_payload=0x7ffc4cdb9868, debug=true, dry_run=false, copy_method=COPY_MODE_COPY_FILE_RANGE) at reconstruct.c:327 .. it's a spiral of death till ENOSPC. Reverting the v20240323-2-0002-write_reconstructed_file.patch helps. The problem lies in that do-wb=-inifity-loop (?) along with NULL for destination off_t. This seem to solves that thingy(?): - do - { - wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0); + //do + //{ + wb = copy_file_range(s->fd, &offsetmap[i], wfd, &offsetmap[i], BLCKSZ, 0); if (wb < 0) pg_fatal("error while copying file range from \"%s\" to \"%s\": %m", input_filename, output_filename); - } while (wb > 0); + //} while (wb > 0); #else ...so that way I've got it down to 5mins. 3. .. but onn startup I've got this after trying psql login: invalid page in block 0 of relation base/5/1259 . I've again reverted the v20240323-2-0002 to see if that helped for next-round of pg_combinebackup --manifest-checksums=NONE --copy-file-range and after 32mins of waiting it did help indeed: I was able to login and select counts worked and matched properly the data. I've reapplied the v20240323-2-0002 (with my fix to prevent that endless loop) and the issue was again(!) 
4. More efficiency is still on the table (an optional patch note, just for
completeness; I don't think we have time for that?): even if
v20240323-2-0002 worked, it would still issue one syscall for every 8kB.
We seem to be performing lots of per-8kB syscalls, which hinders
performance (both with copy_file_range and with normal copy):

pread64(8, ""..., 8192, 369115136) = 8192   // 369115136 + 8192 = 369123328 (matches the next pread offset)
write(9, ""..., 8192) = 8192
pread64(8, ""..., 8192, 369123328) = 8192   // 369123328 + 8192 = 369131520
write(9, ""..., 8192) = 8192
pread64(8, ""..., 8192, 369131520) = 8192   // and so on
write(9, ""..., 8192) = 8192

Apparently there is no merging of adjacent I/Os, so pg_combinebackup
wastes a lot of time issuing many small syscalls where it could, say,
issue a single larger pread/write (or even copy_file_range()); see the
rough sketch in the PS. I think this was not evident in my earlier testing
(200GB; 39min vs ~40s) because my incrementals contained much smaller
modifications (think 99% static data).

5. I think we should change the subject line with the next patch revision,
so that this functionality for incremental backups is not buried in the
pg_upgrade thread ;)

-J.
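PS: a rough sketch (hypothetical code, not reconstruct.c; the array names
and the helper are made up for illustration) of the adjacent-block merging
I mean in point 4: scan the block map, find runs of blocks that come from
the same source file at consecutive offsets, and issue one
copy_file_range() per run instead of one per 8kB block:

/* batch_copy_sketch.c -- illustration only */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * sourcefd[i] is the fd the i'th output block comes from, offsetmap[i] is
 * the offset of that block within its source file (the real code tracks
 * rfile pointers rather than raw fds, but the idea is the same).
 */
void
copy_blocks_batched(int wfd, unsigned nblocks,
					const int *sourcefd, const off_t *offsetmap)
{
	unsigned	i = 0;

	while (i < nblocks)
	{
		unsigned	j = i + 1;
		off_t		src_off;
		off_t		dst_off;
		size_t		remaining;

		/* extend the run while blocks are adjacent within the same source */
		while (j < nblocks &&
			   sourcefd[j] == sourcefd[i] &&
			   offsetmap[j] == offsetmap[i] + (off_t) (j - i) * BLCKSZ)
			j++;

		src_off = offsetmap[i];
		dst_off = (off_t) i * BLCKSZ;
		remaining = (size_t) (j - i) * BLCKSZ;

		/* one (hopefully) big copy per run, retrying short copies */
		while (remaining > 0)
		{
			ssize_t		wb = copy_file_range(sourcefd[i], &src_off,
											 wfd, &dst_off, remaining, 0);

			if (wb <= 0)
			{
				perror("copy_file_range");
				exit(1);
			}
			remaining -= (size_t) wb;
		}

		i = j;					/* the next run starts where this one ended */
	}
}

The same run detection would help the pread/pwrite path too, and of course
it only applies when checksums do not force reading every block anyway.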