Big PostgreSQL databases use, and regularly open and close, huge numbers of file descriptors and directory entries, for various anachronistic reasons, one of which is the 1GB RELSEG_SIZE scheme. The segment management code is trickier than you might think, and it still harbours known bugs.
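To make the segmentation overhead concrete, here is a quick standalone sketch (not from the patch set; the names and example paths are illustrative only) of the traditional block-to-segment arithmetic that md.c has to perform on every access, assuming the default BLCKSZ of 8192 and RELSEG_SIZE of 131072 blocks (1GB):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ      8192     /* default PostgreSQL block size */
#define RELSEG_SIZE 131072   /* default blocks per segment = 1GB */

/*
 * Illustrative only: where the traditional segmented scheme finds block
 * 'blkno' of a fork stored under 'path' (e.g. "base/5/16384").
 */
static void
locate_block(const char *path, uint32_t blkno)
{
	uint32_t	segno = blkno / RELSEG_SIZE;
	uint64_t	offset = (uint64_t) BLCKSZ * (blkno % RELSEG_SIZE);

	if (segno == 0)
		printf("block %u -> %s at offset %llu\n",
			   blkno, path, (unsigned long long) offset);
	else
		printf("block %u -> %s.%u at offset %llu\n",
			   blkno, path, segno, (unsigned long long) offset);
}

int
main(void)
{
	locate_block("base/5/16384", 1000);		/* lands in file 16384 */
	locate_block("base/5/16384", 300000);	/* lands in file 16384.2 */
	return 0;
}

Every block access that crosses into another segment potentially means another file descriptor to open, manage and eventually close, which is where the kernel-side and PostgreSQL-side bookkeeping costs come from.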
A nearby analysis of yet another obscure segment life cycle bug reminded me of this patch set to switch to simple large files and eventually drop all of that. I originally meant to develop the attached sketch-quality code further and propose it in the 16 cycle, while I was down the modernisation rabbit hole[1], but then I got side-tracked: at some point I believed that the 56 bit relfilenode thing might be necessary for correctness, but then I found a set of rules that seems to hold up without it. I figured I might as well post what I have early in the 17 cycle as a "concept" patch to see which way the flames blow.

There are various boring details due to Windows, then a load of fairly obvious changes, and then a whole can of worms about how we'd handle the transition for the world's fleet of existing databases. I'll cut straight to that part. Different choices on aggressiveness could be made, but here are the straw-man answers I came up with so far:

1. All new relations would be in large format only. No 16384.N files, just a single 16384 file that can grow to MaxBlockNumber * BLCKSZ, about 32TB with the default block size (see the arithmetic sketch below).

2. The existence of a file 16384.1 means that this smgr relation is in legacy segmented format that came from pg_upgrade. (Note that we don't unlink that file once it exists, even when truncating the fork, until we eventually drop the relation.)

3. Forks that were pg_upgrade'd from earlier releases using hard links or reflinks would implicitly be in large format if they only had one segment; otherwise they could stay in the traditional format for a grace period of N major releases, after which we'd plan to drop segment support. pg_upgrade's [ref]link mode would therefore be the only way to get a segmented relation, other than a developer-only trick for testing/debugging.

4. Every opportunity to convert a multi-segment fork to large format would be taken: pg_upgrade in copy mode, basebackup, CREATE DATABASE ... STRATEGY=FILE_COPY, VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch versions of all the cases I thought of so far in the attached patches.

5. The main places that do file-level copying of relations would use copy_file_range() to do the splicing, so that on file systems that are smart enough (XFS, ZFS, BTRFS, ...), with qualifying source and destination, the operation can be very fast. Other degrees of optimisation are available to the kernel too, even for file systems without block sharing magic (pushing block range copies down to hardware/network storage, etc).

The copy_file_range() stuff could also be proposed independently (I vaguely recall it was discussed a few times before); it's just that it really comes into its own when you start splicing files together, as needed here. It has also been adopted by FreeBSD with the same interface as Linux, and has an efficient implementation there in bleeding edge ZFS.

Stepping back, the main ideas are: (1) for some users of large databases, the conversion would happen painlessly at upgrade time, without their even really noticing, using modern file system facilities where possible for speed; (2) anyone who wants to defer it, because of a lack of fast copy_file_range() and a desire to avoid prolonged downtime by using links or reflinks, can put concatenation off for the next N releases, giving a total of 5 + N years of option to defer the work. In that case there are also many ways to proactively change to large format before the time comes, with varying degrees of granularity and disruption: for example, set up a new replica and fail over, or VACUUM FULL tables one at a time, etc.
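For a sense of scale on point 1, here's a back-of-envelope sketch (again illustrative, not from the patches) of the size ceiling for a single large-format file and the number of legacy segment files it replaces, using the default 8KB block size and MaxBlockNumber from block.h:

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ         8192
#define RELSEG_SIZE    131072
#define MaxBlockNumber ((uint32_t) 0xFFFFFFFE)	/* from block.h */

int
main(void)
{
	uint64_t	max_bytes = (uint64_t) MaxBlockNumber * BLCKSZ;
	uint32_t	max_segs = MaxBlockNumber / RELSEG_SIZE + 1;

	/* One large-format file can cover the whole addressable fork. */
	printf("max fork size: %llu bytes (~%llu TiB)\n",
		   (unsigned long long) max_bytes,
		   (unsigned long long) (max_bytes >> 40));

	/* The legacy format spreads the same fork over this many files. */
	printf("legacy segment files at that size: %u\n", max_segs);
	return 0;
}

In other words, a maximally sized fork is a single ~32TB file in large format, versus roughly 32 thousand 1GB files in the traditional scheme.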
There are plenty of things left to do in this patch set: pg_rewind doesn't understand optional segmentation yet, there are probably more things like that, and I expect there are some ssize_t vs pgoff_t confusions I missed that could bite a 32 bit system. But you can see the basics working on a typical system.

I am not aware of any modern/non-historic file system[2] that can't handle large files with ease. Anyone know of anything to worry about on that front? I think the main collateral damage would be weird old external tools, like some weird old version of Windows tar I occasionally see mentioned, that sort of thing, but that'd just be another case of "well don't use that then", I guess? What else might we need to think about, outside PostgreSQL? What other problems might occur inside PostgreSQL?

Clearly we'd need to figure out a decent strategy to automate testing of all of the relevant transitions. We could test the splicing code paths with an optional test suite that you might enable along with a small segment size (as we're already testing on CI and probably the build farm after the last round of segmentation bugs). To test the messy Windows off_t API stuff convincingly, we'd need actual > 4GB files, I think? Maybe doable cheaply with file system hole punching tricks; one possible shape for that is sketched below, after the references.

Speaking of file system holes, this patch set doesn't touch buffile.c. That code wants to use segments for two extra purposes: (1) parallel CREATE INDEX merges workers' output using segmentation tricks, as if there were holes in the file; this could perhaps be replaced with large files that make use of actual OS-level holes, but I didn't feel like additionally claiming that all computers have sparse files -- perhaps another approach is needed anyway; (2) buffile.c deliberately spreads large buffiles across multiple temporary tablespaces using segments, supposedly for space management reasons. So although buffile.c initially looks like a nice safe little place to start using large files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1] https://wiki.postgresql.org/wiki/AllComputers
[2] https://en.wikipedia.org/wiki/Comparison_of_file_systems
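On the > 4GB testing question, for what it's worth, here's one cheap way such a test file could be manufactured on a POSIX system with sparse file support (an illustrative sketch only, not part of the patch set; on 32 bit Linux it would need -D_FILE_OFFSET_BITS=64, and on Windows you'd want FSCTL_SET_SPARSE instead):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const off_t size = 5LL * 1024 * 1024 * 1024;	/* 5GiB, > 32 bit off_t */
	int			fd;

	fd = open("bigfile.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/*
	 * Extending with ftruncate() leaves a hole on file systems that support
	 * sparse files, so this consumes (almost) no disk space.
	 */
	if (ftruncate(fd, size) < 0)
	{
		perror("ftruncate");
		return 1;
	}

	/* Materialize one real byte at the end, past the 4GB boundary. */
	if (pwrite(fd, "x", 1, size - 1) != 1)
	{
		perror("pwrite");
		return 1;
	}

	close(fd);
	return 0;
}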
From b4b6f27af1d196f9d6b3b8d5991216666cf2900f Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Mon, 24 Apr 2023 18:04:43 +1200 Subject: [PATCH 01/11] Assert that pgoff_t is wide enough. On Windows, we know it's wide enough because we define it directly ourselves. On Unix, we use off_t, which may only be 32 bits wide on some systems, depending on compiler switches or macros. Make absolutely certain that we are not confused on this point with an assertion, or we'd corrupt large files. --- src/backend/storage/file/fd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 277a28fc13..053588a302 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -102,6 +102,9 @@ #include "utils/resowner_private.h" #include "utils/varlena.h" +StaticAssertDecl(sizeof(pgoff_t) >= 8, + "pgoff_t not big enough to support large files"); + /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */ #if defined(HAVE_SYNC_FILE_RANGE) #define PG_FLUSH_DATA_WORKS 1 -- 2.40.1
From 6154e35d35515a7536524b79cb7ccd6a39d41afe Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 11:24:51 +1300 Subject: [PATCH 02/11] Use pgoff_t in system call replacements on Windows. All modern Unix systems have 64 bit off_t, but Windows does not. Use our pgoff_t type in our POSIX-style replacement functions (lseek(), ftruncate(), pread(), pwrite() etc etc). Also in closely related functions like pg_pwrite_zeros(). --- configure | 6 +++ configure.ac | 1 + src/common/file_utils.c | 4 +- src/include/common/file_utils.h | 4 +- src/include/port.h | 2 +- src/include/port/pg_iovec.h | 4 +- src/include/port/win32_port.h | 23 ++++++++++-- src/port/meson.build | 1 + src/port/preadv.c | 2 +- src/port/pwritev.c | 2 +- src/port/win32ftruncate.c | 65 +++++++++++++++++++++++++++++++++ src/port/win32pread.c | 3 +- src/port/win32pwrite.c | 3 +- src/tools/msvc/Mkvcbuild.pm | 1 + 14 files changed, 106 insertions(+), 15 deletions(-) create mode 100644 src/port/win32ftruncate.c diff --git a/configure b/configure index 15daccc87f..47ba18491c 100755 --- a/configure +++ b/configure @@ -16537,6 +16537,12 @@ esac ;; esac + case " $LIBOBJS " in + *" win32ftruncate.$ac_objext "* ) ;; + *) LIBOBJS="$LIBOBJS win32ftruncate.$ac_objext" + ;; +esac + case " $LIBOBJS " in *" win32getrusage.$ac_objext "* ) ;; *) LIBOBJS="$LIBOBJS win32getrusage.$ac_objext" diff --git a/configure.ac b/configure.ac index 97f5be6c73..2b3b1b4dca 100644 --- a/configure.ac +++ b/configure.ac @@ -1905,6 +1905,7 @@ if test "$PORTNAME" = "win32"; then AC_LIBOBJ(win32env) AC_LIBOBJ(win32error) AC_LIBOBJ(win32fdatasync) + AC_LIBOBJ(win32ftruncate) AC_LIBOBJ(win32getrusage) AC_LIBOBJ(win32link) AC_LIBOBJ(win32ntdll) diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 74833c4acb..7a63434bc4 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -469,7 +469,7 @@ get_dirent_type(const char *path, * error is returned, it is unspecified how much has been written. */ ssize_t -pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) +pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset) { struct iovec iov_copy[PG_IOV_MAX]; ssize_t sum = 0; @@ -538,7 +538,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) * is returned with errno set. */ ssize_t -pg_pwrite_zeros(int fd, size_t size, off_t offset) +pg_pwrite_zeros(int fd, size_t size, pgoff_t offset) { static const PGIOAlignedBlock zbuffer = {{0}}; /* worth BLCKSZ */ void *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data; diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h index b7efa1226d..534277b12d 100644 --- a/src/include/common/file_utils.h +++ b/src/include/common/file_utils.h @@ -42,8 +42,8 @@ extern PGFileType get_dirent_type(const char *path, extern ssize_t pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, - off_t offset); + pgoff_t offset); -extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset); +extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset); #endif /* FILE_UTILS_H */ diff --git a/src/include/port.h b/src/include/port.h index a88d403483..f7707a390e 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -368,7 +368,7 @@ extern FILE *pgwin32_popen(const char *command, const char *type); * When necessary, these routines are provided by files in src/port/. 
*/ -/* Type to use with fseeko/ftello */ +/* Type to use with lseek/ftruncate/pread/fseeko/ftello */ #ifndef WIN32 /* WIN32 is handled in port/win32_port.h */ #define pgoff_t off_t #endif diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h index 689799c425..c762fab662 100644 --- a/src/include/port/pg_iovec.h +++ b/src/include/port/pg_iovec.h @@ -43,13 +43,13 @@ struct iovec #if HAVE_DECL_PREADV #define pg_preadv preadv #else -extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset); +extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset); #endif #if HAVE_DECL_PWRITEV #define pg_pwritev pwritev #else -extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset); +extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset); #endif #endif /* PG_IOVEC_H */ diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h index 58965e0dfd..c757687386 100644 --- a/src/include/port/win32_port.h +++ b/src/include/port/win32_port.h @@ -76,11 +76,19 @@ #undef fstat #undef stat +/* and likewise for lseek hack */ +#define lseek microsoft_native_lseek +#include <io.h> +#undef lseek + +/* and also ftruncate, as defined by MinGW headers with 32 bit offset */ +#define ftruncate mingw_native_ftruncate +#include <unistd.h> +#undef ftruncate + /* Must be here to avoid conflicting with prototype in windows.h */ #define mkdir(a,b) mkdir(a) -#define ftruncate(a,b) chsize(a,b) - /* Windows doesn't have fsync() as such, use _commit() */ #define fsync(fd) _commit(fd) @@ -219,6 +227,7 @@ extern int _pgfseeko64(FILE *stream, pgoff_t offset, int origin); extern pgoff_t _pgftello64(FILE *stream); #define fseeko(stream, offset, origin) _pgfseeko64(stream, offset, origin) #define ftello(stream) _pgftello64(stream) +#define lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin)) #else #ifndef fseeko #define fseeko(stream, offset, origin) fseeko64(stream, offset, origin) @@ -226,7 +235,13 @@ extern pgoff_t _pgftello64(FILE *stream); #ifndef ftello #define ftello(stream) ftello64(stream) #endif +#ifndef lseek +#define lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin)) #endif +#endif + +/* 64 bit ftruncate is in win32ftruncate.c */ +extern int ftruncate(int fd, pgoff_t length); /* * Win32 also doesn't have symlinks, but we can emulate them with @@ -586,9 +601,9 @@ typedef unsigned short mode_t; #endif /* in port/win32pread.c */ -extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset); +extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset); /* in port/win32pwrite.c */ -extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset); +extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset); #endif /* PG_WIN32_PORT_H */ diff --git a/src/port/meson.build b/src/port/meson.build index 24416b9bfc..54ce59806a 100644 --- a/src/port/meson.build +++ b/src/port/meson.build @@ -35,6 +35,7 @@ if host_system == 'windows' 'win32error.c', 'win32fdatasync.c', 'win32fseek.c', + 'win32ftruncate.c', 'win32getrusage.c', 'win32link.c', 'win32ntdll.c', diff --git a/src/port/preadv.c b/src/port/preadv.c index e762283e67..6e5e92234f 100644 --- a/src/port/preadv.c +++ b/src/port/preadv.c @@ -19,7 +19,7 @@ #include "port/pg_iovec.h" ssize_t -pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset) +pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset) { ssize_t sum = 0; ssize_t part; diff --git 
a/src/port/pwritev.c b/src/port/pwritev.c index 519de45037..c430f99806 100644 --- a/src/port/pwritev.c +++ b/src/port/pwritev.c @@ -19,7 +19,7 @@ #include "port/pg_iovec.h" ssize_t -pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset) +pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset) { ssize_t sum = 0; ssize_t part; diff --git a/src/port/win32ftruncate.c b/src/port/win32ftruncate.c new file mode 100644 index 0000000000..5e6d4f3e92 --- /dev/null +++ b/src/port/win32ftruncate.c @@ -0,0 +1,65 @@ +/*------------------------------------------------------------------------- + * + * win32ftruncate.c + * Win32 ftruncate() replacement + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group + * + * src/port/win32ftruncate.c + * + *------------------------------------------------------------------------- + */ + +#ifdef FRONTEND +#include "postgres_fe.h" +#else +#include "postgres.h" +#endif + +int +ftruncate(int fd, pgoff_t length) +{ + HANDLE handle; + pgoff_t save_position; + + /* + * We can't use chsize() because it works with 32 bit off_t. We can't use + * _chsize_s() because it isn't available in MinGW. So we have to use + * SetEndOfFile(), but that works with the current position. So we save + * and restore it. + */ + + handle = (HANDLE) _get_osfhandle(fd); + if (handle == INVALID_HANDLE_VALUE) + { + errno = EBADF; + return -1; + } + + save_position = lseek(fd, 0, SEEK_CUR); + if (save_position < 0) + return -1; + + if (lseek(fd, length, SEEK_SET) < 0) + { + int save_errno = errno; + lseek(fd, save_position, SEEK_SET); + errno = save_errno; + return -1; + } + + if (!SetEndOfFile(handle)) + { + int save_errno; + + _dosmaperr(GetLastError()); + save_errno = errno; + lseek(fd, save_position, SEEK_SET); + errno = save_errno; + return -1; + } + lseek(fd, save_position, SEEK_SET); + + return 0; +} diff --git a/src/port/win32pread.c b/src/port/win32pread.c index 905cf9f42b..6e6366faaa 100644 --- a/src/port/win32pread.c +++ b/src/port/win32pread.c @@ -17,7 +17,7 @@ #include <windows.h> ssize_t -pg_pread(int fd, void *buf, size_t size, off_t offset) +pg_pread(int fd, void *buf, size_t size, pgoff_t offset) { OVERLAPPED overlapped = {0}; HANDLE handle; @@ -32,6 +32,7 @@ pg_pread(int fd, void *buf, size_t size, off_t offset) /* Note that this changes the file position, despite not using it. */ overlapped.Offset = offset; + overlapped.OffsetHigh = offset >> 32; if (!ReadFile(handle, buf, size, &result, &overlapped)) { if (GetLastError() == ERROR_HANDLE_EOF) diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c index 5dd10821cf..90dd93dbc5 100644 --- a/src/port/win32pwrite.c +++ b/src/port/win32pwrite.c @@ -17,7 +17,7 @@ #include <windows.h> ssize_t -pg_pwrite(int fd, const void *buf, size_t size, off_t offset) +pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset) { OVERLAPPED overlapped = {0}; HANDLE handle; @@ -32,6 +32,7 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset) /* Note that this changes the file position, despite not using it. 
*/ overlapped.Offset = offset; + overlapped.OffsetHigh = offset >> 32; if (!WriteFile(handle, buf, size, &result, &overlapped)) { _dosmaperr(GetLastError()); diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 958206f315..4b96c2bb44 100644 --- a/src/tools/msvc/Mkvcbuild.pm +++ b/src/tools/msvc/Mkvcbuild.pm @@ -113,6 +113,7 @@ sub mkvcbuild win32env.c win32error.c win32fdatasync.c win32fseek.c + win32ftruncate.c win32getrusage.c win32gettimeofday.c win32link.c -- 2.40.1
From 2782d8c1b5c6ff266488536c49cb3a4d4a7b4da6 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 11:27:16 +1300 Subject: [PATCH 03/11] Support large files on Windows in our VFD API. All fd.c interfaces that take off_t now need to use pgoff_t instead, because we can't use Windows' 32 bit off_t. --- src/backend/storage/file/fd.c | 30 +++++++++++++++--------------- src/include/storage/fd.h | 20 ++++++++++---------- 2 files changed, 25 insertions(+), 25 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 053588a302..f5e194a797 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -204,7 +204,7 @@ typedef struct vfd File nextFree; /* link to next free VFD, if in freelist */ File lruMoreRecently; /* doubly linked recency-of-use list */ File lruLessRecently; - off_t fileSize; /* current size of file (0 if not temporary) */ + pgoff_t fileSize; /* current size of file (0 if not temporary) */ char *fileName; /* name of file, or NULL for unused VFD */ /* NB: fileName is malloc'd, and must be free'd when closing the VFD */ int fileFlags; /* open(2) flags for (re)opening the file */ @@ -463,7 +463,7 @@ pg_fdatasync(int fd) * offset of 0 with nbytes 0 means that the entire file should be flushed */ void -pg_flush_data(int fd, off_t offset, off_t nbytes) +pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes) { /* * Right now file flushing is primarily used to avoid making later @@ -636,7 +636,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes) * Truncate a file to a given length by name. */ int -pg_truncate(const char *path, off_t length) +pg_truncate(const char *path, pgoff_t length) { #ifdef WIN32 int save_errno; @@ -1439,7 +1439,7 @@ FileAccess(File file) * Called whenever a temporary file is deleted to report its size. */ static void -ReportTemporaryFileUsage(const char *path, off_t size) +ReportTemporaryFileUsage(const char *path, pgoff_t size) { pgstat_report_tempfile(size); @@ -1989,7 +1989,7 @@ FileClose(File file) * to read into. 
*/ int -FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info) +FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info) { #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED) int returnCode; @@ -2017,7 +2017,7 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info) } void -FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info) +FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info) { int returnCode; @@ -2043,7 +2043,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info) } int -FileRead(File file, void *buffer, size_t amount, off_t offset, +FileRead(File file, void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info) { int returnCode; @@ -2099,7 +2099,7 @@ retry: } int -FileWrite(File file, const void *buffer, size_t amount, off_t offset, +FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info) { int returnCode; @@ -2128,7 +2128,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset, */ if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT)) { - off_t past_write = offset + amount; + pgoff_t past_write = offset + amount; if (past_write > vfdP->fileSize) { @@ -2160,7 +2160,7 @@ retry: */ if (vfdP->fdstate & FD_TEMP_FILE_LIMIT) { - off_t past_write = offset + amount; + pgoff_t past_write = offset + amount; if (past_write > vfdP->fileSize) { @@ -2224,7 +2224,7 @@ FileSync(File file, uint32 wait_event_info) * appropriate error. */ int -FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) +FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info) { int returnCode; ssize_t written; @@ -2269,7 +2269,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) * appropriate error. 
*/ int -FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) +FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info) { #ifdef HAVE_POSIX_FALLOCATE int returnCode; @@ -2305,7 +2305,7 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) return FileZero(file, offset, amount, wait_event_info); } -off_t +pgoff_t FileSize(File file) { Assert(FileIsValid(file)); @@ -2316,14 +2316,14 @@ FileSize(File file) if (FileIsNotOpen(file)) { if (FileAccess(file) < 0) - return (off_t) -1; + return (pgoff_t) -1; } return lseek(VfdCache[file].fd, 0, SEEK_END); } int -FileTruncate(File file, off_t offset, uint32 wait_event_info) +FileTruncate(File file, pgoff_t offset, uint32 wait_event_info) { int returnCode; diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 6791a406fc..a4528428ff 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -110,16 +110,16 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags); extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode); extern File OpenTemporaryFile(bool interXact); extern void FileClose(File file); -extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info); -extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info); -extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info); +extern int FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info); +extern int FileRead(File file, void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info); +extern int FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info); extern int FileSync(File file, uint32 wait_event_info); -extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info); -extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info); +extern int FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info); +extern int FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info); -extern off_t FileSize(File file); -extern int FileTruncate(File file, off_t offset, uint32 wait_event_info); -extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info); +extern pgoff_t FileSize(File file); +extern int FileTruncate(File file, pgoff_t offset, uint32 wait_event_info); +extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info); extern char *FilePathName(File file); extern int FileGetRawDesc(File file); extern int FileGetRawFlags(File file); @@ -186,8 +186,8 @@ extern int pg_fsync(int fd); extern int pg_fsync_no_writethrough(int fd); extern int pg_fsync_writethrough(int fd); extern int pg_fdatasync(int fd); -extern void pg_flush_data(int fd, off_t offset, off_t nbytes); -extern int pg_truncate(const char *path, off_t length); +extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes); +extern int pg_truncate(const char *path, pgoff_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int elevel); -- 2.40.1
From ed3a5558a03afaabb7c4c206c053c288c104cb02 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 12:36:55 +1300 Subject: [PATCH 04/11] Use pgoff_t instead of off_t in more places. XXX Incomplete --- src/backend/access/heap/rewriteheap.c | 2 +- src/backend/backup/basebackup.c | 7 ++++--- src/backend/storage/file/copydir.c | 4 ++-- src/bin/pg_basebackup/receivelog.c | 2 +- src/bin/pg_rewind/file_ops.c | 4 ++-- src/bin/pg_rewind/file_ops.h | 4 ++-- src/bin/pg_rewind/filemap.c | 2 ++ src/bin/pg_rewind/libpq_source.c | 6 +++--- src/bin/pg_rewind/local_source.c | 8 ++++---- src/bin/pg_rewind/pg_rewind.c | 2 +- src/bin/pg_rewind/rewind_source.h | 2 +- src/include/access/heapam_xlog.h | 2 +- 12 files changed, 24 insertions(+), 21 deletions(-) diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 424958912c..5e5b00d25a 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -194,7 +194,7 @@ typedef struct RewriteMappingFile { TransactionId xid; /* xid that might need to see the row */ int vfd; /* fd of mappings file */ - off_t off; /* how far have we written yet */ + pgoff_t off; /* how far have we written yet */ dclist_head mappings; /* list of in-memory mappings */ char path[MAXPGPATH]; /* path, for error messages */ } RewriteMappingFile; diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c index 5baea7535b..2dcc04fef2 100644 --- a/src/backend/backup/basebackup.c +++ b/src/backend/backup/basebackup.c @@ -95,7 +95,8 @@ static void perform_base_backup(basebackup_options *opt, bbsink *sink); static void parse_basebackup_options(List *options, basebackup_options *opt); static int compareWalFileNames(const ListCell *a, const ListCell *b); static bool is_checksummed_file(const char *fullpath, const char *filename); -static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, +static int basebackup_read_file(int fd, char *buf, size_t nbytes, + pgoff_t offset, const char *filename, bool partial_read_ok); /* Was the backup currently in-progress initiated in recovery mode? */ @@ -1488,7 +1489,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename, bool block_retry = false; uint16 checksum; int checksum_failures = 0; - off_t cnt; + pgoff_t cnt; int i; pgoff_t len = 0; char *page; @@ -1827,7 +1828,7 @@ convert_link_to_directory(const char *pathbuf, struct stat *statbuf) * Returns the number of bytes read. 
*/ static int -basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, +basebackup_read_file(int fd, char *buf, size_t nbytes, pgoff_t offset, const char *filename, bool partial_read_ok) { int rc; diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index e04bc3941a..82f77536b4 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -120,8 +120,8 @@ copy_file(const char *fromfile, const char *tofile) int srcfd; int dstfd; int nbytes; - off_t offset; - off_t flush_offset; + pgoff_t offset; + pgoff_t flush_offset; /* Size of copy buffer (read and write requests) */ #define COPY_BUF_SIZE (8 * BLCKSZ) diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c index 504d82bef6..e69ad912a2 100644 --- a/src/bin/pg_basebackup/receivelog.c +++ b/src/bin/pg_basebackup/receivelog.c @@ -192,7 +192,7 @@ static bool close_walfile(StreamCtl *stream, XLogRecPtr pos) { char *fn; - off_t currpos; + pgoff_t currpos; int r; char walfile_name[MAXPGPATH]; diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c index 25996b4da4..3e96b8b0a8 100644 --- a/src/bin/pg_rewind/file_ops.c +++ b/src/bin/pg_rewind/file_ops.c @@ -85,7 +85,7 @@ close_target_file(void) } void -write_target_range(char *buf, off_t begin, size_t size) +write_target_range(char *buf, pgoff_t begin, size_t size) { size_t writeleft; char *p; @@ -203,7 +203,7 @@ remove_target_file(const char *path, bool missing_ok) } void -truncate_target_file(const char *path, off_t newsize) +truncate_target_file(const char *path, pgoff_t newsize) { char dstpath[MAXPGPATH]; int fd; diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h index 427cf8e0b5..41a41cb6cb 100644 --- a/src/bin/pg_rewind/file_ops.h +++ b/src/bin/pg_rewind/file_ops.h @@ -13,10 +13,10 @@ #include "filemap.h" extern void open_target_file(const char *path, bool trunc); -extern void write_target_range(char *buf, off_t begin, size_t size); +extern void write_target_range(char *buf, pgoff_t begin, size_t size); extern void close_target_file(void); extern void remove_target_file(const char *path, bool missing_ok); -extern void truncate_target_file(const char *path, off_t newsize); +extern void truncate_target_file(const char *path, pgoff_t newsize); extern void create_target(file_entry_t *entry); extern void remove_target(file_entry_t *entry); extern void sync_target_dir(void); diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c index bd5c598e20..a5855ccaa9 100644 --- a/src/bin/pg_rewind/filemap.c +++ b/src/bin/pg_rewind/filemap.c @@ -296,6 +296,8 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator, BlockNumber blkno_inseg; int segno; + /* XXX We need to know if it is segmented! 
*/ + segno = blkno / RELSEG_SIZE; blkno_inseg = blkno % RELSEG_SIZE; diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c index 5f486b2a61..d4832ccb76 100644 --- a/src/bin/pg_rewind/libpq_source.c +++ b/src/bin/pg_rewind/libpq_source.c @@ -30,7 +30,7 @@ typedef struct { const char *path; /* path relative to data directory root */ - off_t offset; + pgoff_t offset; size_t length; } fetch_range_request; @@ -65,7 +65,7 @@ static void libpq_traverse_files(rewind_source *source, process_file_callback_t callback); static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len); static void libpq_queue_fetch_range(rewind_source *source, const char *path, - off_t off, size_t len); + pgoff_t off, size_t len); static void libpq_finish_fetch(rewind_source *source); static char *libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize); @@ -343,7 +343,7 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len) * Queue up a request to fetch a piece of a file from remote system. */ static void -libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off, +libpq_queue_fetch_range(rewind_source *source, const char *path, pgoff_t off, size_t len) { libpq_source *src = (libpq_source *) source; diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c index 4e2a1376c6..fb84309c12 100644 --- a/src/bin/pg_rewind/local_source.c +++ b/src/bin/pg_rewind/local_source.c @@ -32,7 +32,7 @@ static char *local_fetch_file(rewind_source *source, const char *path, static void local_queue_fetch_file(rewind_source *source, const char *path, size_t len); static void local_queue_fetch_range(rewind_source *source, const char *path, - off_t off, size_t len); + pgoff_t off, size_t len); static void local_finish_fetch(rewind_source *source); static void local_destroy(rewind_source *source); @@ -125,15 +125,15 @@ local_queue_fetch_file(rewind_source *source, const char *path, size_t len) * Copy a file from source to target, starting at 'off', for 'len' bytes. */ static void -local_queue_fetch_range(rewind_source *source, const char *path, off_t off, +local_queue_fetch_range(rewind_source *source, const char *path, pgoff_t off, size_t len) { const char *datadir = ((local_source *) source)->datadir; PGIOAlignedBlock buf; char srcpath[MAXPGPATH]; int srcfd; - off_t begin = off; - off_t end = off + len; + pgoff_t begin = off; + pgoff_t end = off + len; snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path); diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index f7f3b8227f..500842e169 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -566,7 +566,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source, { datapagemap_iterator_t *iter; BlockNumber blkno; - off_t offset; + pgoff_t offset; iter = datapagemap_iterate(&entry->target_pages_to_overwrite); while (datapagemap_next(iter, &blkno)) diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h index 69ad0e495f..e17526ce86 100644 --- a/src/bin/pg_rewind/rewind_source.h +++ b/src/bin/pg_rewind/rewind_source.h @@ -45,7 +45,7 @@ typedef struct rewind_source * queue and execute all requests. 
*/ void (*queue_fetch_range) (struct rewind_source *, const char *path, - off_t offset, size_t len); + pgoff_t offset, size_t len); /* * Like queue_fetch_range(), but requests replacing the whole local file diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index a038450787..d82cd027f4 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -396,7 +396,7 @@ typedef struct xl_heap_rewrite_mapping TransactionId mapped_xid; /* xid that might need to see the row */ Oid mapped_db; /* DbOid or InvalidOid for shared rels */ Oid mapped_rel; /* Oid of the mapped relation */ - off_t offset; /* How far have we written so far */ + pgoff_t offset; /* How far have we written so far */ uint32 num_mappings; /* Number of in-memory mappings */ XLogRecPtr start_lsn; /* Insert LSN at begin of rewrite */ } xl_heap_rewrite_mapping; -- 2.40.1
From d22479403d02944e6c2569897816137f8582c6f1 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 11:51:15 +1300 Subject: [PATCH 05/11] Use large files for relation storage. Traditionally we broke files up into 1GB segments (configurable) to support older OSes before the industry transition to "large files" in the mid 90s. These days, the only remaining consideration on living operating systems is that Windows still has 32 bit types in a few interfaces, but we deal with that by being careful to use pgoff_t everywhere instead of off_t. Having many segment files creates extra work for the kernel, which must manage many more descriptors, and extra work for PostgreSQL, which must close and reopen them to stay under per-process descriptor limits. With this patch, all new relations will be non-segmented. The only way to have a segmented relation is to inherit it via pg_upgrade. For some number of releases, legacy segmented relations will be supported, and can be upgraded to non-segmented format by any operation that rewrites the relation, creating a new relfilenode (VACUUM FULL, etc). --- src/backend/storage/smgr/md.c | 227 +++++++++++++++++++++++++++------- src/include/storage/smgr.h | 1 + 2 files changed, 181 insertions(+), 47 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index e982a8dd7f..005a7a15bf 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -42,6 +42,14 @@ #include "utils/memutils.h" /* + * The magnetic disk storage manager assumes that the operating system + * supports "large files". Historically, this wasn't the case, so there is + * support for "segmented" files that were upgraded from earlier releases. + * A future release may eventually drop support for those. See + * md_fork_is_segmented() for details. + * + * The following paragraphs describe the historical behavior. + * * The magnetic disk storage manager keeps track of open file * descriptors in its own descriptor pool.
This is done to make it * easier to support relations that are larger than the operating @@ -119,6 +127,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */ /* don't try to open a segment, if not already open */ #define EXTENSION_DONT_OPEN (1 << 5) +#define MD_FORK_SEGMENTED_UNKNOWN 'u' +#define MD_FORK_SEGMENTED_FALSE 'f' +#define MD_FORK_SEGMENTED_TRUE 't' /* local routines */ static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, @@ -139,8 +150,11 @@ static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno, int oflags); static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, bool skipFsync, int behavior); +static pgoff_t getseekpos(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); +static bool md_fork_is_segmented(SMgrRelation reln, ForkNumber forknum); static inline int _mdfd_open_flags(void) @@ -459,7 +473,7 @@ void mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync) { - off_t seekpos; + pgoff_t seekpos; int nbytes; MdfdVec *v; @@ -486,10 +500,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, InvalidBlockNumber))); v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE); - - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ) { @@ -511,7 +522,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, if (!skipFsync && !SmgrIsTemp(reln)) register_dirty_segment(reln, forknum, v); - Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); + if (md_fork_is_segmented(reln, forknum)) + Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); } /* @@ -549,20 +561,30 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, while (remblocks > 0) { - BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE); - off_t seekpos = (off_t) BLCKSZ * segstartblock; + BlockNumber segstartblock; + pgoff_t seekpos; int numblocks; - if (segstartblock + remblocks > RELSEG_SIZE) - numblocks = RELSEG_SIZE - segstartblock; + if (md_fork_is_segmented(reln, forknum)) + { + segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE); + seekpos = (pgoff_t) BLCKSZ * segstartblock; + if (segstartblock + remblocks > RELSEG_SIZE) + numblocks = RELSEG_SIZE - segstartblock; + else + numblocks = remblocks; + Assert(segstartblock < RELSEG_SIZE); + Assert(segstartblock + numblocks <= RELSEG_SIZE); + } else + { + segstartblock = curblocknum; + seekpos = (pgoff_t) BLCKSZ * segstartblock; numblocks = remblocks; + } v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE); - Assert(segstartblock < RELSEG_SIZE); - Assert(segstartblock + numblocks <= RELSEG_SIZE); - /* * If available and useful, use posix_fallocate() (via FileAllocate()) * to extend the relation. That's often more efficient than using @@ -579,7 +601,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, int ret; ret = FileFallocate(v->mdfd_vfd, - seekpos, (off_t) BLCKSZ * numblocks, + seekpos, (pgoff_t) BLCKSZ * numblocks, WAIT_EVENT_DATA_FILE_EXTEND); if (ret != 0) { @@ -602,7 +624,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, * zeroed buffer for the whole length of the extension. 
*/ ret = FileZero(v->mdfd_vfd, - seekpos, (off_t) BLCKSZ * numblocks, + seekpos, (pgoff_t) BLCKSZ * numblocks, WAIT_EVENT_DATA_FILE_EXTEND); if (ret < 0) ereport(ERROR, @@ -615,7 +637,8 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, if (!skipFsync && !SmgrIsTemp(reln)) register_dirty_segment(reln, forknum, v); - Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); + if (md_fork_is_segmented(reln, forknum)) + Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); remblocks -= numblocks; curblocknum += numblocks; @@ -644,7 +667,6 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior) return &reln->md_seg_fds[forknum][0]; path = relpath(reln->smgr_rlocator, forknum); - fd = PathNameOpenFile(path, _mdfd_open_flags()); if (fd < 0) @@ -667,7 +689,8 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior) mdfd->mdfd_vfd = fd; mdfd->mdfd_segno = 0; - Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE)); + if (md_fork_is_segmented(reln, forknum)) + Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE)); return mdfd; } @@ -680,7 +703,10 @@ mdopen(SMgrRelation reln) { /* mark it not open */ for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++) + { + reln->md_segmented[forknum] = MD_FORK_SEGMENTED_UNKNOWN; reln->md_num_open_segs[forknum] = 0; + } } /* @@ -713,7 +739,7 @@ bool mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) { #ifdef USE_PREFETCH - off_t seekpos; + pgoff_t seekpos; MdfdVec *v; Assert((io_direct_flags & IO_DIRECT_DATA) == 0); @@ -723,9 +749,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) if (v == NULL) return false; - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); (void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH); #endif /* USE_PREFETCH */ @@ -752,10 +776,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum, while (nblocks > 0) { BlockNumber nflush = nblocks; - off_t seekpos; + pgoff_t seekpos; MdfdVec *v; - int segnum_start, - segnum_end; v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ , EXTENSION_DONT_OPEN); @@ -770,20 +792,26 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum, if (!v) return; - /* compute offset inside the current segment */ - segnum_start = blocknum / RELSEG_SIZE; + if (md_fork_is_segmented(reln, forknum)) + { + int segnum_start, + segnum_end; + + /* compute offset inside the current segment */ + segnum_start = blocknum / RELSEG_SIZE; - /* compute number of desired writes within the current segment */ - segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE; - if (segnum_start != segnum_end) - nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)); + /* compute number of desired writes within the current segment */ + segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE; + if (segnum_start != segnum_end) + nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)); - Assert(nflush >= 1); - Assert(nflush <= nblocks); + Assert(nflush >= 1); + Assert(nflush <= nblocks); + } - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + seekpos = getseekpos(reln, forknum, blocknum); - FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH); + FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH); nblocks -= nflush; blocknum += nflush; @@ -797,7 +825,7 @@ void 
mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, void *buffer) { - off_t seekpos; + pgoff_t seekpos; int nbytes; MdfdVec *v; @@ -814,9 +842,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ); @@ -866,7 +892,7 @@ void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync) { - off_t seekpos; + pgoff_t seekpos; int nbytes; MdfdVec *v; @@ -888,9 +914,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE); @@ -962,6 +986,13 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum) for (;;) { nblocks = _mdnblocks(reln, forknum, v); + + if (!md_fork_is_segmented(reln, forknum)) + { + Assert(segno == 0); + return nblocks; + } + if (nblocks > ((BlockNumber) RELSEG_SIZE)) elog(FATAL, "segment too big"); if (nblocks < ((BlockNumber) RELSEG_SIZE)) @@ -1013,6 +1044,25 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) if (nblocks == curnblk) return; /* no work */ + if (!md_fork_is_segmented(reln, forknum)) + { + MdfdVec *v; + + Assert(reln->md_num_open_segs[forknum] == 1); + v = &reln->md_seg_fds[forknum][0]; + + if (FileTruncate(v->mdfd_vfd, (pgoff_t) nblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not truncate file \"%s\" to %u blocks: %m", + FilePathName(v->mdfd_vfd), + nblocks))); + if (!SmgrIsTemp(reln)) + register_dirty_segment(reln, forknum, v); + + return; + } + /* * Truncate segments, starting at the last one. Starting at the end makes * managing the memory for the fd array easier, should there be errors. 
@@ -1058,7 +1108,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) */ BlockNumber lastsegblocks = nblocks - priorblocks; - if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0) + if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0) ereport(ERROR, (errcode_for_file_access(), errmsg("could not truncate file \"%s\" to %u blocks: %m", @@ -1396,7 +1446,10 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL | EXTENSION_DONT_OPEN)); - targetseg = blkno / ((BlockNumber) RELSEG_SIZE); + if (md_fork_is_segmented(reln, forknum)) + targetseg = blkno / ((BlockNumber) RELSEG_SIZE); + else + targetseg = 0; /* if an existing and opened segment, we're done */ if (targetseg < reln->md_num_open_segs[forknum]) @@ -1433,7 +1486,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, Assert(nextsegno == v->mdfd_segno + 1); - if (nblocks > ((BlockNumber) RELSEG_SIZE)) + if (md_fork_is_segmented(reln, forknum) && + nblocks > ((BlockNumber) RELSEG_SIZE)) elog(FATAL, "segment too big"); if ((behavior & EXTENSION_CREATE) || @@ -1493,6 +1547,9 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, blkno, nblocks))); } + if (!md_fork_is_segmented(reln, forknum)) + break; + v = _mdfd_openseg(reln, forknum, nextsegno, flags); if (v == NULL) @@ -1511,13 +1568,22 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, return v; } +static pgoff_t +getseekpos(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) +{ + if (md_fork_is_segmented(reln, forknum)) + return (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + + return (pgoff_t) BLCKSZ * blocknum; +} + /* * Get number of blocks present in a single disk file */ static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) { - off_t len; + pgoff_t len; len = FileSize(seg->mdfd_vfd); if (len < 0) @@ -1618,3 +1684,70 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate) */ return ftag->rlocator.dbOid == candidate->rlocator.dbOid; } + +/* + * Is this fork in legacy segmented format, inherited from an earlier release + * via pg_upgrade? + */ +static bool +md_fork_is_segmented(SMgrRelation reln, ForkNumber forknum) +{ + char path_probe[MAXPGPATH]; + char *path; + + Assert(forknum >= 0 && forknum <= MAX_FORKNUM); + + /* Fast return if we have the answer cached. */ + if (reln->md_segmented[forknum] == MD_FORK_SEGMENTED_FALSE) + return false; + if (reln->md_segmented[forknum] == MD_FORK_SEGMENTED_TRUE) + return true; + + Assert(reln->md_segmented[forknum] == MD_FORK_SEGMENTED_UNKNOWN); + + /* + * All backends must agree, using only clues from the file system, and the + * answer must not change for as long as this relation exists. The + * correctness of this strategy depends on the following properties: + * + * 1. When segmented forks are truncated, their higher numbered segments + * are truncated to size zero, but they still exist. That is, higher + * segments won't be unlinked for as long as the relation exists. + * + * 2. We don't create new segmented relations, so the only way they can + * exist is if we inherited them via pg_upgrade from an earlier + * release. + * + * 3. Relations that never had more than one segment and were pg_upgraded + * are indistinguishable from newly created (non-segmented) relations. + * + * 4.
If the relfilenode is recycled for a later relation, all backends + * will close all segments first before potentially reopening the next + * generation, either via the sinval or ProcSignalBarrier cache + * invalidation system. + * + * Therefore, it is safe for every backend to determine whether the fork is + * segmented by checking the existence of a ".1" file. + */ + path = relpath(reln->smgr_rlocator, forknum); + snprintf(path_probe, sizeof(path_probe), "%s.1", path); + if (access(path_probe, F_OK) == 0) + { + pfree(path); + reln->md_segmented[forknum] = MD_FORK_SEGMENTED_TRUE; + return true; + } + else if (errno == ENOENT) + { + pfree(path); + reln->md_segmented[forknum] = MD_FORK_SEGMENTED_FALSE; + return false; + } + pfree(path); + + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access file \"%s\": %m", + path_probe))); + pg_unreachable(); +} diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a9a179aaba..e352a035be 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -65,6 +65,7 @@ typedef struct SMgrRelationData * for md.c; per-fork arrays of the number of open segments * (md_num_open_segs) and the segments themselves (md_seg_fds). */ + char md_segmented[MAX_FORKNUM + 1]; int md_num_open_segs[MAX_FORKNUM + 1]; struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1]; -- 2.40.1
From d1ffce7141cd34eff9d0d3f65f5e18f472b6d813 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 30 Apr 2023 10:38:46 +1200 Subject: [PATCH 06/11] Detect copy_file_range() function. --- configure | 2 +- configure.ac | 1 + meson.build | 1 + src/include/pg_config.h.in | 3 +++ src/tools/msvc/Solution.pm | 1 + 5 files changed, 7 insertions(+), 1 deletion(-) diff --git a/configure b/configure index 47ba18491c..7d351b9614 100755 --- a/configure +++ b/configure @@ -15700,7 +15700,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.ac b/configure.ac index 2b3b1b4dca..ddb82e9433 100644 --- a/configure.ac +++ b/configure.ac @@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` AC_CHECK_FUNCS(m4_normalize([ backtrace_symbols copyfile + copy_file_range getifaddrs getpeerucred inet_pton diff --git a/meson.build b/meson.build index 096044628c..c06e4f9290 100644 --- a/meson.build +++ b/meson.build @@ -2404,6 +2404,7 @@ func_checks = [ ['backtrace_symbols', {'dependencies': [execinfo_dep]}], ['clock_gettime', {'dependencies': [rt_dep, posix4_dep], 'define': false}], ['copyfile'], + ['copy_file_range'], # gcc/clang's sanitizer helper library provides dlopen but not dlsym, thus # when enabling asan the dlopen check doesn't notice that -ldl is actually # required. Just checking for dlsym() ought to suffice. diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 6d572c3820..0b26836f68 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -85,6 +85,9 @@ /* Define to 1 if you have the <copyfile.h> header file. */ #undef HAVE_COPYFILE_H +/* Define to 1 if you have the `copy_file_range' function. */ +#undef HAVE_COPY_FILE_RANGE + /* Define to 1 if you have the <crtdefs.h> header file. */ #undef HAVE_CRTDEFS_H diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index ef10cda576..671d958af7 100644 --- a/src/tools/msvc/Solution.pm +++ b/src/tools/msvc/Solution.pm @@ -230,6 +230,7 @@ sub GenerateFiles HAVE_COMPUTED_GOTO => undef, HAVE_COPYFILE => undef, HAVE_COPYFILE_H => undef, + HAVE_COPY_FILE_RANGE => undef, HAVE_CRTDEFS_H => undef, HAVE_CRYPTO_LOCK => undef, HAVE_DECL_FDATASYNC => 0, -- 2.40.1
From d89cbae1851627be4e146efedc92ba9d0a67ad6a Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 30 Apr 2023 11:10:08 +1200 Subject: [PATCH 07/11] Use copy_file_range() to implement copy_file(). If copy_file_range() is available, use it to implement copy_file(), so that the operating system has opportunities for efficient copying, block cloning and pushdown. This affects the commands CREATE DATABASE STRATEGY=FILE_COPY and ALTER TABLE SET TABLESPACE, which perform bulk file copies. On older Linux systems, copy_file_range() might fail with EXDEV, so we look out for that and fall back to the traditional read/write loop. XXX Should we also let the user opt out? --- doc/src/sgml/monitoring.sgml | 4 ++ src/backend/storage/file/copydir.c | 94 +++++++++++++++++++------ src/backend/utils/activity/wait_event.c | 3 + src/include/utils/wait_event.h | 1 + 4 files changed, 82 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 99f7f95c39..2161b32b17 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1317,6 +1317,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser <entry>Waiting for a write to update the <filename>pg_control</filename> file.</entry> </row> + <row> + <entry><literal>CopyFileRange</literal></entry> + <entry>Waiting for range to be copied during a file copy operation.</entry> + </row> <row> <entry><literal>CopyFileRead</literal></entry> <entry>Waiting for a read during a file copy operation.</entry> diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 82f77536b4..497d357d8c 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -126,6 +126,14 @@ copy_file(const char *fromfile, const char *tofile) /* Size of copy buffer (read and write requests) */ #define COPY_BUF_SIZE (8 * BLCKSZ) + /* + * Size of ranges when using copy_file_range(). We could in theory just + * use the whole file size, but we want to check for interrupts + * periodically while copying. We don't want to make it too small though, + * to give the operating system the chance to clone large extents. + */ +#define COPY_FILE_RANGE_CHUNK_SIZE (1024 * 1024) + /* * Size of data flush requests. It seems beneficial on most platforms to * do this every 1MB or so. But macOS, at least with early releases of @@ -138,8 +146,13 @@ copy_file(const char *fromfile, const char *tofile) #define FLUSH_DISTANCE (1024 * 1024) #endif +#ifdef HAVE_COPY_FILE_RANGE + /* Don't allocate the buffer unless we have to fall back to read/write. 
*/ + buffer = NULL; +#else /* Use palloc to ensure we get a maxaligned buffer */ buffer = palloc(COPY_BUF_SIZE); +#endif /* * Open the files @@ -176,27 +189,67 @@ copy_file(const char *fromfile, const char *tofile) flush_offset = offset; } - pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ); - nbytes = read(srcfd, buffer, COPY_BUF_SIZE); - pgstat_report_wait_end(); - if (nbytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read file \"%s\": %m", fromfile))); - if (nbytes == 0) - break; - errno = 0; - pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE); - if ((int) write(dstfd, buffer, nbytes) != nbytes) + nbytes = 0; /* silence compiler */ + +#ifdef HAVE_COPY_FILE_RANGE + if (buffer == NULL) + { + pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_RANGE); + nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, + COPY_FILE_RANGE_CHUNK_SIZE, 0); + pgstat_report_wait_end(); + + if (nbytes < 0) + { + if (errno == EXDEV) + { + /* + * Linux < 5.3 fails like this for cross-filesystem copies. + * Allocate the buffer to fall back to read/write mode. + */ + buffer = palloc(COPY_BUF_SIZE); + } + else + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not copy to file \"%s\": %m", tofile))); + } + } +#endif + + if (buffer) { - /* if write didn't set errno, assume problem is no disk space */ - if (errno == 0) - errno = ENOSPC; - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not write to file \"%s\": %m", tofile))); + pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ); + nbytes = read(srcfd, buffer, COPY_BUF_SIZE); + pgstat_report_wait_end(); + + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read file \"%s\": %m", fromfile))); + + if (nbytes > 0) + { + errno = 0; + pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE); + if ((int) write(dstfd, buffer, nbytes) != nbytes) + { + /* + * If write didn't set errno, assume problem is no disk + * space. + */ + if (errno == 0) + errno = ENOSPC; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", tofile))); + } + pgstat_report_wait_end(); + } } - pgstat_report_wait_end(); + + if (nbytes == 0) + break; } if (offset > flush_offset) @@ -212,5 +265,6 @@ copy_file(const char *fromfile, const char *tofile) (errcode_for_file_access(), errmsg("could not close file \"%s\": %m", fromfile))); - pfree(buffer); + if (buffer) + pfree(buffer); } diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c index 7940d64639..9c3cd088c0 100644 --- a/src/backend/utils/activity/wait_event.c +++ b/src/backend/utils/activity/wait_event.c @@ -567,6 +567,9 @@ pgstat_get_wait_io(WaitEventIO w) case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE: event_name = "ControlFileWriteUpdate"; break; + case WAIT_EVENT_COPY_FILE_RANGE: + event_name = "CopyFileRange"; + break; case WAIT_EVENT_COPY_FILE_READ: event_name = "CopyFileRead"; break; diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h index 518d3b0a1f..517de1544b 100644 --- a/src/include/utils/wait_event.h +++ b/src/include/utils/wait_event.h @@ -172,6 +172,7 @@ typedef enum WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE, WAIT_EVENT_CONTROL_FILE_WRITE, WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE, + WAIT_EVENT_COPY_FILE_RANGE, WAIT_EVENT_COPY_FILE_READ, WAIT_EVENT_COPY_FILE_WRITE, WAIT_EVENT_DATA_FILE_EXTEND, -- 2.40.1
From f83a0a9f80614e18b780e7636e5c2e567b2f701e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 30 Apr 2023 15:36:20 +1200
Subject: [PATCH 08/11] Teach copy_file() to concatenate segmented files.

This means that relations are automatically converted to large file
format during CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER TABLE ...
SET TABLESPACE operations.
---
 src/backend/storage/file/copydir.c | 43 +++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 497d357d8c..0b472f1ac2 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -71,7 +71,19 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 				copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
+		{
+			const char *s;
+
+			/*
+			 * Skip legacy segment files ending in ".N".  copy_file() will
+			 * deal with those.
+			 */
+			s = strrchr(fromfile, '.');
+			if (s && strspn(s + 1, "0123456789") == strlen(s + 1))
+				continue;
+
 			copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -117,6 +129,7 @@ void
 copy_file(const char *fromfile, const char *tofile)
 {
 	char	   *buffer;
+	int			segno;
 	int			srcfd;
 	int			dstfd;
 	int			nbytes;
@@ -154,6 +167,8 @@ copy_file(const char *fromfile, const char *tofile)
 	buffer = palloc(COPY_BUF_SIZE);
 #endif
 
+	segno = 0;
+
 	/*
 	 * Open the files
 	 */
@@ -248,8 +263,34 @@ copy_file(const char *fromfile, const char *tofile)
 			}
 		}
 
+		/*
+		 * If we ran out of source data on the expected boundary of a legacy
+		 * relation file segment, try opening the next segment.
+		 */
 		if (nbytes == 0)
-			break;
+		{
+			char		nextpath[MAXPGPATH];
+			int			nextfd;
+
+			if (offset % (RELSEG_SIZE * BLCKSZ) != 0)
+				break;
+
+			snprintf(nextpath, sizeof(nextpath), "%s.%d", fromfile, ++segno);
+			nextfd = OpenTransientFile(nextpath, O_RDONLY | PG_BINARY);
+			if (nextfd < 0)
+			{
+				if (errno == ENOENT)
+					break;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open file \"%s\": %m", nextpath)));
+			}
+			if (CloseTransientFile(srcfd) != 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", fromfile)));
+			srcfd = nextfd;
+		}
 	}
 
 	if (offset > flush_offset)
-- 
2.40.1
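(Another reviewer aside: the filename test added to copydir() above,
distilled into a predicate.  One extra nicety here that the patch as
posted doesn't have, flagged as my own suggestion: requiring at least one
digit after the dot, so a name ending in a bare "." can't be
misclassified as a segment.)

    #include <stdbool.h>
    #include <string.h>

    /*
     * True if "name" looks like a legacy segment file such as "16384.1",
     * i.e. it ends in "." followed by one or more decimal digits.
     */
    static bool
    is_legacy_segment_name(const char *name)
    {
        const char *s = strrchr(name, '.');

        return s != NULL && s[1] != '\0' &&
            strspn(s + 1, "0123456789") == strlen(s + 1);
    }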
From b435220922d7cd916f1b7acce313c8174738991c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 30 Apr 2023 14:45:45 +1200
Subject: [PATCH 09/11] Use copy_file_range() in pg_upgrade.

This gives the kernel the opportunity to copy or clone efficiently.  We
watch out for EXDEV and fall back to read/write for old Linux kernels.

XXX Should we also let the user opt out?
---
 src/bin/pg_upgrade/file.c | 65 ++++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index d173602882..836b2bbbd0 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,6 +9,7 @@
 
 #include "postgres_fe.h"
 
+#include <limits.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #ifdef HAVE_COPYFILE_H
@@ -98,32 +99,68 @@ copyFile(const char *src, const char *dst,
 	/* copy in fairly large chunks for best efficiency */
 #define COPY_BUF_SIZE (50 * BLCKSZ)
 
+#ifdef HAVE_COPY_FILE_RANGE
+	buffer = NULL;
+#else
 	buffer = (char *) pg_malloc(COPY_BUF_SIZE);
+#endif
 
 	/* perform data copying i.e read src source, write to destination */
 	while (true)
 	{
-		ssize_t		nbytes = read(src_fd, buffer, COPY_BUF_SIZE);
+		ssize_t		nbytes = 0;
 
-		if (nbytes < 0)
-			pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
-					 schemaName, relName, src, strerror(errno));
+#ifdef HAVE_COPY_FILE_RANGE
+		if (buffer == NULL)
+		{
+			nbytes = copy_file_range(src_fd, NULL, dest_fd, NULL, SSIZE_MAX, 0);
+			if (nbytes < 0)
+			{
+				if (errno == EXDEV)
+				{
+					/* Linux < 5.3 might fail.  Fall back to read/write. */
+					buffer = (char *) pg_malloc(COPY_BUF_SIZE);
+				}
+				else
+				{
+					pg_fatal("error while copying relation \"%s.%s\": could not copy file \"%s\": %s",
+							 schemaName, relName, src, strerror(errno));
+				}
+			}
+		}
+#endif
 
-		if (nbytes == 0)
-			break;
+		if (buffer)
+		{
+			nbytes = read(src_fd, buffer, COPY_BUF_SIZE);
 
-		errno = 0;
-		if (write(dest_fd, buffer, nbytes) != nbytes)
-		{
-			/* if write didn't set errno, assume problem is no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			pg_fatal("error while copying relation \"%s.%s\": could not write file \"%s\": %s",
-					 schemaName, relName, dst, strerror(errno));
+			if (nbytes < 0)
+				pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
+						 schemaName, relName, src, strerror(errno));
+
+			if (nbytes > 0)
+			{
+				errno = 0;
+				if (write(dest_fd, buffer, nbytes) != nbytes)
+				{
+					/*
+					 * If write didn't set errno, assume problem is no disk
+					 * space.
+					 */
+					if (errno == 0)
+						errno = ENOSPC;
+					pg_fatal("error while copying relation \"%s.%s\": could not write file \"%s\": %s",
+							 schemaName, relName, dst, strerror(errno));
+				}
+			}
 		}
+
+		if (nbytes == 0)
+			break;
 	}
 
-	pg_free(buffer);
+	if (buffer)
+		pg_free(buffer);
 
 	close(src_fd);
 	close(dest_fd);
-- 
2.40.1
From 8683941485516e594174f8cb04d437962e4698f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 30 Apr 2023 16:05:46 +1200
Subject: [PATCH 10/11] Teach pg_upgrade to concatenate segmented files.

When using copy mode, segmented relation forks are automatically
concatenated into modern large format.  When using hard link or clone
mode, segment files continue to exist in the destination cluster.

We lose the ability to use the Windows CopyFile() optimization, because
it doesn't support concatenation.

XXX Could be restored as a way of copying segment 0.
XXX Allow user to opt out of concatenation for copy mode too?
---
 src/bin/pg_upgrade/file.c          | 40 ++++++++++++++++++++----------
 src/bin/pg_upgrade/relfilenumber.c |  4 +++
 2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 836b2bbbd0..b4e991f95d 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -82,10 +82,11 @@ void
 copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName)
 {
-#ifndef WIN32
 	int			src_fd;
 	int			dest_fd;
 	char	   *buffer;
+	pgoff_t		total_bytes = 0;
+	int			segno = 0;
 
 	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
 		pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s",
@@ -155,25 +156,38 @@ copyFile(const char *src, const char *dst,
 			}
 		}
 
+		total_bytes += nbytes;
+
 		if (nbytes == 0)
-			break;
+		{
+			char		next_path[MAXPGPATH];
+			int			next_fd;
+
+			/* If not at a segment boundary size, this must be the end. */
+			if (total_bytes % (RELSEG_SIZE * BLCKSZ) != 0)
+				break;
+
+			/* Is there another segment? */
+			snprintf(next_path, sizeof(next_path), "%s.%d", src, ++segno);
+			next_fd = open(next_path, O_RDONLY | PG_BINARY, 0);
+			if (next_fd < 0)
+			{
+				if (errno == ENOENT)
+					break;
+				pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s",
+						 schemaName, relName, next_path, strerror(errno));
+			}
+
+			/* Yes.  Start copying from that one. */
+			close(src_fd);
+			src_fd = next_fd;
+		}
 	}
 
 	if (buffer)
 		pg_free(buffer);
 
 	close(src_fd);
 	close(dest_fd);
-
-#else							/* WIN32 */
-
-	if (CopyFile(src, dst, true) == 0)
-	{
-		_dosmaperr(GetLastError());
-		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s",
-				 schemaName, relName, src, dst, strerror(errno));
-	}
-
-#endif							/* WIN32 */
 }
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 34bc9c1504..ea2abfb00f 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -185,6 +185,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 	 */
 	for (segno = 0;; segno++)
 	{
+		/* Copy mode knows how to find higher numbered segments itself. */
+		if (user_opts.transfer_mode == TRANSFER_MODE_COPY && segno > 0)
+			break;
+
 		if (segno == 0)
 			extent_suffix[0] = '\0';
 		else
-- 
2.40.1
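(On the first XXX above: one possible shape for restoring the Windows
fast path, sketched under the assumption that probing for "<src>.1" is a
good enough test for "nothing to concatenate".  GetFileAttributes() is
the stock Win32 existence test; the CopyFile() call and error handling
are lifted from the code this patch removes.  Not part of the patch.)

    #ifdef WIN32
        char        seg1_path[MAXPGPATH];

        /*
         * If there is no second segment, no concatenation is needed, so the
         * old CopyFile() fast path is still safe.
         */
        snprintf(seg1_path, sizeof(seg1_path), "%s.1", src);
        if (GetFileAttributes(seg1_path) == INVALID_FILE_ATTRIBUTES)
        {
            if (CopyFile(src, dst, true) == 0)
            {
                _dosmaperr(GetLastError());
                pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s",
                         schemaName, relName, src, dst, strerror(errno));
            }
            return;
        }
        /* Otherwise fall through to the concatenating copy loop. */
    #endif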
From fc3316b064486d5c15009fc98771a0686914609a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 2 May 2023 11:15:10 +1200
Subject: [PATCH 11/11] Teach basebackup to concatenate segmented files.

Since basebackups have to read and write all relations, they have an
opportunity to convert to large file format on the fly.  Take it.

XXX There may be some bugs hiding in here when sizeof(ssize_t) <
sizeof(pgoff_t)?
---
 src/backend/backup/basebackup.c | 92 +++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 21 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 2dcc04fef2..e2534895eb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1339,6 +1339,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 				continue;		/* don't recurse into pg_wal */
 		}
 
+		/*
+		 * Skip relation segment files because sendFile() will find them when
+		 * called for the initial segment.
+		 */
+		if (isDbDir)
+		{
+			const char *s = strrchr(de->d_name, '.');
+
+			if (s && strspn(s + 1, "0123456789") == strlen(s + 1))
+				continue;
+		}
+
 		/* Allow symbolic links in pg_tblspc only */
 		if (strcmp(path, "./pg_tblspc") == 0 && S_ISLNK(statbuf.st_mode))
 		{
@@ -1476,6 +1487,10 @@ is_checksummed_file(const char *fullpath, const char *filename)
  * If dboid is anything other than InvalidOid then any checksum failures
  * detected will get reported to the cumulative stats system.
  *
+ * If the file is multi-segmented, the segments are concatenated and sent as
+ * one file.  On return, statbuf->st_size contains the complete size of the
+ * single sent file.
+ *
  * Returns true if the file was successfully sent, false if 'missing_ok',
 * and the file did not exist.
  */
@@ -1495,10 +1510,34 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	char	   *page;
 	PageHeader	phdr;
 	int			segmentno = 0;
-	char	   *segmentpath;
+	int			nsegments = 1;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/*
+	 * This function is only called for the head segment of segmented files,
+	 * but we want to concatenate it on the fly into a large file.  If we
+	 * have reached a segment boundary, we'll try to open the next segment.
+	 * We count the segments and sum their sizes into statbuf->st_size.
+	 */
+	while (statbuf->st_size == (pgoff_t) nsegments * RELSEG_SIZE * BLCKSZ)
+	{
+		char		nextpath[MAXPGPATH];
+		struct stat nextstat;
+
+		snprintf(nextpath, sizeof(nextpath), "%s.%d", readfilename, nsegments);
+		if (lstat(nextpath, &nextstat) < 0)
+		{
+			if (errno == ENOENT)
+				break;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", nextpath)));
+		}
+		++nsegments;			/* count segment */
+		statbuf->st_size += nextstat.st_size;	/* sum size */
+	}
+
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 readfilename);
@@ -1527,23 +1566,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	filename = last_dir_separator(readfilename) + 1;
 
 	if (is_checksummed_file(readfilename, filename))
-	{
 		verify_checksum = true;
-
-		/*
-		 * Cut off at the segment boundary (".") to get the segment number
-		 * in order to mix it into the checksum.
-		 */
-		segmentpath = strstr(filename, ".");
-		if (segmentpath != NULL)
-		{
-			segmentno = atoi(segmentpath + 1);
-			if (segmentno == 0)
-				ereport(ERROR,
-						(errmsg("invalid segment number %d in file \"%s\"",
-								segmentno, filename)));
-		}
-	}
 
 	/*
@@ -1554,7 +1577,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	 */
 	while (len < statbuf->st_size)
 	{
-		size_t		remaining = statbuf->st_size - len;
+		pgoff_t		remaining = statbuf->st_size - len;
 
 		/* Try to read some more data. */
 		cnt = basebackup_read_file(fd, sink->bbs_buffer,
@@ -1676,10 +1699,37 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 		/*
 		 * If we hit end-of-file, a concurrent truncation must have occurred.
 		 * That's not an error condition, because WAL replay will fix things
-		 * up.
+		 * up.  It might also mean that we need to move to the next input
+		 * segment.
 		 */
 		if (cnt == 0)
+		{
+			/* Are we at the end of a segment?  Try to open the next one. */
+			if (len == ((pgoff_t) segmentno + 1) * RELSEG_SIZE * BLCKSZ)
+			{
+				char		nextpath[MAXPGPATH];
+				int			nextfd;
+
+				/* Try to open the next segment. */
+				snprintf(nextpath, sizeof(nextpath), "%s.%d", readfilename,
+						 segmentno + 1);
+				nextfd = OpenTransientFile(nextpath, O_RDONLY | PG_BINARY);
+				if (nextfd < 0)
+				{
+					if (errno == ENOENT)
+						break;
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not open file \"%s\": %m", nextpath)));
+				}
+
+				CloseTransientFile(fd);
+				fd = nextfd;
+				++segmentno;
+				continue;
+			}
+
+			/* Otherwise we're at the end of input data. */
 			break;
+		}
 
 		/* Archive the data we just read. */
 		bbsink_archive_contents(sink, cnt);
@@ -1695,8 +1745,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	/* If the file was truncated while we were sending it, pad it with zeros */
 	while (len < statbuf->st_size)
 	{
-		size_t		remaining = statbuf->st_size - len;
-		size_t		nbytes = Min(sink->bbs_buffer_length, remaining);
+		pgoff_t		remaining = statbuf->st_size - len;
+		pgoff_t		nbytes = Min(sink->bbs_buffer_length, remaining);
 
 		MemSet(sink->bbs_buffer, 0, nbytes);
 		if (pg_checksum_update(&checksum_ctx,
-- 
2.40.1

(Editor's note on the hunk above, for anyone diffing against the posted
version: the original opened readfilename instead of the freshly built
nextpath when advancing to the next segment, leaving nextpath unused and
uninitialized in the error message, and closed the transient fd with
plain close(); both are fixed here.)
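(On that XXX: the trap is intermediate arithmetic like segmentno *
RELSEG_SIZE * BLCKSZ being evaluated in 32-bit int/ssize_t before being
widened.  A stand-alone illustration, with a stand-in pgoff_t typedef and
the default 8kB block / 1GB segment geometry assumed:)

    #include <stdint.h>

    #define BLCKSZ 8192
    #define RELSEG_SIZE 131072      /* blocks per 1GB segment */

    typedef int64_t pgoff_t;        /* stand-in for the real typedef */

    pgoff_t
    segment_start(int segmentno)
    {
        /*
         * RELSEG_SIZE * BLCKSZ is exactly 2^30, so without the cast the
         * product overflows 32-bit int arithmetic for segmentno >= 2.
         */
        return (pgoff_t) segmentno * RELSEG_SIZE * BLCKSZ;
    }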