Big PostgreSQL databases use, and regularly open and close, huge numbers of file descriptors and directory entries, for various anachronistic reasons, one of which is the 1GB RELSEG_SIZE scheme. The segment management code is trickier than you might think, and it still harbours known bugs.
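To make the segmentation overhead concrete, here is a quick standalone sketch (not from the patch set; the names and example paths are illustrative only) of the traditional block-to-segment arithmetic that md.c has to perform on every access, assuming the default BLCKSZ of 8192 and RELSEG_SIZE of 131072 blocks (1GB):

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ      8192     /* default PostgreSQL block size */
#define RELSEG_SIZE 131072   /* default blocks per segment = 1GB */

/*
 * Illustrative only: where the traditional segmented scheme finds block
 * 'blkno' of a fork stored under 'path' (e.g. "base/5/16384").
 */
static void
locate_block(const char *path, uint32_t blkno)
{
	uint32_t	segno = blkno / RELSEG_SIZE;
	uint64_t	offset = (uint64_t) BLCKSZ * (blkno % RELSEG_SIZE);

	if (segno == 0)
		printf("block %u -> %s at offset %llu\n",
			   blkno, path, (unsigned long long) offset);
	else
		printf("block %u -> %s.%u at offset %llu\n",
			   blkno, path, segno, (unsigned long long) offset);
}

int
main(void)
{
	locate_block("base/5/16384", 1000);		/* lands in file 16384 */
	locate_block("base/5/16384", 300000);	/* lands in file 16384.2 */
	return 0;
}

Every block access that crosses into another segment potentially means another file descriptor to open, manage and eventually close, which is where the kernel-side and PostgreSQL-side bookkeeping costs come from.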
A nearby analysis of yet another obscure segment life cycle bug reminded me of this patch set to switch to simple large files and eventually drop all of that. I originally meant to develop the attached sketch-quality code further and propose it in the 16 cycle, while I was down the modernisation rabbit hole[1], but then I got side-tracked: at some point I believed that the 56 bit relfilenode thing might be necessary for correctness, but then I found a set of rules that seems to hold up without it. I figured I might as well post what I have early in the 17 cycle as a "concept" patch to see which way the flames blow.

There are various boring details due to Windows, then a load of fairly obvious changes, and then a whole can of worms about how we'd handle the transition for the world's fleet of existing databases. I'll cut straight to that part. Different choices on aggressiveness could be made, but here are the straw-man answers I came up with so far:

1. All new relations would be in large format only. No 16384.N files, just a single 16384 file that can grow to MaxBlockNumber * BLCKSZ, about 32TB with the default block size (see the arithmetic sketch below).

2. The existence of a file 16384.1 means that this smgr relation is in legacy segmented format that came from pg_upgrade. (Note that we don't unlink that file once it exists, even when truncating the fork, until we eventually drop the relation.)

3. Forks that were pg_upgrade'd from earlier releases using hard links or reflinks would implicitly be in large format if they only had one segment; otherwise they could stay in the traditional format for a grace period of N major releases, after which we'd plan to drop segment support. pg_upgrade's [ref]link mode would therefore be the only way to get a segmented relation, other than a developer-only trick for testing/debugging.

4. Every opportunity to convert a multi-segment fork to large format would be taken: pg_upgrade in copy mode, basebackup, CREATE DATABASE ... STRATEGY=FILE_COPY, VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch versions of all the cases I thought of so far in the attached patches.

5. The main places that do file-level copying of relations would use copy_file_range() to do the splicing, so that on file systems that are smart enough (XFS, ZFS, BTRFS, ...), with qualifying source and destination, the operation can be very fast. Other degrees of optimisation are available to the kernel too, even for file systems without block sharing magic (pushing block range copies down to hardware/network storage, etc).

The copy_file_range() stuff could also be proposed independently (I vaguely recall it was discussed a few times before); it's just that it really comes into its own when you start splicing files together, as needed here. It has also been adopted by FreeBSD with the same interface as Linux, and has an efficient implementation there in bleeding edge ZFS.

Stepping back, the main ideas are: (1) for some users of large databases, the conversion would happen painlessly at upgrade time, without their even really noticing, using modern file system facilities where possible for speed; (2) anyone who wants to defer it, because of a lack of fast copy_file_range() and a desire to avoid prolonged downtime by using links or reflinks, can put concatenation off for the next N releases, giving a total of 5 + N years of option to defer the work. In that case there are also many ways to proactively change to large format before the time comes, with varying degrees of granularity and disruption: for example, set up a new replica and fail over, or VACUUM FULL tables one at a time, etc.
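For a sense of scale on point 1, here's a back-of-envelope sketch (again illustrative, not from the patches) of the size ceiling for a single large-format file and the number of legacy segment files it replaces, using the default 8KB block size and MaxBlockNumber from block.h:

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ         8192
#define RELSEG_SIZE    131072
#define MaxBlockNumber ((uint32_t) 0xFFFFFFFE)	/* from block.h */

int
main(void)
{
	uint64_t	max_bytes = (uint64_t) MaxBlockNumber * BLCKSZ;
	uint32_t	max_segs = MaxBlockNumber / RELSEG_SIZE + 1;

	/* One large-format file can cover the whole addressable fork. */
	printf("max fork size: %llu bytes (~%llu TiB)\n",
		   (unsigned long long) max_bytes,
		   (unsigned long long) (max_bytes >> 40));

	/* The legacy format spreads the same fork over this many files. */
	printf("legacy segment files at that size: %u\n", max_segs);
	return 0;
}

In other words, a maximally sized fork is a single ~32TB file in large format, versus roughly 32 thousand 1GB files in the traditional scheme.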
There are plenty of things left to do in this patch set: pg_rewind doesn't understand optional segmentation yet, there are probably more things like that, and I expect there are some ssize_t vs pgoff_t confusions I missed that could bite a 32 bit system. But you can see the basics working on a typical system.

I am not aware of any modern/non-historic file system[2] that can't handle large files with ease. Anyone know of anything to worry about on that front? I think the main collateral damage would be weird old external tools, like some weird old version of Windows tar I occasionally see mentioned, that sort of thing, but that'd just be another case of "well don't use that then", I guess? What else might we need to think about, outside PostgreSQL? What other problems might occur inside PostgreSQL?

Clearly we'd need to figure out a decent strategy to automate testing of all of the relevant transitions. We could test the splicing code paths with an optional test suite that you might enable along with a small segment size (as we're already testing on CI and probably the build farm after the last round of segmentation bugs). To test the messy Windows off_t API stuff convincingly, we'd need actual > 4GB files, I think? Maybe doable cheaply with file system hole punching tricks; one possible shape for that is sketched below, after the references.

Speaking of file system holes, this patch set doesn't touch buffile.c. That code wants to use segments for two extra purposes: (1) parallel CREATE INDEX merges workers' output using segmentation tricks, as if there were holes in the file; this could perhaps be replaced with large files that make use of actual OS-level holes, but I didn't feel like additionally claiming that all computers have sparse files -- perhaps another approach is needed anyway; (2) buffile.c deliberately spreads large buffiles across multiple temporary tablespaces using segments, supposedly for space management reasons. So although buffile.c initially looks like a nice safe little place to start using large files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1] https://wiki.postgresql.org/wiki/AllComputers
[2] https://en.wikipedia.org/wiki/Comparison_of_file_systems
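On the > 4GB testing question, for what it's worth, here's one cheap way such a test file could be manufactured on a POSIX system with sparse file support (an illustrative sketch only, not part of the patch set; on 32 bit Linux it would need -D_FILE_OFFSET_BITS=64, and on Windows you'd want FSCTL_SET_SPARSE instead):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const off_t size = 5LL * 1024 * 1024 * 1024;	/* 5GiB, > 32 bit off_t */
	int			fd;

	fd = open("bigfile.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/*
	 * Extending with ftruncate() leaves a hole on file systems that support
	 * sparse files, so this consumes (almost) no disk space.
	 */
	if (ftruncate(fd, size) < 0)
	{
		perror("ftruncate");
		return 1;
	}

	/* Materialize one real byte at the end, past the 4GB boundary. */
	if (pwrite(fd, "x", 1, size - 1) != 1)
	{
		perror("pwrite");
		return 1;
	}

	close(fd);
	return 0;
}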
From b4b6f27af1d196f9d6b3b8d5991216666cf2900f Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Mon, 24 Apr 2023 18:04:43 +1200 Subject: [PATCH 01/11] Assert that pgoff_t is wide enough. On Windows, we know it's wide enough because we define it directly ourselves. On Unix, we use off_t, which may only be 32 bits wide on some systems, depending on compiler switches or macros. Make absolutely certain that we are not confused on this point with an assertion, or we'd corrupt large files. --- src/backend/storage/file/fd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 277a28fc13..053588a302 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -102,6 +102,9 @@ #include "utils/resowner_private.h" #include "utils/varlena.h" +StaticAssertDecl(sizeof(pgoff_t) >= 8, + "pgoff_t not big enough to support large files"); + /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */ #if defined(HAVE_SYNC_FILE_RANGE) #define PG_FLUSH_DATA_WORKS 1 -- 2.40.1
From 6154e35d35515a7536524b79cb7ccd6a39d41afe Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 11:24:51 +1300 Subject: [PATCH 02/11] Use pgoff_t in system call replacements on Windows. All modern Unix systems have 64 bit off_t, but Windows does not. Use our pgoff_t type in our POSIX-style replacement functions (lseek(), ftruncate(), pread(), pwrite() etc etc). Also in closely related functions like pg_pwrite_zeros(). --- configure | 6 +++ configure.ac | 1 + src/common/file_utils.c | 4 +- src/include/common/file_utils.h | 4 +- src/include/port.h | 2 +- src/include/port/pg_iovec.h | 4 +- src/include/port/win32_port.h | 23 ++++++++++-- src/port/meson.build | 1 + src/port/preadv.c | 2 +- src/port/pwritev.c | 2 +- src/port/win32ftruncate.c | 65 +++++++++++++++++++++++++++++++++ src/port/win32pread.c | 3 +- src/port/win32pwrite.c | 3 +- src/tools/msvc/Mkvcbuild.pm | 1 + 14 files changed, 106 insertions(+), 15 deletions(-) create mode 100644 src/port/win32ftruncate.c diff --git a/configure b/configure index 15daccc87f..47ba18491c 100755 --- a/configure +++ b/configure @@ -16537,6 +16537,12 @@ esac ;; esac + case " $LIBOBJS " in + *" win32ftruncate.$ac_objext "* ) ;; + *) LIBOBJS="$LIBOBJS win32ftruncate.$ac_objext" + ;; +esac + case " $LIBOBJS " in *" win32getrusage.$ac_objext "* ) ;; *) LIBOBJS="$LIBOBJS win32getrusage.$ac_objext" diff --git a/configure.ac b/configure.ac index 97f5be6c73..2b3b1b4dca 100644 --- a/configure.ac +++ b/configure.ac @@ -1905,6 +1905,7 @@ if test "$PORTNAME" = "win32"; then AC_LIBOBJ(win32env) AC_LIBOBJ(win32error) AC_LIBOBJ(win32fdatasync) + AC_LIBOBJ(win32ftruncate) AC_LIBOBJ(win32getrusage) AC_LIBOBJ(win32link) AC_LIBOBJ(win32ntdll) diff --git a/src/common/file_utils.c b/src/common/file_utils.c index 74833c4acb..7a63434bc4 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -469,7 +469,7 @@ get_dirent_type(const char *path, * error is returned, it is unspecified how much has been written. */ ssize_t -pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) +pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset) { struct iovec iov_copy[PG_IOV_MAX]; ssize_t sum = 0; @@ -538,7 +538,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) * is returned with errno set. */ ssize_t -pg_pwrite_zeros(int fd, size_t size, off_t offset) +pg_pwrite_zeros(int fd, size_t size, pgoff_t offset) { static const PGIOAlignedBlock zbuffer = {{0}}; /* worth BLCKSZ */ void *zerobuf_addr = unconstify(PGIOAlignedBlock *, &zbuffer)->data; diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h index b7efa1226d..534277b12d 100644 --- a/src/include/common/file_utils.h +++ b/src/include/common/file_utils.h @@ -42,8 +42,8 @@ extern PGFileType get_dirent_type(const char *path, extern ssize_t pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, - off_t offset); + pgoff_t offset); -extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset); +extern ssize_t pg_pwrite_zeros(int fd, size_t size, pgoff_t offset); #endif /* FILE_UTILS_H */ diff --git a/src/include/port.h b/src/include/port.h index a88d403483..f7707a390e 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -368,7 +368,7 @@ extern FILE *pgwin32_popen(const char *command, const char *type); * When necessary, these routines are provided by files in src/port/. 
*/ -/* Type to use with fseeko/ftello */ +/* Type to use with lseek/ftruncate/pread/fseeko/ftello */ #ifndef WIN32 /* WIN32 is handled in port/win32_port.h */ #define pgoff_t off_t #endif diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h index 689799c425..c762fab662 100644 --- a/src/include/port/pg_iovec.h +++ b/src/include/port/pg_iovec.h @@ -43,13 +43,13 @@ struct iovec #if HAVE_DECL_PREADV #define pg_preadv preadv #else -extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset); +extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset); #endif #if HAVE_DECL_PWRITEV #define pg_pwritev pwritev #else -extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset); +extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset); #endif #endif /* PG_IOVEC_H */ diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h index 58965e0dfd..c757687386 100644 --- a/src/include/port/win32_port.h +++ b/src/include/port/win32_port.h @@ -76,11 +76,19 @@ #undef fstat #undef stat +/* and likewise for lseek hack */ +#define lseek microsoft_native_lseek +#include <io.h> +#undef lseek + +/* and also ftruncate, as defined by MinGW headers with 32 bit offset */ +#define ftruncate mingw_native_ftruncate +#include <unistd.h> +#undef ftruncate + /* Must be here to avoid conflicting with prototype in windows.h */ #define mkdir(a,b) mkdir(a) -#define ftruncate(a,b) chsize(a,b) - /* Windows doesn't have fsync() as such, use _commit() */ #define fsync(fd) _commit(fd) @@ -219,6 +227,7 @@ extern int _pgfseeko64(FILE *stream, pgoff_t offset, int origin); extern pgoff_t _pgftello64(FILE *stream); #define fseeko(stream, offset, origin) _pgfseeko64(stream, offset, origin) #define ftello(stream) _pgftello64(stream) +#define lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin)) #else #ifndef fseeko #define fseeko(stream, offset, origin) fseeko64(stream, offset, origin) @@ -226,7 +235,13 @@ extern pgoff_t _pgftello64(FILE *stream); #ifndef ftello #define ftello(stream) ftello64(stream) #endif +#ifndef lseek +#define lseek(fd, offset, origin) _lseeki64((fd), (offset), (origin)) #endif +#endif + +/* 64 bit ftruncate is in win32ftruncate.c */ +extern int ftruncate(int fd, pgoff_t length); /* * Win32 also doesn't have symlinks, but we can emulate them with @@ -586,9 +601,9 @@ typedef unsigned short mode_t; #endif /* in port/win32pread.c */ -extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset); +extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, pgoff_t offset); /* in port/win32pwrite.c */ -extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset); +extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, pgoff_t offset); #endif /* PG_WIN32_PORT_H */ diff --git a/src/port/meson.build b/src/port/meson.build index 24416b9bfc..54ce59806a 100644 --- a/src/port/meson.build +++ b/src/port/meson.build @@ -35,6 +35,7 @@ if host_system == 'windows' 'win32error.c', 'win32fdatasync.c', 'win32fseek.c', + 'win32ftruncate.c', 'win32getrusage.c', 'win32link.c', 'win32ntdll.c', diff --git a/src/port/preadv.c b/src/port/preadv.c index e762283e67..6e5e92234f 100644 --- a/src/port/preadv.c +++ b/src/port/preadv.c @@ -19,7 +19,7 @@ #include "port/pg_iovec.h" ssize_t -pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset) +pg_preadv(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset) { ssize_t sum = 0; ssize_t part; diff --git 
a/src/port/pwritev.c b/src/port/pwritev.c index 519de45037..c430f99806 100644 --- a/src/port/pwritev.c +++ b/src/port/pwritev.c @@ -19,7 +19,7 @@ #include "port/pg_iovec.h" ssize_t -pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset) +pg_pwritev(int fd, const struct iovec *iov, int iovcnt, pgoff_t offset) { ssize_t sum = 0; ssize_t part; diff --git a/src/port/win32ftruncate.c b/src/port/win32ftruncate.c new file mode 100644 index 0000000000..5e6d4f3e92 --- /dev/null +++ b/src/port/win32ftruncate.c @@ -0,0 +1,65 @@ +/*------------------------------------------------------------------------- + * + * win32ftruncate.c + * Win32 ftruncate() replacement + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group + * + * src/port/win32ftruncate.c + * + *------------------------------------------------------------------------- + */ + +#ifdef FRONTEND +#include "postgres_fe.h" +#else +#include "postgres.h" +#endif + +int +ftruncate(int fd, pgoff_t length) +{ + HANDLE handle; + pgoff_t save_position; + + /* + * We can't use chsize() because it works with 32 bit off_t. We can't use + * _chsize_s() because it isn't available in MinGW. So we have to use + * SetEndOfFile(), but that works with the current position. So we save + * and restore it. + */ + + handle = (HANDLE) _get_osfhandle(fd); + if (handle == INVALID_HANDLE_VALUE) + { + errno = EBADF; + return -1; + } + + save_position = lseek(fd, 0, SEEK_CUR); + if (save_position < 0) + return -1; + + if (lseek(fd, length, SEEK_SET) < 0) + { + int save_errno = errno; + lseek(fd, save_position, SEEK_SET); + errno = save_errno; + return -1; + } + + if (!SetEndOfFile(handle)) + { + int save_errno; + + _dosmaperr(GetLastError()); + save_errno = errno; + lseek(fd, save_position, SEEK_SET); + errno = save_errno; + return -1; + } + lseek(fd, save_position, SEEK_SET); + + return 0; +} diff --git a/src/port/win32pread.c b/src/port/win32pread.c index 905cf9f42b..6e6366faaa 100644 --- a/src/port/win32pread.c +++ b/src/port/win32pread.c @@ -17,7 +17,7 @@ #include <windows.h> ssize_t -pg_pread(int fd, void *buf, size_t size, off_t offset) +pg_pread(int fd, void *buf, size_t size, pgoff_t offset) { OVERLAPPED overlapped = {0}; HANDLE handle; @@ -32,6 +32,7 @@ pg_pread(int fd, void *buf, size_t size, off_t offset) /* Note that this changes the file position, despite not using it. */ overlapped.Offset = offset; + overlapped.OffsetHigh = offset >> 32; if (!ReadFile(handle, buf, size, &result, &overlapped)) { if (GetLastError() == ERROR_HANDLE_EOF) diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c index 5dd10821cf..90dd93dbc5 100644 --- a/src/port/win32pwrite.c +++ b/src/port/win32pwrite.c @@ -17,7 +17,7 @@ #include <windows.h> ssize_t -pg_pwrite(int fd, const void *buf, size_t size, off_t offset) +pg_pwrite(int fd, const void *buf, size_t size, pgoff_t offset) { OVERLAPPED overlapped = {0}; HANDLE handle; @@ -32,6 +32,7 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset) /* Note that this changes the file position, despite not using it. 
*/ overlapped.Offset = offset; + overlapped.OffsetHigh = offset >> 32; if (!WriteFile(handle, buf, size, &result, &overlapped)) { _dosmaperr(GetLastError()); diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 958206f315..4b96c2bb44 100644 --- a/src/tools/msvc/Mkvcbuild.pm +++ b/src/tools/msvc/Mkvcbuild.pm @@ -113,6 +113,7 @@ sub mkvcbuild win32env.c win32error.c win32fdatasync.c win32fseek.c + win32ftruncate.c win32getrusage.c win32gettimeofday.c win32link.c -- 2.40.1
From 2782d8c1b5c6ff266488536c49cb3a4d4a7b4da6 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 11:27:16 +1300 Subject: [PATCH 03/11] Support large files on Windows in our VFD API. All fd.c interfaces that take off_t now need to use pgoff_t instead, because we can't use Windows' 32 bit off_t. --- src/backend/storage/file/fd.c | 30 +++++++++++++++--------------- src/include/storage/fd.h | 20 ++++++++++---------- 2 files changed, 25 insertions(+), 25 deletions(-) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 053588a302..f5e194a797 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -204,7 +204,7 @@ typedef struct vfd File nextFree; /* link to next free VFD, if in freelist */ File lruMoreRecently; /* doubly linked recency-of-use list */ File lruLessRecently; - off_t fileSize; /* current size of file (0 if not temporary) */ + pgoff_t fileSize; /* current size of file (0 if not temporary) */ char *fileName; /* name of file, or NULL for unused VFD */ /* NB: fileName is malloc'd, and must be free'd when closing the VFD */ int fileFlags; /* open(2) flags for (re)opening the file */ @@ -463,7 +463,7 @@ pg_fdatasync(int fd) * offset of 0 with nbytes 0 means that the entire file should be flushed */ void -pg_flush_data(int fd, off_t offset, off_t nbytes) +pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes) { /* * Right now file flushing is primarily used to avoid making later @@ -636,7 +636,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes) * Truncate a file to a given length by name. */ int -pg_truncate(const char *path, off_t length) +pg_truncate(const char *path, pgoff_t length) { #ifdef WIN32 int save_errno; @@ -1439,7 +1439,7 @@ FileAccess(File file) * Called whenever a temporary file is deleted to report its size. */ static void -ReportTemporaryFileUsage(const char *path, off_t size) +ReportTemporaryFileUsage(const char *path, pgoff_t size) { pgstat_report_tempfile(size); @@ -1989,7 +1989,7 @@ FileClose(File file) * to read into. 
*/ int -FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info) +FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info) { #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED) int returnCode; @@ -2017,7 +2017,7 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info) } void -FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info) +FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info) { int returnCode; @@ -2043,7 +2043,7 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info) } int -FileRead(File file, void *buffer, size_t amount, off_t offset, +FileRead(File file, void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info) { int returnCode; @@ -2099,7 +2099,7 @@ retry: } int -FileWrite(File file, const void *buffer, size_t amount, off_t offset, +FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info) { int returnCode; @@ -2128,7 +2128,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset, */ if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT)) { - off_t past_write = offset + amount; + pgoff_t past_write = offset + amount; if (past_write > vfdP->fileSize) { @@ -2160,7 +2160,7 @@ retry: */ if (vfdP->fdstate & FD_TEMP_FILE_LIMIT) { - off_t past_write = offset + amount; + pgoff_t past_write = offset + amount; if (past_write > vfdP->fileSize) { @@ -2224,7 +2224,7 @@ FileSync(File file, uint32 wait_event_info) * appropriate error. */ int -FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) +FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info) { int returnCode; ssize_t written; @@ -2269,7 +2269,7 @@ FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info) * appropriate error. 
*/ int -FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) +FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info) { #ifdef HAVE_POSIX_FALLOCATE int returnCode; @@ -2305,7 +2305,7 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info) return FileZero(file, offset, amount, wait_event_info); } -off_t +pgoff_t FileSize(File file) { Assert(FileIsValid(file)); @@ -2316,14 +2316,14 @@ FileSize(File file) if (FileIsNotOpen(file)) { if (FileAccess(file) < 0) - return (off_t) -1; + return (pgoff_t) -1; } return lseek(VfdCache[file].fd, 0, SEEK_END); } int -FileTruncate(File file, off_t offset, uint32 wait_event_info) +FileTruncate(File file, pgoff_t offset, uint32 wait_event_info) { int returnCode; diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 6791a406fc..a4528428ff 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -110,16 +110,16 @@ extern File PathNameOpenFile(const char *fileName, int fileFlags); extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode); extern File OpenTemporaryFile(bool interXact); extern void FileClose(File file); -extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info); -extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info); -extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info); +extern int FilePrefetch(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info); +extern int FileRead(File file, void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info); +extern int FileWrite(File file, const void *buffer, size_t amount, pgoff_t offset, uint32 wait_event_info); extern int FileSync(File file, uint32 wait_event_info); -extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info); -extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info); +extern int FileZero(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info); +extern int FileFallocate(File file, pgoff_t offset, pgoff_t amount, uint32 wait_event_info); -extern off_t FileSize(File file); -extern int FileTruncate(File file, off_t offset, uint32 wait_event_info); -extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info); +extern pgoff_t FileSize(File file); +extern int FileTruncate(File file, pgoff_t offset, uint32 wait_event_info); +extern void FileWriteback(File file, pgoff_t offset, pgoff_t nbytes, uint32 wait_event_info); extern char *FilePathName(File file); extern int FileGetRawDesc(File file); extern int FileGetRawFlags(File file); @@ -186,8 +186,8 @@ extern int pg_fsync(int fd); extern int pg_fsync_no_writethrough(int fd); extern int pg_fsync_writethrough(int fd); extern int pg_fdatasync(int fd); -extern void pg_flush_data(int fd, off_t offset, off_t nbytes); -extern int pg_truncate(const char *path, off_t length); +extern void pg_flush_data(int fd, pgoff_t offset, pgoff_t nbytes); +extern int pg_truncate(const char *path, pgoff_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int elevel); -- 2.40.1
From ed3a5558a03afaabb7c4c206c053c288c104cb02 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 12:36:55 +1300 Subject: [PATCH 04/11] Use pgoff_t instead of off_t in more places. XXX Incomplete --- src/backend/access/heap/rewriteheap.c | 2 +- src/backend/backup/basebackup.c | 7 ++++--- src/backend/storage/file/copydir.c | 4 ++-- src/bin/pg_basebackup/receivelog.c | 2 +- src/bin/pg_rewind/file_ops.c | 4 ++-- src/bin/pg_rewind/file_ops.h | 4 ++-- src/bin/pg_rewind/filemap.c | 2 ++ src/bin/pg_rewind/libpq_source.c | 6 +++--- src/bin/pg_rewind/local_source.c | 8 ++++---- src/bin/pg_rewind/pg_rewind.c | 2 +- src/bin/pg_rewind/rewind_source.h | 2 +- src/include/access/heapam_xlog.h | 2 +- 12 files changed, 24 insertions(+), 21 deletions(-) diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 424958912c..5e5b00d25a 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -194,7 +194,7 @@ typedef struct RewriteMappingFile { TransactionId xid; /* xid that might need to see the row */ int vfd; /* fd of mappings file */ - off_t off; /* how far have we written yet */ + pgoff_t off; /* how far have we written yet */ dclist_head mappings; /* list of in-memory mappings */ char path[MAXPGPATH]; /* path, for error messages */ } RewriteMappingFile; diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c index 5baea7535b..2dcc04fef2 100644 --- a/src/backend/backup/basebackup.c +++ b/src/backend/backup/basebackup.c @@ -95,7 +95,8 @@ static void perform_base_backup(basebackup_options *opt, bbsink *sink); static void parse_basebackup_options(List *options, basebackup_options *opt); static int compareWalFileNames(const ListCell *a, const ListCell *b); static bool is_checksummed_file(const char *fullpath, const char *filename); -static int basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, +static int basebackup_read_file(int fd, char *buf, size_t nbytes, + pgoff_t offset, const char *filename, bool partial_read_ok); /* Was the backup currently in-progress initiated in recovery mode? */ @@ -1488,7 +1489,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename, bool block_retry = false; uint16 checksum; int checksum_failures = 0; - off_t cnt; + pgoff_t cnt; int i; pgoff_t len = 0; char *page; @@ -1827,7 +1828,7 @@ convert_link_to_directory(const char *pathbuf, struct stat *statbuf) * Returns the number of bytes read. 
*/ static int -basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset, +basebackup_read_file(int fd, char *buf, size_t nbytes, pgoff_t offset, const char *filename, bool partial_read_ok) { int rc; diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index e04bc3941a..82f77536b4 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -120,8 +120,8 @@ copy_file(const char *fromfile, const char *tofile) int srcfd; int dstfd; int nbytes; - off_t offset; - off_t flush_offset; + pgoff_t offset; + pgoff_t flush_offset; /* Size of copy buffer (read and write requests) */ #define COPY_BUF_SIZE (8 * BLCKSZ) diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c index 504d82bef6..e69ad912a2 100644 --- a/src/bin/pg_basebackup/receivelog.c +++ b/src/bin/pg_basebackup/receivelog.c @@ -192,7 +192,7 @@ static bool close_walfile(StreamCtl *stream, XLogRecPtr pos) { char *fn; - off_t currpos; + pgoff_t currpos; int r; char walfile_name[MAXPGPATH]; diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c index 25996b4da4..3e96b8b0a8 100644 --- a/src/bin/pg_rewind/file_ops.c +++ b/src/bin/pg_rewind/file_ops.c @@ -85,7 +85,7 @@ close_target_file(void) } void -write_target_range(char *buf, off_t begin, size_t size) +write_target_range(char *buf, pgoff_t begin, size_t size) { size_t writeleft; char *p; @@ -203,7 +203,7 @@ remove_target_file(const char *path, bool missing_ok) } void -truncate_target_file(const char *path, off_t newsize) +truncate_target_file(const char *path, pgoff_t newsize) { char dstpath[MAXPGPATH]; int fd; diff --git a/src/bin/pg_rewind/file_ops.h b/src/bin/pg_rewind/file_ops.h index 427cf8e0b5..41a41cb6cb 100644 --- a/src/bin/pg_rewind/file_ops.h +++ b/src/bin/pg_rewind/file_ops.h @@ -13,10 +13,10 @@ #include "filemap.h" extern void open_target_file(const char *path, bool trunc); -extern void write_target_range(char *buf, off_t begin, size_t size); +extern void write_target_range(char *buf, pgoff_t begin, size_t size); extern void close_target_file(void); extern void remove_target_file(const char *path, bool missing_ok); -extern void truncate_target_file(const char *path, off_t newsize); +extern void truncate_target_file(const char *path, pgoff_t newsize); extern void create_target(file_entry_t *entry); extern void remove_target(file_entry_t *entry); extern void sync_target_dir(void); diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c index bd5c598e20..a5855ccaa9 100644 --- a/src/bin/pg_rewind/filemap.c +++ b/src/bin/pg_rewind/filemap.c @@ -296,6 +296,8 @@ process_target_wal_block_change(ForkNumber forknum, RelFileLocator rlocator, BlockNumber blkno_inseg; int segno; + /* XXX We need to know if it is segmented! 
*/ + segno = blkno / RELSEG_SIZE; blkno_inseg = blkno % RELSEG_SIZE; diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c index 5f486b2a61..d4832ccb76 100644 --- a/src/bin/pg_rewind/libpq_source.c +++ b/src/bin/pg_rewind/libpq_source.c @@ -30,7 +30,7 @@ typedef struct { const char *path; /* path relative to data directory root */ - off_t offset; + pgoff_t offset; size_t length; } fetch_range_request; @@ -65,7 +65,7 @@ static void libpq_traverse_files(rewind_source *source, process_file_callback_t callback); static void libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len); static void libpq_queue_fetch_range(rewind_source *source, const char *path, - off_t off, size_t len); + pgoff_t off, size_t len); static void libpq_finish_fetch(rewind_source *source); static char *libpq_fetch_file(rewind_source *source, const char *path, size_t *filesize); @@ -343,7 +343,7 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len) * Queue up a request to fetch a piece of a file from remote system. */ static void -libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off, +libpq_queue_fetch_range(rewind_source *source, const char *path, pgoff_t off, size_t len) { libpq_source *src = (libpq_source *) source; diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c index 4e2a1376c6..fb84309c12 100644 --- a/src/bin/pg_rewind/local_source.c +++ b/src/bin/pg_rewind/local_source.c @@ -32,7 +32,7 @@ static char *local_fetch_file(rewind_source *source, const char *path, static void local_queue_fetch_file(rewind_source *source, const char *path, size_t len); static void local_queue_fetch_range(rewind_source *source, const char *path, - off_t off, size_t len); + pgoff_t off, size_t len); static void local_finish_fetch(rewind_source *source); static void local_destroy(rewind_source *source); @@ -125,15 +125,15 @@ local_queue_fetch_file(rewind_source *source, const char *path, size_t len) * Copy a file from source to target, starting at 'off', for 'len' bytes. */ static void -local_queue_fetch_range(rewind_source *source, const char *path, off_t off, +local_queue_fetch_range(rewind_source *source, const char *path, pgoff_t off, size_t len) { const char *datadir = ((local_source *) source)->datadir; PGIOAlignedBlock buf; char srcpath[MAXPGPATH]; int srcfd; - off_t begin = off; - off_t end = off + len; + pgoff_t begin = off; + pgoff_t end = off + len; snprintf(srcpath, sizeof(srcpath), "%s/%s", datadir, path); diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index f7f3b8227f..500842e169 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -566,7 +566,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source, { datapagemap_iterator_t *iter; BlockNumber blkno; - off_t offset; + pgoff_t offset; iter = datapagemap_iterate(&entry->target_pages_to_overwrite); while (datapagemap_next(iter, &blkno)) diff --git a/src/bin/pg_rewind/rewind_source.h b/src/bin/pg_rewind/rewind_source.h index 69ad0e495f..e17526ce86 100644 --- a/src/bin/pg_rewind/rewind_source.h +++ b/src/bin/pg_rewind/rewind_source.h @@ -45,7 +45,7 @@ typedef struct rewind_source * queue and execute all requests. 
*/ void (*queue_fetch_range) (struct rewind_source *, const char *path, - off_t offset, size_t len); + pgoff_t offset, size_t len); /* * Like queue_fetch_range(), but requests replacing the whole local file diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index a038450787..d82cd027f4 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -396,7 +396,7 @@ typedef struct xl_heap_rewrite_mapping TransactionId mapped_xid; /* xid that might need to see the row */ Oid mapped_db; /* DbOid or InvalidOid for shared rels */ Oid mapped_rel; /* Oid of the mapped relation */ - off_t offset; /* How far have we written so far */ + pgoff_t offset; /* How far have we written so far */ uint32 num_mappings; /* Number of in-memory mappings */ XLogRecPtr start_lsn; /* Insert LSN at begin of rewrite */ } xl_heap_rewrite_mapping; -- 2.40.1
From d22479403d02944e6c2569897816137f8582c6f1 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 5 Mar 2023 11:51:15 +1300 Subject: [PATCH 05/11] Use large files for relation storage. Traditionally we broke files up into 1GB segments (configurable) to support older OSes before the industry transition to "large files" in the mid 90s. These days, the only remaining consideration on living operating systems is that Windows still has 32 bit types in a few interfaces, but we deal with that by being careful to use pgoff_t everywhere instead of off_t. Having many segment files creates extra work for the kernel, which must manage many more descriptors, and extra work for PostgreSQL, which must close and reopen them to stay under per-process descriptor limits. With this patch, all new relations will be non-segmented. The only way to have a segmented relation is to inherit it via pg_upgrade. For some number of releases, legacy segmented relations will be supported, and can be upgraded to non-segmented format by any operation that rewrites the relation, creating a new relfilenode (VACUUM FULL, etc). --- src/backend/storage/smgr/md.c | 227 +++++++++++++++++++++++++++------- src/include/storage/smgr.h | 1 + 2 files changed, 181 insertions(+), 47 deletions(-) diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index e982a8dd7f..005a7a15bf 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -42,6 +42,14 @@ #include "utils/memutils.h" /* + * The magnetic disk storage manager assumes that the operating system + * supports "large files". Historically, this wasn't the case, so there is + * support for "segmented" files that were upgraded from earlier releases. + * A future release may eventually drop support for those. See + * md_fork_is_segmented() for details. + * + * The following paragraphs describe the historical behavior. + * * The magnetic disk storage manager keeps track of open file * descriptors in its own descriptor pool.
This is done to make it * easier to support relations that are larger than the operating @@ -119,6 +127,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */ /* don't try to open a segment, if not already open */ #define EXTENSION_DONT_OPEN (1 << 5) +#define MD_FORK_SEGMENTED_UNKNOWN 'u' +#define MD_FORK_SEGMENTED_FALSE 'f' +#define MD_FORK_SEGMENTED_TRUE 't' /* local routines */ static void mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, @@ -139,8 +150,11 @@ static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno, int oflags); static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, bool skipFsync, int behavior); +static pgoff_t getseekpos(SMgrRelation reln, ForkNumber forknum, + BlockNumber blocknum); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); +static bool md_fork_is_segmented(SMgrRelation reln, ForkNumber forknum); static inline int _mdfd_open_flags(void) @@ -459,7 +473,7 @@ void mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync) { - off_t seekpos; + pgoff_t seekpos; int nbytes; MdfdVec *v; @@ -486,10 +500,7 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, InvalidBlockNumber))); v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_CREATE); - - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ) { @@ -511,7 +522,8 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, if (!skipFsync && !SmgrIsTemp(reln)) register_dirty_segment(reln, forknum, v); - Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); + if (md_fork_is_segmented(reln, forknum)) + Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); } /* @@ -549,20 +561,30 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, while (remblocks > 0) { - BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE); - off_t seekpos = (off_t) BLCKSZ * segstartblock; + BlockNumber segstartblock; + pgoff_t seekpos; int numblocks; - if (segstartblock + remblocks > RELSEG_SIZE) - numblocks = RELSEG_SIZE - segstartblock; + if (md_fork_is_segmented(reln, forknum)) + { + segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE); + seekpos = (pgoff_t) BLCKSZ * segstartblock; + if (segstartblock + remblocks > RELSEG_SIZE) + numblocks = RELSEG_SIZE - segstartblock; + else + numblocks = remblocks; + Assert(segstartblock < RELSEG_SIZE); + Assert(segstartblock + numblocks <= RELSEG_SIZE); + } else + { + segstartblock = curblocknum; + seekpos = (pgoff_t) BLCKSZ * segstartblock; numblocks = remblocks; + } v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE); - Assert(segstartblock < RELSEG_SIZE); - Assert(segstartblock + numblocks <= RELSEG_SIZE); - /* * If available and useful, use posix_fallocate() (via FileAllocate()) * to extend the relation. That's often more efficient than using @@ -579,7 +601,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, int ret; ret = FileFallocate(v->mdfd_vfd, - seekpos, (off_t) BLCKSZ * numblocks, + seekpos, (pgoff_t) BLCKSZ * numblocks, WAIT_EVENT_DATA_FILE_EXTEND); if (ret != 0) { @@ -602,7 +624,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, * zeroed buffer for the whole length of the extension. 
*/ ret = FileZero(v->mdfd_vfd, - seekpos, (off_t) BLCKSZ * numblocks, + seekpos, (pgoff_t) BLCKSZ * numblocks, WAIT_EVENT_DATA_FILE_EXTEND); if (ret < 0) ereport(ERROR, @@ -615,7 +637,8 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum, if (!skipFsync && !SmgrIsTemp(reln)) register_dirty_segment(reln, forknum, v); - Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); + if (md_fork_is_segmented(reln, forknum)) + Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE)); remblocks -= numblocks; curblocknum += numblocks; @@ -644,7 +667,6 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior) return &reln->md_seg_fds[forknum][0]; path = relpath(reln->smgr_rlocator, forknum); - fd = PathNameOpenFile(path, _mdfd_open_flags()); if (fd < 0) @@ -667,7 +689,8 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior) mdfd->mdfd_vfd = fd; mdfd->mdfd_segno = 0; - Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE)); + if (md_fork_is_segmented(reln, forknum)) + Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE)); return mdfd; } @@ -680,7 +703,10 @@ mdopen(SMgrRelation reln) { /* mark it not open */ for (int forknum = 0; forknum <= MAX_FORKNUM; forknum++) + { + reln->md_segmented[forknum] = MD_FORK_SEGMENTED_UNKNOWN; reln->md_num_open_segs[forknum] = 0; + } } /* @@ -713,7 +739,7 @@ bool mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) { #ifdef USE_PREFETCH - off_t seekpos; + pgoff_t seekpos; MdfdVec *v; Assert((io_direct_flags & IO_DIRECT_DATA) == 0); @@ -723,9 +749,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) if (v == NULL) return false; - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); (void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH); #endif /* USE_PREFETCH */ @@ -752,10 +776,8 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum, while (nblocks > 0) { BlockNumber nflush = nblocks; - off_t seekpos; + pgoff_t seekpos; MdfdVec *v; - int segnum_start, - segnum_end; v = _mdfd_getseg(reln, forknum, blocknum, true /* not used */ , EXTENSION_DONT_OPEN); @@ -770,20 +792,26 @@ mdwriteback(SMgrRelation reln, ForkNumber forknum, if (!v) return; - /* compute offset inside the current segment */ - segnum_start = blocknum / RELSEG_SIZE; + if (md_fork_is_segmented(reln, forknum)) + { + int segnum_start, + segnum_end; + + /* compute offset inside the current segment */ + segnum_start = blocknum / RELSEG_SIZE; - /* compute number of desired writes within the current segment */ - segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE; - if (segnum_start != segnum_end) - nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)); + /* compute number of desired writes within the current segment */ + segnum_end = (blocknum + nblocks - 1) / RELSEG_SIZE; + if (segnum_start != segnum_end) + nflush = RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)); - Assert(nflush >= 1); - Assert(nflush <= nblocks); + Assert(nflush >= 1); + Assert(nflush <= nblocks); + } - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + seekpos = getseekpos(reln, forknum, blocknum); - FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH); + FileWriteback(v->mdfd_vfd, seekpos, (pgoff_t) BLCKSZ * nflush, WAIT_EVENT_DATA_FILE_FLUSH); nblocks -= nflush; blocknum += nflush; @@ -797,7 +825,7 @@ void 
mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, void *buffer) { - off_t seekpos; + pgoff_t seekpos; int nbytes; MdfdVec *v; @@ -814,9 +842,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ); @@ -866,7 +892,7 @@ void mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, const void *buffer, bool skipFsync) { - off_t seekpos; + pgoff_t seekpos; int nbytes; MdfdVec *v; @@ -888,9 +914,7 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, v = _mdfd_getseg(reln, forknum, blocknum, skipFsync, EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY); - seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); - - Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE); + seekpos = getseekpos(reln, forknum, blocknum); nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE); @@ -962,6 +986,13 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum) for (;;) { nblocks = _mdnblocks(reln, forknum, v); + + if (!md_fork_is_segmented(reln, forknum)) + { + Assert(segno == 0); + return nblocks; + } + if (nblocks > ((BlockNumber) RELSEG_SIZE)) elog(FATAL, "segment too big"); if (nblocks < ((BlockNumber) RELSEG_SIZE)) @@ -1013,6 +1044,25 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) if (nblocks == curnblk) return; /* no work */ + if (!md_fork_is_segmented(reln, forknum)) + { + MdfdVec *v; + + Assert(reln->md_num_open_segs[forknum] == 1); + v = &reln->md_seg_fds[forknum][0]; + + if (FileTruncate(v->mdfd_vfd, (pgoff_t) nblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not truncate file \"%s\" to %u blocks: %m", + FilePathName(v->mdfd_vfd), + nblocks))); + if (!SmgrIsTemp(reln)) + register_dirty_segment(reln, forknum, v); + + return; + } + /* * Truncate segments, starting at the last one. Starting at the end makes * managing the memory for the fd array easier, should there be errors. 
@@ -1058,7 +1108,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) */ BlockNumber lastsegblocks = nblocks - priorblocks; - if (FileTruncate(v->mdfd_vfd, (off_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0) + if (FileTruncate(v->mdfd_vfd, (pgoff_t) lastsegblocks * BLCKSZ, WAIT_EVENT_DATA_FILE_TRUNCATE) < 0) ereport(ERROR, (errcode_for_file_access(), errmsg("could not truncate file \"%s\" to %u blocks: %m", @@ -1396,7 +1446,10 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, (EXTENSION_FAIL | EXTENSION_CREATE | EXTENSION_RETURN_NULL | EXTENSION_DONT_OPEN)); - targetseg = blkno / ((BlockNumber) RELSEG_SIZE); + if (md_fork_is_segmented(reln, forknum)) + targetseg = blkno / ((BlockNumber) RELSEG_SIZE); + else + targetseg = 0; /* if an existing and opened segment, we're done */ if (targetseg < reln->md_num_open_segs[forknum]) @@ -1433,7 +1486,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, Assert(nextsegno == v->mdfd_segno + 1); - if (nblocks > ((BlockNumber) RELSEG_SIZE)) + if (md_fork_is_segmented(reln, forknum) && + nblocks > ((BlockNumber) RELSEG_SIZE)) elog(FATAL, "segment too big"); if ((behavior & EXTENSION_CREATE) || @@ -1493,6 +1547,9 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, blkno, nblocks))); } + if (!md_fork_is_segmented(reln, forknum)) + break; + v = _mdfd_openseg(reln, forknum, nextsegno, flags); if (v == NULL) @@ -1511,13 +1568,22 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, return v; } +static pgoff_t +getseekpos(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) +{ + if (md_fork_is_segmented(reln, forknum)) + return (pgoff_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE)); + + return (pgoff_t) BLCKSZ * blocknum; +} + /* * Get number of blocks present in a single disk file */ static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg) { - off_t len; + pgoff_t len; len = FileSize(seg->mdfd_vfd); if (len < 0) @@ -1618,3 +1684,70 @@ mdfiletagmatches(const FileTag *ftag, const FileTag *candidate) */ return ftag->rlocator.dbOid == candidate->rlocator.dbOid; } + +/* + * Is this fork in legacy segmented format, inherited from an earlier release + * via pg_upgrade? + */ +static bool +md_fork_is_segmented(SMgrRelation reln, ForkNumber forknum) +{ + char path_probe[MAXPGPATH]; + char *path; + + Assert(forknum >= 0 && forknum <= MAX_FORKNUM); + + /* Fast return if we have the answer cached. */ + if (reln->md_segmented[forknum] == MD_FORK_SEGMENTED_FALSE) + return false; + if (reln->md_segmented[forknum] == MD_FORK_SEGMENTED_TRUE) + return true; + + Assert(reln->md_segmented[forknum] == MD_FORK_SEGMENTED_UNKNOWN); + + /* + * All backends must agree, using only clues from the file system, and the + * answer must not change for as long as this relation exists. The + * correctness of this strategy depends on the following properties: + * + * 1. When segmented forks are truncated, their higher numbered segments + * are truncated to size zero, but they still exist. That is, higher + * segments won't be unlinked for as long as the relation exists. + * + * 2. We don't create new segmented relations, so the only way they can + * exist is if we inherited them via pg_upgrade from an earlier + * release. + * + * 3. Relations that never had more than one segment and were pg_upgraded + * are indistinguishable from newly created (non-segmented) relations. + * + * 4.
If the relfilenode is recycled for a later relation, all backends + * will close all segments first before potentially reopening the next + * generation, either via the sinval or ProcSignalBarrier cache + * invalidation system. + * + * Therefore, it is safe for every backend to determine whether the fork is + * segmented by checking the existence of a ".1" file. + */ + path = relpath(reln->smgr_rlocator, forknum); + snprintf(path_probe, sizeof(path_probe), "%s.1", path); + if (access(path_probe, F_OK) == 0) + { + pfree(path); + reln->md_segmented[forknum] = MD_FORK_SEGMENTED_TRUE; + return true; + } + else if (errno == ENOENT) + { + pfree(path); + reln->md_segmented[forknum] = MD_FORK_SEGMENTED_FALSE; + return false; + } + pfree(path); + + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access file \"%s\": %m", + path_probe))); + pg_unreachable(); +} diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a9a179aaba..e352a035be 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -65,6 +65,7 @@ typedef struct SMgrRelationData * for md.c; per-fork arrays of the number of open segments * (md_num_open_segs) and the segments themselves (md_seg_fds). */ + char md_segmented[MAX_FORKNUM + 1]; int md_num_open_segs[MAX_FORKNUM + 1]; struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1]; -- 2.40.1
From d1ffce7141cd34eff9d0d3f65f5e18f472b6d813 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 30 Apr 2023 10:38:46 +1200 Subject: [PATCH 06/11] Detect copy_file_range() function. --- configure | 2 +- configure.ac | 1 + meson.build | 1 + src/include/pg_config.h.in | 3 +++ src/tools/msvc/Solution.pm | 1 + 5 files changed, 7 insertions(+), 1 deletion(-) diff --git a/configure b/configure index 47ba18491c..7d351b9614 100755 --- a/configure +++ b/configure @@ -15700,7 +15700,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.ac b/configure.ac index 2b3b1b4dca..ddb82e9433 100644 --- a/configure.ac +++ b/configure.ac @@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` AC_CHECK_FUNCS(m4_normalize([ backtrace_symbols copyfile + copy_file_range getifaddrs getpeerucred inet_pton diff --git a/meson.build b/meson.build index 096044628c..c06e4f9290 100644 --- a/meson.build +++ b/meson.build @@ -2404,6 +2404,7 @@ func_checks = [ ['backtrace_symbols', {'dependencies': [execinfo_dep]}], ['clock_gettime', {'dependencies': [rt_dep, posix4_dep], 'define': false}], ['copyfile'], + ['copy_file_range'], # gcc/clang's sanitizer helper library provides dlopen but not dlsym, thus # when enabling asan the dlopen check doesn't notice that -ldl is actually # required. Just checking for dlsym() ought to suffice. diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 6d572c3820..0b26836f68 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -85,6 +85,9 @@ /* Define to 1 if you have the <copyfile.h> header file. */ #undef HAVE_COPYFILE_H +/* Define to 1 if you have the `copy_file_range' function. */ +#undef HAVE_COPY_FILE_RANGE + /* Define to 1 if you have the <crtdefs.h> header file. */ #undef HAVE_CRTDEFS_H diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index ef10cda576..671d958af7 100644 --- a/src/tools/msvc/Solution.pm +++ b/src/tools/msvc/Solution.pm @@ -230,6 +230,7 @@ sub GenerateFiles HAVE_COMPUTED_GOTO => undef, HAVE_COPYFILE => undef, HAVE_COPYFILE_H => undef, + HAVE_COPY_FILE_RANGE => undef, HAVE_CRTDEFS_H => undef, HAVE_CRYPTO_LOCK => undef, HAVE_DECL_FDATASYNC => 0, -- 2.40.1
From d89cbae1851627be4e146efedc92ba9d0a67ad6a Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 30 Apr 2023 11:10:08 +1200 Subject: [PATCH 07/11] Use copy_file_range() to implement copy_file(). If copy_file_range() is available, use it to implement copy_file(), so that the operating system has opportunities for efficient copying, block cloning and pushdown. This affects the commands CREATE DATABASE STRATEGY=FILE_COPY and ALTER TABLE SET TABLESPACE, which perform bulk file copies. On older Linux systems, copy_file_range() might fail with EXDEV, so we look out for that and fall back to the traditional read/write loop. XXX Should we also let the user opt out? --- doc/src/sgml/monitoring.sgml | 4 ++ src/backend/storage/file/copydir.c | 94 +++++++++++++++++++------ src/backend/utils/activity/wait_event.c | 3 + src/include/utils/wait_event.h | 1 + 4 files changed, 82 insertions(+), 20 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 99f7f95c39..2161b32b17 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -1317,6 +1317,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser <entry>Waiting for a write to update the <filename>pg_control</filename> file.</entry> </row> + <row> + <entry><literal>CopyFileRange</literal></entry> + <entry>Waiting for range to be copied during a file copy operation.</entry> + </row> <row> <entry><literal>CopyFileRead</literal></entry> <entry>Waiting for a read during a file copy operation.</entry> diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 82f77536b4..497d357d8c 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -126,6 +126,14 @@ copy_file(const char *fromfile, const char *tofile) /* Size of copy buffer (read and write requests) */ #define COPY_BUF_SIZE (8 * BLCKSZ) + /* + * Size of ranges when using copy_file_range(). We could in theory just + * use the whole file size, but we want to check for interrupts + * periodically while copying. We don't want to make it too small though, + * to give the operating system the chance to clone large extents. + */ +#define COPY_FILE_RANGE_CHUNK_SIZE (1024 * 1024) + /* * Size of data flush requests. It seems beneficial on most platforms to * do this every 1MB or so. But macOS, at least with early releases of @@ -138,8 +146,13 @@ copy_file(const char *fromfile, const char *tofile) #define FLUSH_DISTANCE (1024 * 1024) #endif +#ifdef HAVE_COPY_FILE_RANGE + /* Don't allocate the buffer unless we have to fall back to read/write. 
*/ + buffer = NULL; +#else /* Use palloc to ensure we get a maxaligned buffer */ buffer = palloc(COPY_BUF_SIZE); +#endif /* * Open the files @@ -176,27 +189,67 @@ copy_file(const char *fromfile, const char *tofile) flush_offset = offset; } - pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ); - nbytes = read(srcfd, buffer, COPY_BUF_SIZE); - pgstat_report_wait_end(); - if (nbytes < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not read file \"%s\": %m", fromfile))); - if (nbytes == 0) - break; - errno = 0; - pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE); - if ((int) write(dstfd, buffer, nbytes) != nbytes) + nbytes = 0; /* silence compiler */ + +#ifdef HAVE_COPY_FILE_RANGE + if (buffer == NULL) + { + pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_RANGE); + nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, + COPY_FILE_RANGE_CHUNK_SIZE, 0); + pgstat_report_wait_end(); + + if (nbytes < 0) + { + if (errno == EXDEV) + { + /* + * Linux < 5.3 fails like this for cross-filesystem copies. + * Allocate the buffer to fall back to read/write mode. + */ + buffer = palloc(COPY_BUF_SIZE); + } + else + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not copy to file \"%s\": %m", tofile))); + } + } +#endif + + if (buffer) { - /* if write didn't set errno, assume problem is no disk space */ - if (errno == 0) - errno = ENOSPC; - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not write to file \"%s\": %m", tofile))); + pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ); + nbytes = read(srcfd, buffer, COPY_BUF_SIZE); + pgstat_report_wait_end(); + + if (nbytes < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read file \"%s\": %m", fromfile))); + + if (nbytes > 0) + { + errno = 0; + pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE); + if ((int) write(dstfd, buffer, nbytes) != nbytes) + { + /* + * If write didn't set errno, assume problem is no disk + * space. + */ + if (errno == 0) + errno = ENOSPC; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", tofile))); + } + pgstat_report_wait_end(); + } } - pgstat_report_wait_end(); + + if (nbytes == 0) + break; } if (offset > flush_offset) @@ -212,5 +265,6 @@ copy_file(const char *fromfile, const char *tofile) (errcode_for_file_access(), errmsg("could not close file \"%s\": %m", fromfile))); - pfree(buffer); + if (buffer) + pfree(buffer); } diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c index 7940d64639..9c3cd088c0 100644 --- a/src/backend/utils/activity/wait_event.c +++ b/src/backend/utils/activity/wait_event.c @@ -567,6 +567,9 @@ pgstat_get_wait_io(WaitEventIO w) case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE: event_name = "ControlFileWriteUpdate"; break; + case WAIT_EVENT_COPY_FILE_RANGE: + event_name = "CopyFileRange"; + break; case WAIT_EVENT_COPY_FILE_READ: event_name = "CopyFileRead"; break; diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h index 518d3b0a1f..517de1544b 100644 --- a/src/include/utils/wait_event.h +++ b/src/include/utils/wait_event.h @@ -172,6 +172,7 @@ typedef enum WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE, WAIT_EVENT_CONTROL_FILE_WRITE, WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE, + WAIT_EVENT_COPY_FILE_RANGE, WAIT_EVENT_COPY_FILE_READ, WAIT_EVENT_COPY_FILE_WRITE, WAIT_EVENT_DATA_FILE_EXTEND, -- 2.40.1
From f83a0a9f80614e18b780e7636e5c2e567b2f701e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 30 Apr 2023 15:36:20 +1200
Subject: [PATCH 08/11] Teach copy_file() to concatenate segmented files.

This means that relations are automatically converted to large file
format during CREATE DATABASE ... STRATEGY=FILE_COPY and ALTER TABLE ...
SET TABLESPACE operations.
---
 src/backend/storage/file/copydir.c | 43 +++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 497d357d8c..0b472f1ac2 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -71,7 +71,19 @@ copydir(const char *fromdir, const char *todir, bool recurse)
 				copydir(fromfile, tofile, true);
 		}
 		else if (xlde_type == PGFILETYPE_REG)
+		{
+			const char *s;
+
+			/*
+			 * Skip legacy segment files ending in ".N".  copy_file() will
+			 * deal with those.
+			 */
+			s = strrchr(fromfile, '.');
+			if (s && strspn(s + 1, "0123456789") == strlen(s + 1))
+				continue;
+
 			copy_file(fromfile, tofile);
+		}
 	}
 	FreeDir(xldir);
 
@@ -117,6 +129,7 @@ void
 copy_file(const char *fromfile, const char *tofile)
 {
 	char	   *buffer;
+	int			segno;
 	int			srcfd;
 	int			dstfd;
 	int			nbytes;
@@ -154,6 +167,8 @@ copy_file(const char *fromfile, const char *tofile)
 	buffer = palloc(COPY_BUF_SIZE);
 #endif
 
+	segno = 0;
+
 	/*
 	 * Open the files
 	 */
@@ -248,8 +263,34 @@ copy_file(const char *fromfile, const char *tofile)
 			}
 		}
 
+		/*
+		 * If we ran out of source data on the expected boundary of a legacy
+		 * relation file segment, try opening the next segment.
+		 */
 		if (nbytes == 0)
-			break;
+		{
+			char		nextpath[MAXPGPATH];
+			int			nextfd;
+
+			if (offset % (RELSEG_SIZE * BLCKSZ) != 0)
+				break;
+
+			snprintf(nextpath, sizeof(nextpath), "%s.%d", fromfile, ++segno);
+			nextfd = OpenTransientFile(nextpath, O_RDONLY | PG_BINARY);
+			if (nextfd < 0)
+			{
+				if (errno == ENOENT)
+					break;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not open file \"%s\": %m", nextpath)));
+			}
+			if (CloseTransientFile(srcfd) != 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", fromfile)));
+			srcfd = nextfd;
+		}
 	}
 
 	if (offset > flush_offset)
-- 
2.40.1
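(Another reviewer aside: the filename test added to copydir() above,
distilled into a predicate.  One extra nicety here that the patch as
posted doesn't have, flagged as my own suggestion: requiring at least one
digit after the dot, so a name ending in a bare "." can't be
misclassified as a segment.)

    #include <stdbool.h>
    #include <string.h>

    /*
     * True if "name" looks like a legacy segment file such as "16384.1",
     * i.e. it ends in "." followed by one or more decimal digits.
     */
    static bool
    is_legacy_segment_name(const char *name)
    {
        const char *s = strrchr(name, '.');

        return s != NULL && s[1] != '\0' &&
            strspn(s + 1, "0123456789") == strlen(s + 1);
    }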
From b435220922d7cd916f1b7acce313c8174738991c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 30 Apr 2023 14:45:45 +1200
Subject: [PATCH 09/11] Use copy_file_range() in pg_upgrade.

This gives the kernel the opportunity to copy or clone efficiently.  We
watch out for EXDEV and fall back to read/write for old Linux kernels.

XXX Should we also let the user opt out?
---
 src/bin/pg_upgrade/file.c | 65 ++++++++++++++++++++++++++++++---------
 1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index d173602882..836b2bbbd0 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -9,6 +9,7 @@
 
 #include "postgres_fe.h"
 
+#include <limits.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #ifdef HAVE_COPYFILE_H
@@ -98,32 +99,68 @@ copyFile(const char *src, const char *dst,
 	/* copy in fairly large chunks for best efficiency */
 #define COPY_BUF_SIZE (50 * BLCKSZ)
 
+#ifdef HAVE_COPY_FILE_RANGE
+	buffer = NULL;
+#else
 	buffer = (char *) pg_malloc(COPY_BUF_SIZE);
+#endif
 
 	/* perform data copying i.e read src source, write to destination */
 	while (true)
 	{
-		ssize_t		nbytes = read(src_fd, buffer, COPY_BUF_SIZE);
+		ssize_t		nbytes = 0;
 
-		if (nbytes < 0)
-			pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
-					 schemaName, relName, src, strerror(errno));
+#ifdef HAVE_COPY_FILE_RANGE
+		if (buffer == NULL)
+		{
+			nbytes = copy_file_range(src_fd, NULL, dest_fd, NULL, SSIZE_MAX, 0);
+			if (nbytes < 0)
+			{
+				if (errno == EXDEV)
+				{
+					/* Linux < 5.3 might fail.  Fall back to read/write. */
+					buffer = (char *) pg_malloc(COPY_BUF_SIZE);
+				}
+				else
+				{
+					pg_fatal("error while copying relation \"%s.%s\": could not copy file \"%s\": %s",
+							 schemaName, relName, src, strerror(errno));
+				}
+			}
+		}
+#endif
 
-		if (nbytes == 0)
-			break;
+		if (buffer)
+		{
+			nbytes = read(src_fd, buffer, COPY_BUF_SIZE);
 
-		errno = 0;
-		if (write(dest_fd, buffer, nbytes) != nbytes)
-		{
-			/* if write didn't set errno, assume problem is no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			pg_fatal("error while copying relation \"%s.%s\": could not write file \"%s\": %s",
-					 schemaName, relName, dst, strerror(errno));
+			if (nbytes < 0)
+				pg_fatal("error while copying relation \"%s.%s\": could not read file \"%s\": %s",
+						 schemaName, relName, src, strerror(errno));
+
+			if (nbytes > 0)
+			{
+				errno = 0;
+				if (write(dest_fd, buffer, nbytes) != nbytes)
+				{
+					/*
+					 * If write didn't set errno, assume problem is no disk
+					 * space.
+					 */
+					if (errno == 0)
+						errno = ENOSPC;
+					pg_fatal("error while copying relation \"%s.%s\": could not write file \"%s\": %s",
+							 schemaName, relName, dst, strerror(errno));
+				}
+			}
 		}
+
+		if (nbytes == 0)
+			break;
 	}
 
-	pg_free(buffer);
+	if (buffer)
+		pg_free(buffer);
 
 	close(src_fd);
 	close(dest_fd);
-- 
2.40.1
From 8683941485516e594174f8cb04d437962e4698f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 30 Apr 2023 16:05:46 +1200
Subject: [PATCH 10/11] Teach pg_upgrade to concatenate segmented files.

When using copy mode, segmented relation forks are automatically
concatenated into modern large format.  When using hard link or clone
mode, segment files continue to exist in the destination cluster.

We lose the ability to use the Windows CopyFile() optimization, because
it doesn't support concatenation.

XXX Could be restored as a way of copying segment 0.
XXX Allow user to opt out of concatenation for copy mode too?
---
 src/bin/pg_upgrade/file.c          | 40 ++++++++++++++++++++----------
 src/bin/pg_upgrade/relfilenumber.c |  4 +++
 2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 836b2bbbd0..b4e991f95d 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -82,10 +82,11 @@ void
 copyFile(const char *src, const char *dst,
 		 const char *schemaName, const char *relName)
 {
-#ifndef WIN32
 	int			src_fd;
 	int			dest_fd;
 	char	   *buffer;
+	pgoff_t		total_bytes = 0;
+	int			segno = 0;
 
 	if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
 		pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s",
@@ -155,25 +156,38 @@ copyFile(const char *src, const char *dst,
 			}
 		}
 
+		total_bytes += nbytes;
+
 		if (nbytes == 0)
-			break;
+		{
+			char		next_path[MAXPGPATH];
+			int			next_fd;
+
+			/* If not at a segment boundary size, this must be the end. */
+			if (total_bytes % (RELSEG_SIZE * BLCKSZ) != 0)
+				break;
+
+			/* Is there another segment? */
+			snprintf(next_path, sizeof(next_path), "%s.%d", src, ++segno);
+			next_fd = open(next_path, O_RDONLY | PG_BINARY, 0);
+			if (next_fd < 0)
+			{
+				if (errno == ENOENT)
+					break;
+				pg_fatal("error while copying relation \"%s.%s\": could not open file \"%s\": %s",
+						 schemaName, relName, next_path, strerror(errno));
+			}
+
+			/* Yes.  Start copying from that one. */
+			close(src_fd);
+			src_fd = next_fd;
+		}
 	}
 
 	if (buffer)
 		pg_free(buffer);
 
 	close(src_fd);
 	close(dest_fd);
-
-#else							/* WIN32 */
-
-	if (CopyFile(src, dst, true) == 0)
-	{
-		_dosmaperr(GetLastError());
-		pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s",
-				 schemaName, relName, src, dst, strerror(errno));
-	}
-
-#endif							/* WIN32 */
 }
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 34bc9c1504..ea2abfb00f 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -185,6 +185,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 	 */
 	for (segno = 0;; segno++)
 	{
+		/* Copy mode knows how to find higher numbered segments itself. */
+		if (user_opts.transfer_mode == TRANSFER_MODE_COPY && segno > 0)
+			break;
+
 		if (segno == 0)
 			extent_suffix[0] = '\0';
 		else
-- 
2.40.1
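(On the first XXX above: one possible shape for restoring the Windows
fast path, sketched under the assumption that probing for "<src>.1" is a
good enough test for "nothing to concatenate".  GetFileAttributes() is
the stock Win32 existence test; the CopyFile() call and error handling
are lifted from the code this patch removes.  Not part of the patch.)

    #ifdef WIN32
        char        seg1_path[MAXPGPATH];

        /*
         * If there is no second segment, no concatenation is needed, so the
         * old CopyFile() fast path is still safe.
         */
        snprintf(seg1_path, sizeof(seg1_path), "%s.1", src);
        if (GetFileAttributes(seg1_path) == INVALID_FILE_ATTRIBUTES)
        {
            if (CopyFile(src, dst, true) == 0)
            {
                _dosmaperr(GetLastError());
                pg_fatal("error while copying relation \"%s.%s\" (\"%s\" to \"%s\"): %s",
                         schemaName, relName, src, dst, strerror(errno));
            }
            return;
        }
        /* Otherwise fall through to the concatenating copy loop. */
    #endif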
From fc3316b064486d5c15009fc98771a0686914609a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Tue, 2 May 2023 11:15:10 +1200
Subject: [PATCH 11/11] Teach basebackup to concatenate segmented files.

Since basebackups have to read and write all relations, they have an
opportunity to convert to large file format on the fly.  Take it.

XXX There may be some bugs hiding in here when sizeof(ssize_t) <
sizeof(pgoff_t)?
---
 src/backend/backup/basebackup.c | 92 +++++++++++++++++++++++++--------
 1 file changed, 71 insertions(+), 21 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 2dcc04fef2..e2534895eb 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1339,6 +1339,17 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 				continue;		/* don't recurse into pg_wal */
 		}
 
+		/*
+		 * Skip relation segment files because sendFile() will find them when
+		 * called for the initial segment.
+		 */
+		if (isDbDir)
+		{
+			const char *s = strrchr(de->d_name, '.');
+
+			if (s && strspn(s + 1, "0123456789") == strlen(s + 1))
+				continue;
+		}
+
 		/* Allow symbolic links in pg_tblspc only */
 		if (strcmp(path, "./pg_tblspc") == 0 && S_ISLNK(statbuf.st_mode))
 		{
@@ -1476,6 +1487,10 @@ is_checksummed_file(const char *fullpath, const char *filename)
  * If dboid is anything other than InvalidOid then any checksum failures
  * detected will get reported to the cumulative stats system.
  *
+ * If the file is multi-segmented, the segments are concatenated and sent as
+ * one file.  On return, statbuf->st_size contains the complete size of the
+ * single sent file.
+ *
  * Returns true if the file was successfully sent, false if 'missing_ok',
 * and the file did not exist.
  */
@@ -1495,10 +1510,34 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	char	   *page;
 	PageHeader	phdr;
 	int			segmentno = 0;
-	char	   *segmentpath;
+	int			nsegments = 1;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/*
+	 * This function is only called for the head segment of segmented files,
+	 * but we want to concatenate it on the fly into a large file.  If we
+	 * have reached a segment boundary, we'll try to open the next segment.
+	 * We count the segments and sum their sizes into statbuf->st_size.
+	 */
+	while (statbuf->st_size == (pgoff_t) nsegments * RELSEG_SIZE * BLCKSZ)
+	{
+		char		nextpath[MAXPGPATH];
+		struct stat nextstat;
+
+		snprintf(nextpath, sizeof(nextpath), "%s.%d", readfilename, nsegments);
+		if (lstat(nextpath, &nextstat) < 0)
+		{
+			if (errno == ENOENT)
+				break;
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not stat file \"%s\": %m", nextpath)));
+		}
+		++nsegments;			/* count segment */
+		statbuf->st_size += nextstat.st_size;	/* sum size */
+	}
+
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 readfilename);
@@ -1527,23 +1566,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	filename = last_dir_separator(readfilename) + 1;
 
 	if (is_checksummed_file(readfilename, filename))
-	{
 		verify_checksum = true;
-
-		/*
-		 * Cut off at the segment boundary (".") to get the segment number
-		 * in order to mix it into the checksum.
-		 */
-		segmentpath = strstr(filename, ".");
-		if (segmentpath != NULL)
-		{
-			segmentno = atoi(segmentpath + 1);
-			if (segmentno == 0)
-				ereport(ERROR,
-						(errmsg("invalid segment number %d in file \"%s\"",
-								segmentno, filename)));
-		}
-	}
 
 	/*
@@ -1554,7 +1577,7 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	 */
 	while (len < statbuf->st_size)
 	{
-		size_t		remaining = statbuf->st_size - len;
+		pgoff_t		remaining = statbuf->st_size - len;
 
 		/* Try to read some more data. */
 		cnt = basebackup_read_file(fd, sink->bbs_buffer,
@@ -1676,10 +1699,37 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 		/*
 		 * If we hit end-of-file, a concurrent truncation must have occurred.
 		 * That's not an error condition, because WAL replay will fix things
-		 * up.
+		 * up.  It might also mean that we need to move to the next input
+		 * segment.
 		 */
 		if (cnt == 0)
+		{
+			/* Are we at the end of a segment?  Try to open the next one. */
+			if (len == ((pgoff_t) segmentno + 1) * RELSEG_SIZE * BLCKSZ)
+			{
+				char		nextpath[MAXPGPATH];
+				int			nextfd;
+
+				/* Try to open the next segment. */
+				snprintf(nextpath, sizeof(nextpath), "%s.%d", readfilename,
+						 segmentno + 1);
+				nextfd = OpenTransientFile(nextpath, O_RDONLY | PG_BINARY);
+				if (nextfd < 0)
+				{
+					if (errno == ENOENT)
+						break;
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not open file \"%s\": %m", nextpath)));
+				}
+
+				CloseTransientFile(fd);
+				fd = nextfd;
+				++segmentno;
+				continue;
+			}
+
+			/* Otherwise we're at the end of input data. */
 			break;
+		}
 
 		/* Archive the data we just read. */
 		bbsink_archive_contents(sink, cnt);
@@ -1695,8 +1745,8 @@ sendFile(bbsink *sink, const char *readfilename, const char *tarfilename,
 	/* If the file was truncated while we were sending it, pad it with zeros */
 	while (len < statbuf->st_size)
 	{
-		size_t		remaining = statbuf->st_size - len;
-		size_t		nbytes = Min(sink->bbs_buffer_length, remaining);
+		pgoff_t		remaining = statbuf->st_size - len;
+		pgoff_t		nbytes = Min(sink->bbs_buffer_length, remaining);
 
 		MemSet(sink->bbs_buffer, 0, nbytes);
 		if (pg_checksum_update(&checksum_ctx,
-- 
2.40.1

(Editor's note on the hunk above, for anyone diffing against the posted
version: the original opened readfilename instead of the freshly built
nextpath when advancing to the next segment, leaving nextpath unused and
uninitialized in the error message, and closed the transient fd with
plain close(); both are fixed here.)
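(On that XXX: the trap is intermediate arithmetic like segmentno *
RELSEG_SIZE * BLCKSZ being evaluated in 32-bit int/ssize_t before being
widened.  A stand-alone illustration, with a stand-in pgoff_t typedef and
the default 8kB block / 1GB segment geometry assumed:)

    #include <stdint.h>

    #define BLCKSZ 8192
    #define RELSEG_SIZE 131072      /* blocks per 1GB segment */

    typedef int64_t pgoff_t;        /* stand-in for the real typedef */

    pgoff_t
    segment_start(int segmentno)
    {
        /*
         * RELSEG_SIZE * BLCKSZ is exactly 2^30, so without the cast the
         * product overflows 32-bit int arithmetic for segmentno >= 2.
         */
        return (pgoff_t) segmentno * RELSEG_SIZE * BLCKSZ;
    }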