On 3/23/24 14:47, Tomas Vondra wrote:
> On 3/23/24 13:38, Robert Haas wrote:
>> On Fri, Mar 22, 2024 at 8:26 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
>>> Hmm, this discussion seems to assume that we only use
>>> copy_file_range() to copy/clone whole segment files, right? That's
>>> great and may even get most of the available benefit given typical
>>> databases with many segments of old data that never changes, but... I
>>> think copy_write_range() allows us to go further than the other
>>> whole-file clone techniques: we can stitch together parts of an old
>>> backup segment file and an incremental backup to create a new file.
>>> If you're interested in minimising disk use while also removing
>>> dependencies on the preceding chain of backups, then it might make
>>> sense to do that even if you *also* have to read the data to compute
>>> the checksums, I think? That's why I mentioned it: if
>>> copy_file_range() (ie sub-file-level block sharing) is a solution in
>>> search of a problem, has the world ever seen a better problem than
>>> pg_combinebackup?
>>
>> That makes sense; it's just a different part of the code than I
>> thought we were talking about.
>>
>
> Yeah, that's in write_reconstructed_file() and the patch does not touch
> that at all. I agree it would be nice to use copy_file_range() in this
> part too, and it doesn't seem it'd be that hard to do, I think.
>
> It seems we'd just need a "fork" that either calls pread/pwrite or
> copy_file_range, depending on checksums and what was requested.
>
Here's a patch to use copy_file_range in write_reconstructed_file too,
when requested/possible. One thing that I'm not sure about is whether to
do pg_fatal() if --copy-file-range but the platform does not support it.
This is more like what pg_upgrade does, but maybe we should just ignore
what the user requested and fallback to the regular copy (a bit like
when having to do a checksum for some files). Or maybe the check should
just happen earlier ...
I've been thinking about what Thomas wrote - that maybe it'd be good to
do copy_file_range() even when calculating the checksum, and I think he
may be right. But the current patch does not do that, and while it
doesn't seem very difficult to do (at least when reconstructing the file
from incremental backups) I don't have a very good intuition whether
it'd be a win or not in typical cases.
I have a naive question about the checksumming - if we used a
merkle-tree-like scheme, i.e. hashing blocks and not hashes of blocks,
wouldn't that allow calculating the hashes even without having to read
the blocks, making copy_file_range more efficient? Sure, it's more
complex, but a well known scheme. (OK, now I realized it'd mean we can't
use tools like sha224sum to hash the files and compare to manifest. I
guess that's why we don't do this ...)
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From d2b717d14638632c25d0e6919f5cd40333e9ebd0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sat, 23 Mar 2024 18:26:21 +0100
Subject: [PATCH v20240323-2 2/2] write_reconstructed_file
---
src/bin/pg_combinebackup/reconstruct.c | 32 +++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index f5c7af8a23c..4de92894bed 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -59,7 +59,8 @@ static void write_reconstructed_file(char *input_filename,
off_t *offsetmap,
pg_checksum_context *checksum_ctx,
bool debug,
- bool dry_run);
+ bool dry_run,
+ CopyMode copy_mode);
static void read_bytes(rfile *rf, void *buffer, unsigned length);
/*
@@ -325,7 +326,8 @@ reconstruct_from_incremental_file(char *input_filename,
{
write_reconstructed_file(input_filename, output_filename,
block_length, sourcemap, offsetmap,
- &checksum_ctx, debug, dry_run);
+ &checksum_ctx, debug, dry_run,
+ copy_method);
debug_reconstruction(n_prior_backups + 1, source, dry_run);
}
@@ -528,7 +530,8 @@ write_reconstructed_file(char *input_filename,
off_t *offsetmap,
pg_checksum_context *checksum_ctx,
bool debug,
- bool dry_run)
+ bool dry_run,
+ CopyMode copy_mode)
{
int wfd = -1;
unsigned i;
@@ -630,6 +633,29 @@ write_reconstructed_file(char *input_filename,
if (dry_run)
continue;
+ /*
+ * If requested, copy the block using copy_file_range.
+ *
+ * We can'd do this if the block needs to be zero-filled or when we
+ * need to update checksum.
+ */
+ if ((copy_mode == COPY_MODE_COPY_FILE_RANGE) &&
+ (s != NULL) && (checksum_ctx->type == CHECKSUM_TYPE_NONE))
+ {
+#if defined(HAVE_COPY_FILE_RANGE)
+ do
+ {
+ wb = copy_file_range(s->fd, &offsetmap[i], wfd, NULL, BLCKSZ, 0);
+ if (wb < 0)
+ pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
+ input_filename, output_filename);
+ } while (wb > 0);
+#else
+ pg_fatal("copy_file_range not supported on this platform");
+#endif
+ continue;
+ }
+
/* Read or zero-fill the block as appropriate. */
if (s == NULL)
{
--
2.44.0
From 558321b7ee10fa3902aaed1d1a08857865a232bb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Tue, 19 Mar 2024 15:16:29 +0100
Subject: [PATCH v20240323-2 1/2] pg_combinebackup - allow using
clone/copy_file_range
---
doc/src/sgml/ref/pg_combinebackup.sgml | 34 +++++
src/bin/pg_combinebackup/copy_file.c | 156 +++++++++++++++-----
src/bin/pg_combinebackup/copy_file.h | 18 ++-
src/bin/pg_combinebackup/pg_combinebackup.c | 18 ++-
src/bin/pg_combinebackup/reconstruct.c | 5 +-
src/bin/pg_combinebackup/reconstruct.h | 5 +-
6 files changed, 192 insertions(+), 44 deletions(-)
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
index 8a0a600c2b2..60a60e3fae6 100644
--- a/doc/src/sgml/ref/pg_combinebackup.sgml
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -185,6 +185,40 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>--clone</option></term>
+ <listitem>
+ <para>
+ Use efficient file cloning (also known as <quote>reflinks</quote> on
+ some systems) instead of copying files to the new cluster. This can
+ result in near-instantaneous copying of the data files, giving the
+ speed advantages of <option>-k</option>/<option>--link</option> while
+ leaving the old cluster untouched.
+ </para>
+
+ <para>
+ File cloning is only supported on some operating systems and file
+ systems. If it is selected but not supported, the
+ <application>pg_combinebackup</application> run will error. At present,
+ it is supported on Linux (kernel 4.5 or later) with Btrfs and XFS (on
+ file systems created with reflink support), and on macOS with APFS.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><option>--copy-file-range</option></term>
+ <listitem>
+ <para>
+ Use the <function>copy_file_range</function> system call for efficient
+ copying. On some file systems this gives results similar to
+ <option>--clone</option>, sharing physical disk blocks, while on others
+ it may still copy blocks, but do so via an optimized path. At present,
+ it is supported on Linux and FreeBSD.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-V</option></term>
<term><option>--version</option></term>
diff --git a/src/bin/pg_combinebackup/copy_file.c b/src/bin/pg_combinebackup/copy_file.c
index e6d2423278a..0874679e53a 100644
--- a/src/bin/pg_combinebackup/copy_file.c
+++ b/src/bin/pg_combinebackup/copy_file.c
@@ -14,6 +14,7 @@
#include <copyfile.h>
#endif
#include <fcntl.h>
+#include <limits.h>
#include <sys/stat.h>
#include <unistd.h>
@@ -24,6 +25,10 @@
static void copy_file_blocks(const char *src, const char *dst,
pg_checksum_context *checksum_ctx);
+static void copy_file_clone(const char *src, const char *dst);
+
+static void copy_file_by_range(const char *src, const char *dst);
+
#ifdef WIN32
static void copy_file_copyfile(const char *src, const char *dst);
#endif
@@ -35,7 +40,8 @@ static void copy_file_copyfile(const char *src, const char *dst);
*/
void
copy_file(const char *src, const char *dst,
- pg_checksum_context *checksum_ctx, bool dry_run)
+ pg_checksum_context *checksum_ctx, bool dry_run,
+ CopyMode copy_mode)
{
/*
* In dry-run mode, we don't actually copy anything, nor do we read any
@@ -54,55 +60,68 @@ copy_file(const char *src, const char *dst,
/*
* If we don't need to compute a checksum, then we can use any special
* operating system primitives that we know about to copy the file; this
- * may be quicker than a naive block copy.
+ * may be quicker than a naive block copy. We only do this for WIN32.
+ * On other operating systems the user has to explicitly specify one of
+ * the available primitives - there may be multiple, we don't know which
+ * are reliable/preferred.
+ *
+ * On the other hand, if we need to compute a checksum, but the user
+ * requested a special copy method that does not support this, fallback
+ * to the default block-by-block copy. We don't want to fail if just
+ * one of many files requires checksum, etc.
*/
if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
{
- char *strategy_name = NULL;
- void (*strategy_implementation) (const char *, const char *) = NULL;
-
#ifdef WIN32
- strategy_name = "CopyFile";
- strategy_implementation = copy_file_copyfile;
+ copy_mode = COPY_MODE_COPYFILE;
#endif
-
- if (strategy_name != NULL)
- {
- if (dry_run)
- pg_log_debug("would copy \"%s\" to \"%s\" using strategy %s",
- src, dst, strategy_name);
- else
- {
- pg_log_debug("copying \"%s\" to \"%s\" using strategy %s",
- src, dst, strategy_name);
- (*strategy_implementation) (src, dst);
- }
- return;
- }
+ }
+ else
+ {
+ /* fallback to block-by-block copy */
+ copy_mode = COPY_MODE_COPY;
}
- /*
- * Fall back to the simple approach of reading and writing all the blocks,
- * feeding them into the checksum context as we go.
- */
if (dry_run)
{
- if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
- pg_log_debug("would copy \"%s\" to \"%s\"",
- src, dst);
- else
- pg_log_debug("would copy \"%s\" to \"%s\" and checksum with %s",
- src, dst, pg_checksum_type_name(checksum_ctx->type));
+ switch (copy_mode)
+ {
+ case COPY_MODE_CLONE:
+ pg_log_debug("would copy \"%s\" to \"%s\" (clone)", src, dst);
+ break;
+ case COPY_MODE_COPY:
+ pg_log_debug("would copy \"%s\" to \"%s\"", src, dst);
+ break;
+ case COPY_MODE_COPY_FILE_RANGE:
+ pg_log_debug("would copy \"%s\" to \"%s\" (copy_file_range)",
+ src, dst);
+#ifdef WIN32
+ case COPY_MODE_COPYFILE:
+ pg_log_debug("would copy \"%s\" to \"%s\" (copyfile)",
+ src, dst);
+ break;
+#endif
+ }
}
else
{
- if (checksum_ctx->type == CHECKSUM_TYPE_NONE)
- pg_log_debug("copying \"%s\" to \"%s\"",
- src, dst);
- else
- pg_log_debug("copying \"%s\" to \"%s\" and checksumming with %s",
- src, dst, pg_checksum_type_name(checksum_ctx->type));
- copy_file_blocks(src, dst, checksum_ctx);
+ switch (copy_mode)
+ {
+ case COPY_MODE_CLONE:
+ copy_file_clone(src, dst);
+ break;
+ case COPY_MODE_COPY:
+ copy_file_blocks(src, dst, checksum_ctx);
+ break;
+ case COPY_MODE_COPY_FILE_RANGE:
+ copy_file_by_range(src, dst);
+ break;
+#ifdef WIN32
+ case COPY_MODE_COPYFILE:
+ copy_file_copyfile(src, dst);
+ break;
+#endif
+ }
}
}
@@ -156,6 +175,67 @@ copy_file_blocks(const char *src, const char *dst,
close(dest_fd);
}
+static void
+copy_file_clone(const char *src, const char *dest)
+{
+#if defined(HAVE_COPYFILE) && defined(COPYFILE_CLONE_FORCE)
+ if (copyfile(src, dest, NULL, COPYFILE_CLONE_FORCE) < 0)
+ pg_fatal("error while cloning file \"%s\" to \"%s\": %m", src, dest);
+#elif defined(__linux__) && defined(FICLONE)
+ {
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dest, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not create file \"%s\": %m", dest);
+
+ if (ioctl(dest_fd, FICLONE, src_fd) < 0)
+ {
+ int save_errno = errno;
+
+ unlink(dest);
+
+ pg_fatal("error while cloning file \"%s\" to \"%s\": %s",
+ src, dest);
+ }
+ }
+#else
+ pg_fatal("file cloning not supported on this platform");
+#endif
+}
+
+static void
+copy_file_by_range(const char *src, const char *dest)
+{
+#if defined(HAVE_COPY_FILE_RANGE)
+ int src_fd;
+ int dest_fd;
+ ssize_t nbytes;
+
+ if ((src_fd = open(src, O_RDONLY | PG_BINARY, 0)) < 0)
+ pg_fatal("could not open file \"%s\": %m", src);
+
+ if ((dest_fd = open(dest, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+ pg_file_create_mode)) < 0)
+ pg_fatal("could not create file \"%s\": %m", dest);
+
+ do
+ {
+ nbytes = copy_file_range(src_fd, NULL, dest_fd, NULL, SSIZE_MAX, 0);
+ if (nbytes < 0)
+ pg_fatal("error while copying file range from \"%s\" to \"%s\": %m",
+ src, dest);
+ } while (nbytes > 0);
+
+ close(src_fd);
+ close(dest_fd);
+#else
+ pg_fatal("copy_file_range not supported on this platform");
+#endif
+}
+
+/* XXX maybe this should do the check internally, same as the other functions? */
#ifdef WIN32
static void
copy_file_copyfile(const char *src, const char *dst)
diff --git a/src/bin/pg_combinebackup/copy_file.h b/src/bin/pg_combinebackup/copy_file.h
index 0f6bc09403f..3a1c5eb764f 100644
--- a/src/bin/pg_combinebackup/copy_file.h
+++ b/src/bin/pg_combinebackup/copy_file.h
@@ -11,9 +11,25 @@
#ifndef COPY_FILE_H
#define COPY_FILE_H
+#include "c.h"
#include "common/checksum_helper.h"
+#include "common/file_utils.h"
+
+/*
+ * Enumeration to denote copy modes
+ */
+typedef enum CopyMode
+{
+ COPY_MODE_CLONE,
+ COPY_MODE_COPY,
+ COPY_MODE_COPY_FILE_RANGE,
+#ifdef WIN32
+ COPY_MODE_COPYFILE,
+#endif
+} CopyMode;
extern void copy_file(const char *src, const char *dst,
- pg_checksum_context *checksum_ctx, bool dry_run);
+ pg_checksum_context *checksum_ctx, bool dry_run,
+ CopyMode copy_mode);
#endif /* COPY_FILE_H */
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 74f8be9eeac..b6e1e62e160 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -69,6 +69,7 @@ typedef struct cb_options
pg_checksum_type manifest_checksums;
bool no_manifest;
DataDirSyncMethod sync_method;
+ CopyMode copy_method;
} cb_options;
/*
@@ -129,6 +130,8 @@ main(int argc, char *argv[])
{"manifest-checksums", required_argument, NULL, 1},
{"no-manifest", no_argument, NULL, 2},
{"sync-method", required_argument, NULL, 3},
+ {"clone", no_argument, NULL, 4},
+ {"copy-file-range", no_argument, NULL, 5},
{NULL, 0, NULL, 0}
};
@@ -156,6 +159,7 @@ main(int argc, char *argv[])
memset(&opt, 0, sizeof(opt));
opt.manifest_checksums = CHECKSUM_TYPE_CRC32C;
opt.sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+ opt.copy_method = COPY_MODE_COPY;
/* process command-line options */
while ((c = getopt_long(argc, argv, "dnNPo:T:",
@@ -192,6 +196,12 @@ main(int argc, char *argv[])
if (!parse_sync_method(optarg, &opt.sync_method))
exit(1);
break;
+ case 4:
+ opt.copy_method = COPY_MODE_CLONE;
+ break;
+ case 5:
+ opt.copy_method = COPY_MODE_COPY_FILE_RANGE;
+ break;
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -696,6 +706,8 @@ help(const char *progname)
" use algorithm for manifest checksums\n"));
printf(_(" --no-manifest suppress generation of backup manifest\n"));
printf(_(" --sync-method=METHOD set method for syncing files to disk\n"));
+ printf(_(" --clone clone (reflink) instead of copying files\n"));
+ printf(_(" --copy-file-range copy using copy_file_range() syscall\n"));
printf(_(" -?, --help show this help, then exit\n"));
printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
@@ -937,7 +949,8 @@ process_directory_recursively(Oid tsoid,
&checksum_length,
&checksum_payload,
opt->debug,
- opt->dry_run);
+ opt->dry_run,
+ opt->copy_method);
}
else
{
@@ -993,7 +1006,8 @@ process_directory_recursively(Oid tsoid,
/* Actually copy the file. */
snprintf(ofullpath, MAXPGPATH, "%s/%s", ofulldir, de->d_name);
- copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run);
+ copy_file(ifullpath, ofullpath, &checksum_ctx, opt->dry_run,
+ opt->copy_method);
/*
* If copy_file() performed a checksum calculation for us, then
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 41f06bb26b5..f5c7af8a23c 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -90,7 +90,8 @@ reconstruct_from_incremental_file(char *input_filename,
int *checksum_length,
uint8 **checksum_payload,
bool debug,
- bool dry_run)
+ bool dry_run,
+ CopyMode copy_method)
{
rfile **source;
rfile *latest_source = NULL;
@@ -319,7 +320,7 @@ reconstruct_from_incremental_file(char *input_filename,
*/
if (copy_source != NULL)
copy_file(copy_source->filename, output_filename,
- &checksum_ctx, dry_run);
+ &checksum_ctx, dry_run, copy_method);
else
{
write_reconstructed_file(input_filename, output_filename,
diff --git a/src/bin/pg_combinebackup/reconstruct.h b/src/bin/pg_combinebackup/reconstruct.h
index 8e33a8a95a0..726f94389f3 100644
--- a/src/bin/pg_combinebackup/reconstruct.h
+++ b/src/bin/pg_combinebackup/reconstruct.h
@@ -13,7 +13,9 @@
#ifndef RECONSTRUCT_H
#define RECONSTRUCT_H
+#include "c.h"
#include "common/checksum_helper.h"
+#include "common/file_utils.h"
#include "load_manifest.h"
extern void reconstruct_from_incremental_file(char *input_filename,
@@ -28,6 +30,7 @@ extern void reconstruct_from_incremental_file(char *input_filename,
int *checksum_length,
uint8 **checksum_payload,
bool debug,
- bool dry_run);
+ bool dry_run,
+ CopyMode copy_mode);
#endif
--
2.44.0