On Tue, Apr 16, 2024 at 4:10 AM Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Apr 10, 2024 at 9:55 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > To rescue my initdb --rel-segsize project[1] for v18, I will have a go
> > at making that dynamic.  It looks like we don't actually need to
> > allocate that until we get to the GetFileBackupMethod() call, and at
> > that point we have the file size.  If I understand correctly,
> > statbuf.st_size / BLCKSZ would be enough, so we could embiggen our
> > block number buffer there if required, right?
>
> Yes.

Here is a first attempt at that.  Better factoring welcome.  New
observations made along the way: the current coding can exceed
MaxAllocSize and error out, or overflow a 32-bit size_t and allocate
nonsense.  Do you think it'd be better to raise an error explaining
that, or silently fall back to full backup (tried that way in the
attached), or do that plus a log message?  Another option would be to
use huge allocations, so we'd only have to deal with that sort of
question on 32-bit systems (i.e. effectively hypothetical/non-relevant
scenarios), but I don't love that idea.
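
For concreteness, here's a tiny standalone sketch (not part of the
patch; BLCKSZ and the segment sizes are assumed values, and a cast
simulates a 32-bit size_t) showing how the old sizing arithmetic trips
over both limits:

/*
 * Sketch only.  Assumes BLCKSZ = 8192; the segment sizes are
 * hypothetical values chosen to cross the limits described above.
 */
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192
typedef uint32_t BlockNumber;

int
main(void)
{
	/* 4TB segment: 2^29 blocks, so the array needs 2GB > MaxAllocSize */
	uint64_t	blocks_4tb = (UINT64_C(4) << 40) / BLCKSZ;

	/* 8TB segment: 2^30 blocks; 4 * 2^30 == 2^32 wraps a 32-bit size_t */
	uint64_t	blocks_8tb = (UINT64_C(8) << 40) / BLCKSZ;

	printf("4TB segment: array wants %llu bytes (MaxAllocSize is 1GB - 1)\n",
		   (unsigned long long) (blocks_4tb * sizeof(BlockNumber)));
	printf("8TB segment: a 32-bit size_t computes %lu bytes\n",
		   (unsigned long) (uint32_t) (blocks_8tb * sizeof(BlockNumber)));

	return 0;
}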

> ...
> I do understand that a 1GB segment size is not that big in 2024, and
> that filesystems with a 2GB limit are thought to have died out a long
> time ago, and I'm not against using larger segments. I do think,
> though, that increasing the segment size by 32768x in one shot is
> likely to be overdoing it.

My intuition is that the primary interesting lines to cross are at 2GB
and 4GB, because of data type limits.  I defend against the most basic
problem in my proposal: I don't let you exceed your off_t type, but
that doesn't mean we don't have off_t/ssize_t/size_t/long snafus
lurking in our code that could bite a 32-bit system with large files.
If you make it past those magic numbers and your tools are all happy,
I think you should be home free until you hit file system limits,
which are effectively unhittable on most systems, except for ext4's
already bemoaned 16TB limit, AFAIK.  All the same, I'm contemplating
limiting the range to 1TB in the first version, not out of general
fear of unknown unknowns, but just because it means we don't need to
use "huge" allocations for this known place, at least until we can
address that.
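
To illustrate the off_t guard I mean (a hypothetical sketch; the
function and parameter names are mine, not from the patch set):

/*
 * Reject a requested segment size whose byte count can't be
 * represented in this platform's off_t.  Names are illustrative only.
 */
#include <sys/types.h>
#include <stdbool.h>
#include <stdint.h>

bool
segment_size_fits_off_t(uint64_t segsize_blocks, uint32_t blcksz)
{
	/* Largest value representable in a signed off_t of this width. */
	uint64_t	off_t_max = (UINT64_C(1) << (sizeof(off_t) * 8 - 1)) - 1;

	/* Divide rather than multiply, to avoid overflowing uint64_t. */
	return segsize_blocks <= off_t_max / blcksz;
}

With a 32-bit off_t that caps segments just under 2GB, and with a
64-bit off_t it effectively never fires, which matches the 2GB line
above.
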
From 30efb6d39c83e9d4f338e10e1b8944c64f8799c5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Wed, 15 May 2024 17:23:21 +1200
Subject: [PATCH v1] Limit block number buffer size in incremental backup.

Previously, basebackup.c would allocate an array big enough to hold a
block number for every block in a full-sized md.c segment file.  That
works out to 512kB by default, which should be no problem.  However,
users can change the segment size at compile time, and the required
space for the array could be many gigabytes, posing problems:

1. For segment sizes over 2TB, MaxAllocSize would be exceeded,
   raising an error.
2. For segment sizes over 8TB, size_t arithmetic would overflow on 32-bit
   systems, leading to an allocation of the wrong size.
3. For any very large segment size, it's non-ideal to allocate a huge
   amount of memory if you're not actually going to use it.

This isn't a fundamental fix for the high memory requirement of the
algorithm as coded, but it seems like a good idea to avoid those limits
with a fallback strategy, and defer allocating until we see how big the
segments actually are in the cluster being backed up, upsizing as
required.

These are mostly theoretical problems at the moment, because it is so
hard to change the segment size in practice that people don't do it.  A
new proposal would make it changeable at run time, hence interest in
tidying these rough edges up.

Discussion: https://postgr.es/m/CA%2BhUKG%2B2hZ0sBztPW4mkLfng0qfkNtAHFUfxOMLizJ0BPmi5%2Bg%40mail.gmail.com

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 9a2bf59e84e..0703b77af94 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -34,6 +34,7 @@
 #include "pgstat.h"
 #include "pgtar.h"
 #include "port.h"
+#include "port/pg_bitutils.h"
 #include "postmaster/syslogger.h"
 #include "postmaster/walsummarizer.h"
 #include "replication/walsender.h"
@@ -45,6 +46,7 @@
 #include "storage/reinit.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/resowner.h"
@@ -1198,13 +1200,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 	bool		isGlobalDir = false;
 	Oid			dboid = InvalidOid;
 	BlockNumber *relative_block_numbers = NULL;
-
-	/*
-	 * Since this array is relatively large, avoid putting it on the stack.
-	 * But we don't need it at all if this is not an incremental backup.
-	 */
-	if (ib != NULL)
-		relative_block_numbers = palloc(sizeof(BlockNumber) * RELSEG_SIZE);
+	size_t		relative_block_numbers_size = 0;
 
 	/*
 	 * Determine if the current path is a database directory that can contain
@@ -1477,6 +1473,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 			unsigned	truncation_block_length = 0;
 			char		tarfilenamebuf[MAXPGPATH * 2];
 			char	   *tarfilename = pathbuf + basepathlen + 1;
+			size_t		stat_blocks;
 			FileBackupMethod method = BACK_UP_FILE_FULLY;
 
 			if (ib != NULL && isRelationFile)
@@ -1499,6 +1496,48 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
 					lookup_path = pstrdup(tarfilename);
 				}
 
+				/* If we have a block number array but it's too small, free it. */
+				stat_blocks = statbuf.st_size / BLCKSZ;
+				if (relative_block_numbers &&
+					relative_block_numbers_size < stat_blocks)
+				{
+					pfree(relative_block_numbers);
+					relative_block_numbers = NULL;
+				}
+
+				/*
+				 * Make a new block number buffer if we need one, being
+				 * careful to avoid allocating memory for unexpectedly
+				 * oversized files.  Also avoid exceeding the artificial
+				 * allocation size cap MaxAllocSize.
+				 *
+				 * We could use huge allocations to get around that cap, but
+				 * then we should remember to clamp to SIZE_MAX instead, for
+				 * the sake of 32-bit systems.
+				 *
+				 * If we decline to allocate an array here,
+				 * GetFileBackupMethod will force a full backup.
+				 */
+				if (!relative_block_numbers &&
+					stat_blocks > 0 &&
+					stat_blocks <= RELSEG_SIZE &&
+					stat_blocks * sizeof(BlockNumber) <= MaxAllocSize)
+				{
+					uint64		size;
+
+					/*
+					 * Round up to next power of two, to amortize reallocation
+					 * overheads.  Clamp so we don't exceed the maximum
+					 * potential useful size, or the allocation limit.
+					 */
+					size = pg_nextpower2_64(stat_blocks);
+					size = Min(RELSEG_SIZE, size);
+					size = Min(MaxAllocSize / sizeof(BlockNumber), size);
+
+					relative_block_numbers = palloc(sizeof(BlockNumber) * size);
+					relative_block_numbers_size = size;
+				}
+
 				method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
 											 relfilenumber, relForkNum,
 											 segno, statbuf.st_size,
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 4ad9d7cad00..fe776e2b2de 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -736,7 +736,7 @@ GetIncrementalFilePath(Oid dboid, Oid spcoid, RelFileNumber relfilenumber,
  * an incremental file in the backup instead of the entire file. On return,
  * *num_blocks_required will be set to the number of blocks that need to be
  * sent, and the actual block numbers will have been stored in
- * relative_block_numbers, which should be an array of at least RELSEG_SIZE.
+ * relative_block_numbers, which should hold at least size / BLCKSZ entries.
  * In addition, *truncation_block_length will be set to the value that should
  * be included in the incremental file.
  */
@@ -768,10 +768,12 @@ GetFileBackupMethod(IncrementalBackupInfo *ib, const char *path,
 	Assert(RelFileNumberIsValid(relfilenumber));
 
 	/*
-	 * If the file size is too large or not a multiple of BLCKSZ, then
+	 * If the file size is too large or not a multiple of BLCKSZ, or the
+	 * caller was unable to allocate a suitably sized block number array, then
 	 * something weird is happening, so give up and send the whole file.
 	 */
-	if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
+	if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE ||
+		!relative_block_numbers)
 		return BACK_UP_FILE_FULLY;
 
 	/*
-- 
2.39.2
