Hi Robert,

On Wed, Dec 13, 2023 at 2:16 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> > > not even in case of an intervening
> > > timeline switch. So, all of the errors in this function are warning
> > > you that you've done something that you really should not have done.
> > > In this particular case, you've either (1) manually removed the
> > > timeline history file, and not just any timeline history file but the
> > > one for a timeline for a backup that you still intend to use as the
> > > basis for taking an incremental backup or (2) tried to use a full
> > > backup taken from one server as the basis for an incremental backup on
> > > a completely different server that happens to share the same system
> > > identifier, e.g. because you promoted two standbys derived from the
> > > same original primary and then tried to use a full backup taken on one
> > > as the basis for an incremental backup taken on the other.
> >
> > Okay, but please consider two other possibilities:
> >
> > (3) I had a corrupted DB where I've fixed it by running pg_resetwal
> > and some cronjob just a day later attempted to take incremental and
> > failed with that error.
> >
> > (4) I had pg_upgraded (which calls pg_resetwal on fresh initdb
> > directory) the DB where I had cronjob that just failed with this error
> >
> > I bet that (4) is going to happen more often than (1), (2), which
> > might trigger users to complain on forums, support tickets.
>
> Hmm. In case (4), I was thinking that you'd get a complaint about the
> database system identifier not matching. I'm not actually sure that's
> what would happen, though, now that you mention it.
I've played with initdb/pg_upgrade (17->17) and I don't get a DBID
mismatch (of course the identifiers do differ after initdb), but I get
this instead:

$ pg_basebackup -c fast -D /tmp/incr2.after.upgrade -p 5432 --incremental /tmp/incr1.before.upgrade/backup_manifest
WARNING:  aborting backup due to backend exiting before pg_backup_stop was called
pg_basebackup: error: could not initiate base backup: ERROR:  timeline 2 found in manifest, but not in this server's history
pg_basebackup: removing data directory "/tmp/incr2.after.upgrade"

Also, I don't see a DBID in the manifest? Maybe it's a nuisance, but
all I'm trying to say is that if an automated cronjob with pg_basebackup
--incremental hits a freshly upgraded cluster, that error message
without an errhint() is going to scare some junior DBAs.

> > LGTM, all quick tests work from my end too. BTW: I have also scheduled
> > the long/large pgbench -s 14000 (~200GB?) - multiple day incremental
> > test. I'll let you know how it went.
>
> Awesome, thank you so much.

OK, so pgbench -i -s 14440 and pgbench -P 1 -R 100 -c 8 -T 259200 did
generate pretty large incrementals, so I had to abort the run due to
lack of space (I was expecting the incrementals to be much smaller, so
they took too much space). I initially suspected that the problem lay in
the uniform distribution of `\set aid random(1, 100000 * :scale)` in the
tpcb-like script, which UPDATEs the big pgbench_accounts:

$ du -sm /backups/backups/* /backups/archive/
216205  /backups/backups/full
215207  /backups/backups/incr.1
216706  /backups/backups/incr.2
102273  /backups/archive/

So I verified the recoverability yesterday anyway - the pg_combinebackup
"full incr.1 incr.2" took 44 minutes, and the subsequent archive WAL
recovery and promotion SUCCEEDED. An 8-way parallel seqscan for
sum(abalance) on pgbench_accounts and the other tables worked fine.
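(A side note on the distribution suspicion: pgbench's random() draws
aids uniformly over the whole ~1.4B-row keyspace, while
random_exponential() - which I switched to for the follow-up test -
concentrates accesses near the minimum. To double-check how localized
that is, here's a quick Monte Carlo sketch using the inverse-CDF form
of the truncated exponential from the pgbench docs; the script and the
~55% figure are from this simulation, not from the test itself:)

```python
import math
import random

def random_exponential(lo, hi, param, rng=random.random):
    # Inverse-CDF sample of the truncated exponential that pgbench
    # documents for \set ... random_exponential(lo, hi, param).
    u = rng()
    x = -math.log(1.0 - u * (1.0 - math.exp(-param))) / param  # in [0, 1)
    return lo + int(x * (hi - lo + 1))

random.seed(42)
scale = 14440
n_accounts = 100000 * scale
samples = 200_000
hot = sum(1 for _ in range(samples)
          if random_exponential(1, n_accounts, 8) <= n_accounts // 10)
# with param=8, roughly 55% of UPDATEs land in the lowest 10% of aids
print(f"{hot / samples:.1%}")
```

So with parameter 8, over half of all UPDATEs should hit the first tenth
of the table, which is why I expected visibly smaller incrementals below.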
The pg_combinebackup run was using 15-20% CPU (mostly %sys) while doing
roughly 60-80MB/s of reads and, separately, writes (slow, but that's the
maxed-out sequential I/O of a small Premium SSD on Azure).

So I've launched another, improved test (forcing more localized UPDATEs)
to see the more real-world space-effectiveness of the incremental
backup:

\set aid random_exponential(1, 100000 * :scale, 8)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;

But then... (and I have verified the low IDs for :aid above) the same
thing happened:

backups/backups$ du -sm /backups/backups/*
210229  /backups/backups/full
208299  /backups/backups/incr.1
208351  /backups/backups/incr.2

# pgbench_accounts has relfilenode 16486
postgres@jw-test-1:/backups/backups$ for L in 5 10 15 30 100 161 173 174 175 ; do md5sum full/base/5/16486.$L ./incr.1/base/5/16486.$L ./incr.2/base/5/16486.$L /var/lib/postgres/17/data/base/5/16486.$L ; echo; done
005c6bbb40fca3c1a0a819376ef0e793  full/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.1/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  ./incr.2/base/5/16486.5
005c6bbb40fca3c1a0a819376ef0e793  /var/lib/postgres/17/data/base/5/16486.5

[.. all the checksums match (!) for the above $L ..]
c5117a213253035da5e5ee8a80c3ee3d  full/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d  ./incr.1/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d  ./incr.2/base/5/16486.173
c5117a213253035da5e5ee8a80c3ee3d  /var/lib/postgres/17/data/base/5/16486.173

47ee6b18d7f8e40352598d194b9a3c8a  full/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a  ./incr.1/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a  ./incr.2/base/5/16486.174
47ee6b18d7f8e40352598d194b9a3c8a  /var/lib/postgres/17/data/base/5/16486.174

82dfeba58b4a1031ac12c23f9559a330  full/base/5/16486.175
21a8ac1e6fef3cf0b34546c41d59b2cc  ./incr.1/base/5/16486.175
2c3d89c612b2f97d575a55c6c0204d0b  ./incr.2/base/5/16486.175
73367d44d76e98276d3a6bbc14bb31f1  /var/lib/postgres/17/data/base/5/16486.175

So to me it looks like it copied 174 out of 175 segment files anyway,
lowering the effectiveness of that incremental backup to ~0%. The
commands used to generate those incremental backups were:

pg_basebackup -v -P -c fast -D /backups/backups/incr.1 --incremental=/backups/backups/full/backup_manifest
sleep 4h
pg_basebackup -v -P -c fast -D /backups/backups/incr.2 --incremental=/backups/backups/incr1/backup_manifest

The incrementals are being generated, but just for the first (0) segment
of the relation?
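If so, my (naive, outsider) mental model of what has to differ for
segments beyond the first is just an offset translation: block numbers
tracked for a relation are relation-relative, while each 1GB segment
file holds its own 0-based range of blocks. A toy sketch of that
bookkeeping (my own names and code, not PostgreSQL source):

```python
RELSEG_SIZE = 131072  # blocks per 1 GB segment at the default 8 kB block size

def blocks_to_send(modified_blocks, segno, relseg_size=RELSEG_SIZE):
    # Translate relation-relative modified block numbers into
    # file-relative block numbers for segment file number `segno`.
    lo = segno * relseg_size
    return sorted(b - lo for b in modified_blocks if lo <= b < lo + relseg_size)

# Block 5 lives in segment 0; block 131080 is block 8 of segment 1.
mods = {5, 131080}
print(blocks_to_send(mods, 0))  # [5]
print(blocks_to_send(mods, 1))  # [8]
```

If something in that translation (or in deciding which file a block
belongs to) only works for segment 0, you'd get exactly the pattern
below - incremental files for the first segment, full copies for the
rest.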
/backups/backups$ ls -l incr.2/base/5 | grep INCR
-rw------- 1 postgres postgres       12 Dec 14 21:33 INCREMENTAL.112
-rw------- 1 postgres postgres       12 Dec 14 21:01 INCREMENTAL.113
-rw------- 1 postgres postgres       12 Dec 14 21:36 INCREMENTAL.1247
-rw------- 1 postgres postgres       12 Dec 14 21:38 INCREMENTAL.1247_vm
[..note: no INCREMENTAL.$int.$segment files..]
-rw------- 1 postgres postgres       12 Dec 14 21:24 INCREMENTAL.6238
-rw------- 1 postgres postgres       12 Dec 14 21:17 INCREMENTAL.6239
-rw------- 1 postgres postgres       12 Dec 14 21:55 INCREMENTAL.827

# 16486 is pgbench_accounts
/backups/backups$ ls -l incr.2/base/5/*16486* | grep INCR
-rw------- 1 postgres postgres 14613480 Dec 14 21:00 incr.2/base/5/INCREMENTAL.16486
-rw------- 1 postgres postgres       12 Dec 14 21:52 incr.2/base/5/INCREMENTAL.16486_vm

/backups/backups$ find incr* -name INCREMENTAL.* | wc -l
1342
/backups/backups$ find incr* -name INCREMENTAL.*_* | wc -l    # VM or FSM
236
/backups/backups$ find incr* -name INCREMENTAL.*.* | wc -l    # not a single >1GB incremental relation segment
0

I'm quickly passing this info on and haven't really looked at the code
yet, but the problem should be somewhere around GetFileBackupMethod(),
and it should be easy to reproduce with the configure
--with-segsize-blocks=X switch.

-J.
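PS: In case it helps while reproducing - a toy model of how I understand
the relation-file naming that any per-segment decision has to parse
(regex and names are mine, modeled on the on-disk layout, not on the
server's actual parser). If such parsing treated anything with a "."
suffix as a non-relation file, later segments would silently fall back
to full copies, which would match what I'm seeing:

```python
import re

# "16486", "16486.175", "16486_vm", "16486_vm.3" -> (relfilenode, fork, segno)
PATTERN = re.compile(r"^(\d+)(?:_(vm|fsm|init))?(?:\.(\d+))?$")

def parse_relation_filename(name):
    m = PATTERN.match(name)
    if not m:
        return None  # not a relation data file (e.g. INCREMENTAL.* itself)
    oid, fork, seg = m.groups()
    return int(oid), fork or "main", int(seg) if seg else 0

print(parse_relation_filename("16486"))      # (16486, 'main', 0)
print(parse_relation_filename("16486.175"))  # (16486, 'main', 175)
print(parse_relation_filename("16486_vm"))   # (16486, 'vm', 0)
```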