Hello -general!

(didn't want to pester -hackers with this, as I don't know if I've been doing 
something terribly wrong ;-)


Consider the following scenario/setup:
- 4 DB servers in 2 DCs
  - 1 primary (in DC1)
  - 1 sync secondary (in the other DC)
  - 2 async secondaries (distributed over the DCs)
- 1 server running BARMAN for backups/archiving

- Puppet takes care of the PG config, recovery.conf etc.

- ssh connections are only allowed to and from the BARMAN host/user; there is 
no ssh between the DB hosts.

- as we switch between servers pretty often, BARMAN provides a single archive 
for all 4 servers. Only the current primary does WAL archiving (a rough sketch 
of that follows after this list).

- General requirements are:
  - *always* have one sync secondary online (no exceptions)
  - best possible availability ==> shortest possible downtime
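
Just for illustration, the archiving on the (current) primary is handled along 
these lines - a rough sketch with placeholder host, server name and paths, not 
our actual config:

archive_mode = on
# ssh/rsync to the BARMAN host is the one connection that is allowed
archive_command = 'rsync -a %p barman@barman-host:/var/lib/barman/pgcluster/incoming/%f'

The standbys simply don't archive; Puppet takes care of flipping this around on 
a switchover.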



The approach for major upgrades is to upgrade 2 servers and see if everything 
works as expected, then 
- either follow up with the remaining secondaries or
- promote one of them and carry on with the old version

Since we always have 2 spare secondaries to fall back to, pg_upgrade in 
"--link" mode is a viable approach, as it's nice and fast.


The "naive" idea was to shutdown all instances (starting with the primary to 
enable final syncs), run "pg_upgrade -k" on both the former primary and the 
former sync secondary, re-link the recovery.conf on the secondary, re-enable 
the "primary" IP and start both.

D'oh! The secondary is complaining about a different cluster identifier:
"FATAL:  database system identifier differs between the primary and standby"

(From looking at the code, I could not determine straight away when and how 
this identifier is generated, but I guess it is somewhere in initdb.c?)
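
For what it's worth, the identifier itself can at least be read with 
pg_controldata (the value here is just an example, not from a real cluster):

$ pg_controldata $NEWDATA | grep 'system identifier'
Database system identifier:           6653389659346239363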

So, as we can't rsync (no ssh...), which would probably finish in a few 
seconds, a pg_basebackup is due. That can be a PITA when the DB size is 
pushing a TB and you have a single 1 GB/sec connection. Bye-bye, availability 
(remember the primary requirement?).
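
The standby rebuild itself is then the usual pg_basebackup, roughly (host and 
user being placeholders):

pg_basebackup -h primary-host -U replication -D $NEWDATA -X stream -P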

==> So by now, we only pg_upgrade the primary and follow up with a 
pg_basebackup to the secondary, planning for much longer downtimes. <==


After finishing the pg_basebackup: re-link the recovery.conf, start. 
The recovery finds history files from higher timelines in the archive, 
restores those (?) and then complains that the timeline doesn't match (don't 
take the numbers here too seriously; this is from a low-traffic test system, 
and the fork off TL 1 was at least a year ago):

restored log file "00000002.history" from archive
restored log file "00000003.history" from archive
restored log file "00000004.history" from archive
FATAL:  requested timeline 3 is not a child of this server's history
DETAIL:  Latest checkpoint is at 9C/36044D28 on timeline 1, but in the history 
of the requested timeline, the server forked off from that timeline at 
69/88000000.

This mess can probably be cleaned up manually (delete the 
0000000[234].history files on both the secondary and in the BARMAN archive), 
but to be 100% safe (or when you're inexperienced), you take another 
basebackup :/
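
In shell terms, that manual cleanup would be something like this (paths are 
placeholders, and I'm not claiming this catches every copy of those files):

rm $NEWDATA/pg_wal/0000000[234].history       # pg_xlog/ on pre-10 clusters
ssh barman@barman-host 'find /var/lib/barman/pgcluster/wals -name "0000000[234].history" -delete'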

And - after moving the *.history files out of the way in the archive - the 
secondary finally comes up and starts receiving WAL.
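
For context, the recovery.conf in play here is nothing fancy; modulo 
placeholders it boils down to something like the following, and I assume the 
recovery_target_timeline = 'latest' line is what sends the standby hunting for 
newer history files in the first place:

standby_mode = 'on'
primary_conninfo = 'host=192.0.2.10 port=5432 user=replication application_name=standby1'
restore_command = 'barman-wal-restore barman-host pgcluster %f %p'
recovery_target_timeline = 'latest'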

Sidenote: this second problem bit us as well after a colleague promoted a 
secondary for some R/W tests in the morning without first disabling the 
archive_command, and then re-built it using pg_basebackup in the afternoon. 
I have to say that it took me quite a while to figure out what was going on 
there... sudden timeline jumps, seemingly out of nowhere?!


Now, questions/wishes:
- did I do something obvious(ly) wrong?

- why does a recovery, based on a recovery.conf that points to a reachable 
primary (which obviously communicates its own timeline), still look for higher 
timelines' history files in the archive and try to jump onto those timelines? 
This doesn't seem reasonable to me at all...

- is there a way to have pg_upgrade/initdb use a particular "database system 
identifier" (or some kind of "IV") for the new instance, allowing for 
identical upgrades on primary and secondary? Apart from that number (and the 
system tables' OIDs), the data directories should be identical, as far as I 
understand pg_upgrade's functionality? 
  (- and is pg_control the only place that identifier lives?)

- is there a way to have pg_upgrade/initdb initialize the new cluster on the 
same (or a higher) timeline, to prevent f***ing up the archive? If not, I'd 
certainly appreciate such an option! (I'm well aware that I need a new 
basebackup after the upgrade anyway, but alas!)

- any other hints (apart from rsync or other ssh-based methods) on how I could 
get around the pg_basebackup would be highly appreciated. Something that only 
clones the system tables and not the bulk of the (identical) data files?
- is my situation (mandatory sync secondary) so unusual that nobody ever 
thought about the above? ;-)

Cheers & best regards,

Gunnar "Nick" Bluth
