pg_upgrade (and recovery) pitfalls
Hello -general!

(didn't want to pester -hackers with this, as I don't know if I've been doing something terribly wrong ;-)

Consider the following scenario/setup:
- 4 DB servers in 2 DCs
  - 1 primary (in DC1)
  - 1 sync secondary (in the other DC)
  - 2 async secondaries (distributed over the DCs)
- 1 server running BARMAN for backups/archiving
- Puppet takes care of the PG config, recovery.conf etc.
- ssh connections are only allowed and enabled to and from the BARMAN host/user, no ssh between the DB hosts
- as we switch between servers pretty often, BARMAN provides a single archive for all 4 servers; only the primary does WAL archiving
- General requirements are:
  - *always* have one sync secondary online (no exceptions)
  - best possible availability ==> shortest possible downtime

The approach for major upgrades is to upgrade 2 servers and see if everything works as expected, then
- either follow up with the remaining secondaries, or
- promote one of them and carry on with the old version

Since we always have 2 spare secondaries to fall back to, pg_upgrade in "--link" mode is a viable approach, as it's nice and fast.

The "naive" idea was to shut down all instances (starting with the primary to enable final syncs), run "pg_upgrade -k" on both the former primary and the former sync secondary, re-link the recovery.conf on the secondary, re-enable the "primary" IP and start both.

D'oh! The secondary is complaining about a different cluster identifier:
"FATAL: database system identifier differs between the primary and standby"

(From looking at the code, I could not determine straight away when and how this identifier is generated, but I guess it is somewhere in initdb.c?)

So, as we can't rsync (no ssh...), which would probably finish in a few seconds, a pg_basebackup is due. Which can be a PITA when the DB size is scraping on a TB and you have a single 1GB/sec connection. Bye, bye, availability (remember the primary requirement?).

==> So by now, we're only pg_upgrade'ing the primary and follow up with a pg_basebackup to the secondary, planning for much longer downtimes. <==

After finishing the pg_basebackup: re-link the recovery.conf, start. The recovery finds history files from higher timelines in the archive, starts to recover those (?) and then complains that the timeline doesn't match (don't take the numbers here too seriously, this is from a low-traffic test system, the fork off TL 1 was at least a year ago):

restored log file "00000002.history" from archive
restored log file "00000003.history" from archive
restored log file "00000004.history" from archive
FATAL: requested timeline 3 is not a child of this server's history
DETAIL: Latest checkpoint is at 9C/36044D28 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 69/8800.

This mess can probably be cleaned up manually (delete the 0000000[234].history files etc. on both the secondary and in the BARMAN archive), however to be 100% safe (or when you're inexperienced), you take another basebackup :/

And, after moving the *.history files out of the way in the archive, the secondary finally comes up and starts receiving WALs.

Sidenote: this second problem happened to us as well after a colleague promoted a secondary for some R/W tests w/out first disabling the archive_command in the morning and then re-built it using pg_basebackup in the afternoon. I have to say that it took me quite a while to figure out what was going on there... sudden timeline jumps, presumably out of nowhere?!?
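For anyone wondering where that identifier lives: it is stored in the cluster's pg_control file and can be read without starting the server, e.g. with pg_controldata. A quick sketch (the number shown is made up, only the field name is real; $PGDATA stands for the actual data directory):

    $ pg_controldata "$PGDATA" | grep 'Database system identifier'
    Database system identifier:           6591554012345678901

The same pg_controldata output also contains "Latest checkpoint's TimeLineID", i.e. the timeline the complaints below refer to. On a running instance the identifier can also be queried via SQL:

    SELECT system_identifier FROM pg_control_system();

(pg_control_system() exists since PostgreSQL 9.6.)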
Now, questions/wishes:
- did I do something obvious(ly) wrong?
- why does a recovery, based on a recovery.conf that points to a reachable primary (which obviously communicates its own timeline), still look for higher timelines' history files in the archive and try to jump onto those timelines? This doesn't seem reasonable to me at all...
- is there a way to have pg_upgrade/initdb use a particular "database system identifier" or some kind of "IV" for the new instance, allowing for identical upgrades on primary and secondary? Apart from that number (and the system tables' OIDs), the data directories should be identical, as far as I understand pg_upgrade's functionality? (and where do I find that identifier?)
- is there a way to have pg_upgrade/initdb initialize the new cluster on the same (or a higher) timeline, to prevent f***ing up the archive? If not, I'd certainly appreciate such an option! (I'm well aware that I need a new basebackup after the upgrade anyway, but alas!)
- any other hints (apart from rsync or other ssh-based methods) on how I could get around the pg_basebackup would be highly appreciated. Something that only clones the system tables and not the bulk of the (identical) data files?
- is my situation (mandatory sync secondary) so unusual that nobody ever thought about the above? ;-)

Cheers & best regards,

Gunnar "Nick" Bluth
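For reference, a standby recovery.conf for a setup like the one described above might look roughly like this; all hostnames, user names and the application_name are placeholders, not values taken from this thread (barman-wal-restore is the archive-fetching helper shipped with the barman-cli package):

    # recovery.conf on the standby (hypothetical values)
    standby_mode = 'on'
    primary_conninfo = 'host=pg-primary.example port=5432 user=replication application_name=standby1'
    # fetch archived WAL segments *and* timeline history files from the BARMAN host
    restore_command = 'barman-wal-restore barman.example pgcluster %f %p'
    # follow the newest timeline found in the archive after a promotion
    recovery_target_timeline = 'latest'

If recovery_target_timeline = 'latest' is set (a common choice for standbys that should follow promotions), the restore_command is what makes the standby scan the archive for higher-numbered *.history files even while a reachable primary is configured - which would explain the behaviour described above.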
Re: pg_upgrade (and recovery) pitfalls
Stephen Frost – Thu, 16 August 2018 19:00

> Greetings,

I salute you, Stephen!

TL;DR: I blundered by not spotting an easter egg of my predecessors.

> * PO (gunnar.bl...@pro-open.de) wrote:
> > Consider the following scenario/setup:
> > - 4 DB servers in 2 DCs
> >   - 1 primary (in DC1)
> >   - 1 sync secondary (in the other DC)
> >   - 2 async secondaries (distributed over the DCs)
>
> I'm a bit surprised that you're ok with the latency imposed by using
> sync replication to another data center. I'm guessing they're pretty
> close to each other?

Yep, they are, as close as they're allowed to be in order to fulfil regulatory requirements. Availability (and consistency) is far more relevant than performance; the machines are at ~5% load most of the time.

> > - General requirements are:
> >   - *always* have one sync secondary online (no exceptions)
>
> Well, you kind of have to or everything stops. ;)

Guess why we have 4 servers in each cluster ;-)

> > The "naive" idea was to shut down all instances (starting with the primary to
> > enable final syncs), run "pg_upgrade -k" on both the former primary and the
> > former sync secondary, re-link the recovery.conf on the secondary, re-enable
> > the "primary" IP and start both.
> >
> > D'oh! The secondary is complaining about a different cluster identifier:
> > "FATAL: database system identifier differs between the primary and standby"
>
> No, you can't do that.

Well, yeah, learned that the hard way. It does make sense after all, which is why I called the approach "naive" in the first place.

> > (From looking at the code, I could not determine straight away when and how
> > this identifier is generated, but I guess it is somewhere in initdb.c?)
>
> Yes.
>
> > So, as we can't rsync (no ssh...), which would probably finish in a few
> > seconds, a pg_basebackup is due. Which can be a PITA when the DB size is
> > scraping on a TB and you have a single 1GB/sec connection. Bye, bye,
> > availability (remember the primary requirement?).
>
> The rsync *might* finish quickly but it depends a lot on the specifics
> of your environment - for example, the rsync method doesn't do anything
> for unlogged tables, so if you have large unlogged tables you can end up
> with them getting copied over and that can take a long time, so, some
> prep work should be done to make sure you nuke any unlogged tables
> before you go through with the process (or do something similar).

Nah, nothing sophisticated like unlogged tables here ;-)))

> pg_basebackup has the unfortunate issue that it's single-threaded,
> meaning that enabling compression probably will cause the system to
> bottleneck on the single CPU before reaching your 1Gb/s bandwidth
> limit anyway. You could parallelize the backup/restore using pgbackrest
> or, in recent versions I think, with barman, and that should at least
> get you to be able to fill the 1Gb/s pipe with compressed data for the
> backup. You're likely still looking at an hour or more though to get
> all that data copied over that small a pipe.

Yep, that's the pain about it.

> > ==> So by now, we're only pg_upgrade'ing the primary and follow up with a
> > pg_basebackup to the secondary, planning for much longer downtimes. <==
>
> I have to say that I probably would argue that you should really have at
> least two replicas in the same DC as the primary and then use
> quorum-based synchronous replication. Presumably, that'd also increase
> the bandwidth available to you for rebuilding the replica, reducing the
> downtime associated with that. That might also get you to the point
> where you could use the rsync method that's discussed in the pg_upgrade
> docs to get the replicas back online.
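Since the quorum suggestion comes up here: from PostgreSQL 10 on, synchronous_standby_names accepts a quorum form. A minimal sketch, assuming three standbys whose primary_conninfo uses the application_name values standby1..standby3 (all names are hypothetical, not from this thread):

    # postgresql.conf on the primary (hypothetical standby names)
    # commit waits for an acknowledgement from ANY one of the three listed standbys
    synchronous_standby_names = 'ANY 1 (standby1, standby2, standby3)'

The ANY form is quorum-based; the older FIRST form (which a plain comma-separated list is shorthand for) instead ranks the listed standbys by priority.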
> > After finishing the pg_basebackup, re-link the recovery.conf, start.
> > The recovery finds history files from higher timelines in the archive,
> > starts to recover those (?) and then complains that the timeline doesn't match
> > (don't take the numbers here too seriously, this is from a low-traffic test
> > system, the fork off TL 1 was at least a year ago):
> >
> > restored log file "00000002.history" from archive
> > restored log file "00000003.history" from archive
> > restored log file "00000004.history" from archive
> > FATAL: requested timeline 3 is not a child of this server's history
> > DETAIL: Latest checkpoint is at 9C/36044D28 on timeline 1, but in the
> > history of the requested timeline, the server forked off from that timeline at
> > 69/8800.
> >
> > This mess can probably be cleaned up manually (delete