On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz.a...@cybertec.at> wrote:
> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
> > I played around with incremental backup yesterday and tried $subject
> >
> > The WAL summarizer is running on the standby server, but when I try
> > to take an incremental backup, I get an error that I understand to mean
> > that WAL summarizing hasn't caught up yet.
> >
> > I am not sure if that is working as designed, but if it is, I think it
> > should be documented.
>
> I played with this some more.  Here is the exact error message:
>
> ERROR:  manifest requires WAL from final timeline 1 ending at 0/1967C260, but
> this backup starts at 0/1967C190
>
> By trial and error I found that when I run a CHECKPOINT on the primary,
> taking an incremental backup on the standby works.
>
> I couldn't fathom the cause of that, but I think that that should either
> be addressed or documented before v17 comes out.

I had a feeling this was going to be confusing. I'm not sure what to do about it, but I'm open to suggestions.

Suppose you take a full backup F; replay of that backup will begin with a checkpoint CF. Then you try to take an incremental backup I; replay will begin from a checkpoint CI. For the incremental backup to be valid, it must include all blocks modified after CF and before CI. But when the backup is taken on a standby, no new checkpoint is possible. Hence, CI will be the most recent restartpoint on the standby that has occurred before the backup starts. So, if F is taken on the primary and then I is immediately taken on the standby without the standby having done a new restartpoint, or if both F and I are taken on the standby and no restartpoint intervenes, then CF=CI. In that scenario, an incremental backup is pretty much pointless: every single incremental file would contain 0 blocks. You might as well just use the backup you already have, unless one of the non-relation files has changed. So, except in that unusual corner case, the fact that the backup fails isn't really costing you anything. In fact, there's a decent chance that it's saving you from taking a completely useless backup.

On the primary, this doesn't occur, because there, each new backup triggers a new checkpoint, so you always have CI>CF.

The error message is definitely confusing. The reason I'm not sure how to do better is that there is a large class of errors that a user could make that would trigger an error of this general type. I'm guessing that attempting a standby backup with CF=CI will turn out to be the most common one, but I don't think it'll be the only one that ever comes up. The code in PrepareForIncrementalBackup() focuses on what has gone wrong on a technical level rather than on what you probably did to create that situation. Indeed, the server doesn't really know what you did to create that situation.

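To make the failing condition concrete, here is a minimal sketch in Python. It is not the server code. It assumes, going only by the wording of the error message, that the check amounts to comparing the end LSN that the manifest records for its final timeline against the start LSN of the new backup. The function names are made up for illustration.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in PostgreSQL's 'hi/lo' hex notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def incremental_start_is_usable(manifest_end_lsn: str, backup_start_lsn: str) -> bool:
    """Hypothetical model of the check behind the error message.

    Assumption: the incremental backup must start at or after the point
    where the WAL required by the prior backup's manifest ends.
    """
    return parse_lsn(manifest_end_lsn) <= parse_lsn(backup_start_lsn)


# Values taken from the error message quoted above: the new backup starts
# before the manifest's WAL range ends, so the check fails.
print(incremental_start_is_usable("0/1967C260", "0/1967C190"))
```

With the values from Laurenz's report, the new backup's start LSN falls before the end of the manifest's WAL range, which is what you would expect when the standby has not performed a restartpoint since the full backup.
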
You could trigger the same error by taking a full backup on the primary and then trying to take an incremental backup based on it on a time-delayed standby (or a lagging standby) whose replay position is behind the primary, i.e. CI<CF.

More perversely, you could trigger the error by spinning up a standby, promoting it, taking a full backup, destroying the standby, removing the timeline history file from the archive, spinning up a new standby, promoting onto the same timeline ID as the previous one, and then trying to take an incremental backup relative to the full backup. This might actually succeed, if you take the incremental backup at a later LSN than the previous full backup, but, as you may guess, terrible things will happen to you if you try to use such a backup. (I hope you will agree that this would be a self-inflicted injury; I can't see any way of detecting such cases.) If the incremental backup LSN is earlier than the previous full backup LSN, this error will trigger.

So, given all the above, what can we do here?

One option might be to add an errhint() to the message. I had trouble thinking of something that was compact enough to be reasonable to include and yet reasonably accurate and useful, but maybe we can brainstorm and figure something out.

Another option might be to add more to the documentation, but it's all so complicated that I'm not sure what to write. It feels hard to make something that is brief enough to be worth including, accurate enough to help more than it hurts, and understandable enough that people who run into this will be able to make use of it.

I think I'm a little too close to this to really know what the best thing to do is, so I'm happy to hear suggestions from you and others.

-- 
Robert Haas
EDB: http://www.enterprisedb.com