On 2/25/19 7:50 PM, Fujii Masao wrote:
On Mon, Feb 25, 2019 at 10:49 PM Laurenz Albe <laurenz.a...@cybertec.at> wrote:

I'm not playing devil's advocate here to annoy you.  I see the problems
with the exclusive backup, and I see how it can hurt people.
I just think that removing exclusive backup without some kind of help
like Andres sketched above will make people unhappy.

+1

Another idea is to improve the exclusive backup method so that it never
causes such issues. What about changing the exclusive backup mode of
pg_start_backup() so that it creates something like a backup_label.pending file
instead of backup_label? Then if the database cluster has a backup_label.pending
file but not recovery.signal (this is the case where the database is recovered
just after the server crashes while an exclusive backup is in progress),
recovery using that database cluster always ignores (or removes) the
backup_label.pending file and starts replaying WAL from the REDO location
that the pg_control file indicates. So this idea enables us to
work around the issue that an exclusive backup could cause.

It's an interesting idea.

On the other hand, the downside of this idea is that users need to change
their recovery procedure. When they want to do PITR using a backup that has
backup_label.pending, they need to not only create recovery.signal but also
rename backup_label.pending to backup_label. Renaming the backup_label file
is a brand-new step in their recovery procedure, and changing the recovery
procedure might be painful for some users. But IMO it's less painful than
removing the exclusive backup API altogether.
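
To make the extra step concrete, here is a minimal sketch of what the restore side would have to do under this proposal. It is Python rather than anything shipped with PostgreSQL, and the function name prepare_pitr is hypothetical; only the file names come from the proposal above.

```python
from pathlib import Path


def prepare_pitr(datadir: Path) -> None:
    """Hypothetical restore-side steps under the backup_label.pending idea."""
    pending = datadir / "backup_label.pending"
    label = datadir / "backup_label"

    # New step: the restored backup carries backup_label.pending, which
    # must be renamed so that recovery uses it as backup_label.
    if pending.exists():
        pending.rename(label)

    # Existing PG12 step: request archive recovery.
    (datadir / "recovery.signal").touch()
```

A restore script would call prepare_pitr(Path(...)) on the restored data directory before starting the server.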

Well, given that we have invalidated all prior recovery procedures in PG12, I'm not sure how big a deal that is. Of course, it's too late to make a change like this for PG12.

Thoughts?

Here's the really obvious bad thing: if users do not update their procedures and we ignore backup_label.pending on startup then they will end up with a corrupt database because it will not replay from the correct checkpoint. If we error on the presence of backup_label.pending then we are right back to where we started.

I know there are backup solutions that rely on copying all required WAL to pg_xlog/pg_wal before starting recovery. Those solutions would silently break in this case and end up in corruption. If we require recovery.signal then we still have the current problem of the cluster not starting after a crash.

BTW, if recovery.signal is created but backup_label.pending is not renamed
(this is the case where the operator forgets to rename the file even though
she or he creates the recovery.signal file, i.e., a misconfiguration), I think
that recovery should emit PANIC immediately with a HINT like
"HINT: rename backup_label.pending to backup_label if you want to do PITR".

This causes its own problems, as stated above.
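
For reference, the startup behavior proposed in this thread can be written down as a small decision function. This is only a sketch in Python, not PostgreSQL code: startup_action and its return strings are made up for illustration, and the case where both backup_label and backup_label.pending exist is not discussed in the thread, so the ordering below is an assumption.

```python
from pathlib import Path


def startup_action(datadir: Path) -> str:
    """Hypothetical summary of the proposed startup rules."""
    has_label = (datadir / "backup_label").exists()
    has_pending = (datadir / "backup_label.pending").exists()
    has_signal = (datadir / "recovery.signal").exists()

    if has_pending and has_signal:
        # Operator asked for recovery but forgot the rename.
        return "PANIC"
    if has_label:
        # Renamed label: replay from the backup's checkpoint.
        return "replay_from_backup_label"
    if has_pending:
        # Treated as a crash during an exclusive backup. The same files
        # are present in a restored backup whose procedure was not
        # updated, which is the silent-corruption case.
        return "ignore_pending_replay_from_pg_control"
    return "replay_from_pg_control"
```

The third branch is where the two situations cannot be told apart from the files alone.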

Regards,
--
-David
da...@pgmasters.net
