On Wed, Mar 20, 2019 at 9:00 AM Michael Paquier <mich...@paquier.xyz> wrote: > On Wed, Mar 20, 2019 at 04:29:35PM +0400, David Steele wrote: > > Please note that there have been objections to the patch later in this > > thread by Peter and Robert. I'm not very interested in watering down the > > documentation changes as Peter suggests, but I think at the very least we > > should commit the added hints in the error message. For many users this > > error will be their first point of contact with the backup_label > > issue/behavior. > > The updates of the log message do not imply anything negative as I > read them as they mention to not remove the backup_label file. So > while we don't have an agreement about the docs, the log messages may > be able to be committed? Peter? Robert? > > "will result in a corruptED cluster" is more correct?
I really like the proposed changes to the ereport() text. I think the "Be careful" hint is a really helpful way of phrasing it. I think "corrupt" as the patch has it is slightly better than "corrupted". Obviously, we have to make the updates for recovery.signal no matter what, and you could argue that part should be its own commit, but I like all of the changes. I'm not too impressed with the documentation changes. A lot of the information being added is already present somewhere in that very same section. It's reasonable to revise the section so that the dangers stand out more clearly, but it doesn't seem good to revise it in a way that ends up duplicating the existing information. Here's my suggestion -- ditch the note at the end of the section and make the one at the beginning read like this: The exclusive backup method is deprecated and should be avoided. Prior to <productname>PostgreSQL</productname> 9.6, this was the only low-level method available, but it is now recommended that all users upgrade their scripts to use non-exclusive backups. Then, revise the first paragraph like this: The process for an exclusive backup is mostly the same as for a non-exclusive one, but it differs in a few key steps. This type of backup can only be taken on a primary and does not allow concurrent backups. Moreover, because it writes a backup_label file on the master, it can cause the master to fail to restart automatically after a crash. On the other hand, the erroneous removal of a backup_label file from a backup or standby is a common mistake which can can result in serious data corruption. If it is necessary to use this method, the following steps may be used. Later, where it says: Note that if the server crashes during the backup it may not be possible to restart until the <literal>backup_label</literal> file has been manually deleted from the <envar>PGDATA</envar> directory. Change it to read: As noted above, if the server crashes during the backup it may not be possible to restart until the <literal>backup_label</literal> file has been manually deleted from the <envar>PGDATA</envar> directory. Note that it is very important to never to remove the <literal>backup_label</literal> file when restoring a backup, because this will result in corruption. Confusion about the circumstances under which it is appropriate to remove this file is a common cause of data corruption when using this method; be very certain that you remove the file only on an existing master and never when building a standby or restoring a backup, even if you are building a standby that will subsequently be promoted to a new master. I also think we should revise this thoroughly terrible advice: If you wish to place a time limit on the execution of <function>pg_stop_backup</function>, set an appropriate <varname>statement_timeout</varname> value, but make note that if <function>pg_stop_backup</function> terminates because of this your backup may not be valid. That seems awful, not only because it encourages people to do that particular thing and end up leaving the server in backup mode, but also because it doesn't clearly articulate the extreme importance of making sure that the server is not left in backup mode. So I would propose that we strike that text entirely and replace it with something like: When using exclusive backup mode, it is absolutely imperative to make sure that <function>pg_stop_backup</function> completes successfully at the end of the backup. Even if the backup itself fails, for example due to lack of disk space, failure to call <function>pg_stop_backup</function> will leave the server in backup mode indefinitely, causing future backups to fail and increasing the risk of a restart during a time when <literal>backup_label</literal> exists. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company