Hello,

Our customer hit another pg_rewind bug on PG 9.5. The attached patch fixes it.
PROBLEM
========================================
After a long series of successful pg_rewind runs, the synchronized standby was
never able to catch up with the primary, and repeatedly emitted the following
message:

LOG: XX000: could not read from log segment 000000060000028A00000031, offset 16384: No error
CAUSE
========================================
If the primary removes WAL files that pg_rewind is going to get, pg_rewind
leaves 0-byte WAL files in the target directory here:
[libpq_fetch.c]
            case FILE_ACTION_COPY:
                /* Truncate the old file out of the way, if any */
                open_target_file(entry->path, true);
                fetch_file_range(entry->path, 0, entry->newsize);
                break;
pg_rewind completes successfully, recovery.conf is created, and the standby is
then started in the target cluster. The walreceiver receives WAL records and
writes them to the 0-byte WAL files. Finally, the xlog reader complains that it
cannot read a WAL page.
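
To illustrate why a truncated segment file can produce a read failure with no
errno set, here is a minimal standalone sketch. It is not PostgreSQL code: the
helper name, the scratch file, and the 8192-byte page size are assumptions made
for the example. It truncates a file, writes a few pages from offset 0, and
then reads one page at a given offset. A read past end-of-file returns 0 bytes
and leaves errno at 0, which would be consistent with the ": No error" suffix
in the message above.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE 8192     /* assumed stand-in for the WAL page size */

/*
 * Hypothetical demo helper: truncate `path` to 0 bytes, write `npages` pages
 * starting at offset 0, then try to read one page at `read_off`.
 * Returns the byte count reported by read(), or -1 on an I/O error.
 * *err receives the value of errno right after the read.
 */
static ssize_t
read_page_after_truncate(const char *path, int npages, off_t read_off, int *err)
{
    char    page[DEMO_PAGE_SIZE];
    ssize_t n;
    int     fd;
    int     i;

    /* Same effect as "truncate the old file out of the way" */
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The file now grows only as far as the data actually written */
    memset(page, 'x', sizeof(page));
    for (i = 0; i < npages; i++)
    {
        if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        {
            close(fd);
            return -1;
        }
    }

    if (lseek(fd, read_off, SEEK_SET) < 0)
    {
        close(fd);
        return -1;
    }

    errno = 0;
    n = read(fd, page, sizeof(page));   /* 0 at or past end-of-file */
    *err = errno;

    close(fd);
    unlink(path);
    return n;
}
```

With one page written, reading at offset 16384 returns 0 bytes with errno 0; a
caller that expects a full page treats this as a failure even though the
operating system reported no error.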
FIX
========================================
The patch makes pg_rewind delete the target file when it finds that the primary
has deleted it.
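
The idea can be sketched as follows. This is a simplified standalone
illustration, not the attached patch and not pg_rewind's actual code:
apply_fetched_chunk() is a hypothetical helper, and a NULL data pointer is
assumed to stand for "the source reported that the file no longer exists".

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: apply one fetched chunk to the target file.
 * data == NULL means the source no longer has the file; in that case the
 * target copy (possibly truncated to 0 bytes earlier) is removed instead of
 * being left behind.  Returns 0 on success, -1 on error.
 */
static int
apply_fetched_chunk(const char *target_path, const char *data,
                    size_t len, off_t offset)
{
    int     fd;

    if (data == NULL)
    {
        /* File vanished on the source: remove it; already-missing is fine */
        if (unlink(target_path) != 0 && errno != ENOENT)
            return -1;
        return 0;
    }

    /* Normal case: write the chunk at the requested offset */
    fd = open(target_path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (lseek(fd, offset, SEEK_SET) < 0 ||
        write(fd, data, len) != (ssize_t) len)
    {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Removing the file means the standby later sees no segment at all, rather than
an existing but empty one.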
OTHER THOUGHTS
========================================
BTW, should pg_rewind really copy WAL files from the primary? If the sole
purpose of pg_rewind is to recover an instance for use as a standby, couldn't
pg_rewind just remove all WAL files in the target directory, since the
standby can get WAL files from the primary and/or the archive?

Related to this, shouldn't pg_rewind skip more files and directories, as
pg_basebackup does? Currently, pg_rewind doesn't copy postmaster.pid,
postmaster.opts, and temporary files/directories (pgsql_tmp).
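
For discussion, the current exclusion rule described above could be sketched as
a small path filter. This is only an illustration under assumptions, not
pg_rewind's actual code: is_skipped_path() is a hypothetical name, paths are
assumed to be relative to the data directory, and the temporary-file prefix is
assumed to be "pgsql_tmp". Extending the list would just mean adding more
checks here.

```c
#include <stdbool.h>
#include <string.h>

#define TEMP_PREFIX "pgsql_tmp"     /* assumed temporary file/directory prefix */

/*
 * Hypothetical filter: return true if `path` (relative to the data
 * directory) should not be copied from the source.
 */
static bool
is_skipped_path(const char *path)
{
    const char *p = path;

    if (strcmp(path, "postmaster.pid") == 0 ||
        strcmp(path, "postmaster.opts") == 0)
        return true;

    /* Skip anything with a path component that starts with the temp prefix */
    while (*p != '\0')
    {
        if (strncmp(p, TEMP_PREFIX, strlen(TEMP_PREFIX)) == 0)
            return true;

        p = strchr(p, '/');
        if (p == NULL)
            break;
        p++;                        /* move to the start of the next component */
    }

    return false;
}
```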
Regards
Takayuki Tsunakawa
Attachment: pg_rewind_corrupt_wal.patch
