Hello,

Today I bumped into an issue with pg_rewind, and it is not 100% clear to me where it would be better fixed.
The first call of pg_rewind failed with the following message:

    servers diverged at WAL location A76/39E55338 on timeline 132
    could not open file "/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E": No such file or directory
    could not find previous WAL record at A76/1EFFE620
    Failure, exiting

In order to avoid rebuilding the replica from scratch, we restored the missing file by calling restore_command (wal-g) and repeated the call of pg_rewind. The second time pg_rewind also failed, but the error looked different:

    servers diverged at WAL location A76/39E55338 on timeline 132
    rewinding from last common checkpoint at A76/1EF254B8 on timeline 132
    could not remove file "/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024": No such file or directory
    Failure, exiting

The second call left PGDATA in an inconsistent state (empty pg_control).

A few words about where pg_wal/.wal-g/prefetch/running/ comes from: by default, when wal-g fetches a WAL file it also tries to prefetch the next few WAL files. To do that it forks; the child process does the prefetch while the parent process exits. In order to avoid multiple parallel prefetches of the same file, wal-g keeps its state in the pg_wal/.wal-g directory. It also keeps the prefetched files there.

What in fact happened: pg_rewind builds a list of files in the target directory which don't match the source directory and therefore must be changed (copied/removed/truncated/etc). When the list was built, the wal-g prefetch was still running, so its state file was included in the list. By the time pg_rewind tried to remove that file (it should not be there, because it doesn't exist in the source directory), the prefetch had finished and the file was already gone, so the removal failed with a fatal error.

The issue occurred on 10.14, but I believe something very similar might happen with postgres 13 when pg_rewind is called with the --restore-target-wal option.
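To make the sequence concrete, here is a minimal sketch of the race in plain C. It is not pg_rewind code: strict_remove() and simulate_race() are hypothetical names, and the unlink() in step 2 merely stands in for the wal-g prefetch child cleaning up its own state file.

```c
#define _POSIX_C_SOURCE 200809L	/* for lstat() and unlink() under strict -std= modes */

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a strict removal step: every unlink() failure,
 * including ENOENT, is reported as an error.
 * Returns 0 on success, otherwise the errno value of the failure.
 */
int
strict_remove(const char *path)
{
	assert(path != NULL);

	if (unlink(path) != 0)
	{
		int			save_errno = errno;

		fprintf(stderr, "could not remove file \"%s\": %s\n",
				path, strerror(save_errno));
		return save_errno;
	}
	return 0;
}

/*
 * Replays the three steps on a scratch file.
 * Returns the result of the strict removal, or -1 if the setup itself failed.
 */
int
simulate_race(const char *path)
{
	struct stat st;
	FILE	   *f;

	assert(path != NULL);

	/* the "prefetch" creates its state file */
	f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fclose(f);

	/* step 1: the directory scan sees the file and records it for removal */
	if (lstat(path, &st) != 0)
		return -1;

	/* step 2: the "prefetch" finishes and removes its own state file */
	if (unlink(path) != 0)
		return -1;

	/* step 3: removing the recorded path now fails with ENOENT */
	return strict_remove(path);
}
```

Calling simulate_race() with a path in a writable scratch directory returns ENOENT from the last step, which is the same errno behind the "could not remove file" message above.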
One might argue that the issue should be fixed in wal-g (it should not mess with pg_wal), and I personally 99% agree with that, but so far this behavior was safe, because postgres itself never looks inside unexpected directories in pg_wal. Also, from the usability perspective it is very convenient to keep prefetched files in pg_wal/.wal-g, because it guarantees that they are located on the same filesystem as pg_wal, and therefore the next time restore_command is called it is enough to just rename the file.

That made me think about how this could be improved in pg_rewind. The thing is that we want a specific file to be removed, and it is already not there. Should that be a fatal error? traverse_datadir()/recurse_dir() already ignore all failed lstat() calls with errno == ENOENT.

Basically I have two options:

1. Modify remove_target_file(const char *path, bool missing_ok) and remove the missing_ok option, so that a missing file is always ignored; that would be consistent with recurse_dir().

2. Change the logic of remove_target_file() so that it doesn't exit with a fatal error if the file is missing, but only shows a warning.

In addition to either option, remove_target_dir() should also be improved, i.e. it should check errno and behave like remove_target_file() when errno == ENOENT.

What do you think?

Regards,
--
Alexander Kukushkin
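P.S. To make option 2 a bit more concrete, here is a rough standalone sketch of the idea. This is not a patch: remove_file_tolerant() and remove_dir_tolerant() are made-up names, and they return a status code instead of exiting on error the way pg_rewind does.

```c
#define _POSIX_C_SOURCE 200809L	/* for unlink() and rmdir() under strict -std= modes */

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical tolerant file removal: an already-missing file only
 * produces a warning, any other failure is still an error.
 * Returns 0 if the file is gone afterwards, -1 on a real error.
 */
int
remove_file_tolerant(const char *path)
{
	assert(path != NULL);

	if (unlink(path) == 0)
		return 0;

	if (errno == ENOENT)
	{
		/* somebody else removed it first; the end state is what we wanted */
		fprintf(stderr, "warning: file \"%s\" is already gone, skipping\n", path);
		return 0;
	}

	fprintf(stderr, "could not remove file \"%s\": %s\n", path, strerror(errno));
	return -1;
}

/*
 * The same policy for directories, using rmdir().
 * Returns 0 if the directory is gone afterwards, -1 on a real error.
 */
int
remove_dir_tolerant(const char *path)
{
	assert(path != NULL);

	if (rmdir(path) == 0)
		return 0;

	if (errno == ENOENT)
	{
		fprintf(stderr, "warning: directory \"%s\" is already gone, skipping\n", path);
		return 0;
	}

	fprintf(stderr, "could not remove directory \"%s\": %s\n", path, strerror(errno));
	return -1;
}
```

With this policy, a file that disappears between the scan and the removal no longer aborts the run, while errors such as EACCES are still reported as failures.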