Bonjour Michaël,
The attached patch reorders the cluster fsyncing and the control file update in
"pg_rewind" so that the latter is done after all data are committed to disk,
so that the control file reflects the actual cluster status, similarly to what
is done by "pg_checksums", per the discussion in the thread about offline
enabling of checksums:
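
For illustration, here is a minimal standalone C sketch of the property the
patch enforces (the file names are made up, this is not pg_rewind code): all
data is fsynced first, and the "control file" is only rewritten afterwards, so
a crash between the two steps leaves the previous control file intact.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write buf to path, then fsync so the data is durable before returning. */
    static void
    write_and_sync(const char *path, const char *buf)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

        if (fd < 0) { perror(path); exit(1); }
        if (write(fd, buf, strlen(buf)) != (ssize_t) strlen(buf))
        { perror("write"); exit(1); }
        if (fsync(fd) != 0) { perror("fsync"); exit(1); }
        close(fd);
    }

    int
    main(void)
    {
        /* 1. Make all rewritten data durable first... */
        write_and_sync("datafile", "rewound block contents\n");
        /* 2. ...and only then advertise the new cluster state. */
        write_and_sync("controlfile", "state: rewind complete\n");
        return 0;
    }

With the reverse order, a crash between the two writes would leave a control
file that claims a state the data does not yet have.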
It would be an interesting property to see that it is possible to
retry a rewind of a node which has been partially rewound already,
where the operation failed in the middle.
Yes. I understand that the question is whether the Warning in the pg_rewind
documentation can be partially lifted. The short answer is that it is not
obvious.
Because that's the real deal here: as long as we know that its control
file is in its previous state, we can rely on it for retrying the
operation. Logically, I think that it should work, because we would
still try to fetch the same blocks modified since WAL forked from the
source server, by scanning the target's records from the last
checkpoint before the WAL fork point up to the last shutdown
checkpoint, and the operation is lossy by design when it comes to
dealing with file differences.
Have you tried to see if pg_rewind is able to repeat its operation for
specific scenarios?
I have run the non-regression tests. I'm not sure which scenarios are
covered there, but probably not an interruption in the middle of an fsync,
especially as fsync is usually disabled to speed up the tests :-)
One example is a database created on the promoted standby, used as the
source, and a second, different database created on the primary after
the standby has been promoted. You could make the tool exit() before
the rewind finishes, just before updating the control file, and see if
the operation is repeatable. Interrupting the tool would be fine as
well, though less controllable.
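
One way to build such a controlled test could be a fault-injection hook just
before the control file update, roughly like the sketch below (the environment
variable and update_control_file() are invented names for illustration, not
pg_rewind's actual code; the snippet assumes <stdio.h> and <stdlib.h>):

    /* Hypothetical snippet, placed in pg_rewind just before the
     * control file is rewritten, so the tool can be made to die
     * at the critical point on demand. */
    if (getenv("PG_REWIND_EXIT_BEFORE_CONTROLFILE") != NULL)
    {
        fprintf(stderr, "fault injection: exiting before control file update\n");
        exit(1);
    }
    update_control_file();      /* placeholder for the actual final step */

A test could then run pg_rewind once with the variable set, check that it
fails, and run it a second time normally to verify that the retry succeeds.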
It would be good to mention in the patch why the order matters.
Yep. This requires a careful analysis of pg_rewind's inner workings,
which I do not have time to do in the short term.
--
Fabien.