Hi hackers,

While investigating a customer support case, I found that walsender does not verify the CRC of WAL records and sends them to replicas without any checks.
As a result, a damaged WAL record causes errors on all replicas:

        LOG: incorrect resource manager data checksum in record at 5FB9/D199F7D8
        FATAL: terminating walreceiver process due to administrator command

I wonder whether it would be better to detect this problem earlier, on the master.
We could try to recover the damaged WAL record (it is not always possible, but...).
Or at least we could stop advancing replication slots, so that the DBA can restore the corrupted WAL segment from the archive and resume replication.
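
To make the idea concrete, here is a rough, self-contained sketch of the kind of check I have in mind. It is not walsender code: DemoRecord, crc32c_update, demo_record_crc and demo_record_is_valid are hypothetical names, and the header is a simplified stand-in for the real XLogRecord. The only things taken from PostgreSQL are the checksum flavour (CRC-32C) and the order in which bytes are fed to it (payload first, then the header fields preceding the CRC field). The bitwise CRC loop is chosen for brevity; a real implementation would use the existing optimized routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record header; NOT the real XLogRecord layout. */
typedef struct DemoRecord
{
    uint32_t tot_len;   /* total record length, header included */
    uint32_t crc;       /* CRC-32C of payload, then of the fields before crc */
} DemoRecord;

/* Plain bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Compute the checksum of a record: payload first, then the header prefix. */
static uint32_t
demo_record_crc(const DemoRecord *rec)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(rec->tot_len >= sizeof(DemoRecord));
    crc = crc32c_update(crc, (const char *) rec + sizeof(DemoRecord),
                        rec->tot_len - sizeof(DemoRecord));
    crc = crc32c_update(crc, rec, offsetof(DemoRecord, crc));
    return crc ^ 0xFFFFFFFFu;
}

/*
 * The check a sender could run before streaming a record.  On failure it
 * would stop, without advancing the slot, instead of shipping the damage.
 */
static bool
demo_record_is_valid(const DemoRecord *rec)
{
    if (rec->tot_len < sizeof(DemoRecord))
        return false;
    return demo_record_crc(rec) == rec->crc;
}
```

With something like this, a single flipped bit in the payload makes demo_record_is_valid() return false on the sending side, before any replica sees the record.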

Right now the only option is to rebuild the replicas from a base backup, which may take a significant amount of time for a large database.
During this time the master is not protected from failures.

Or is the extra overhead of computing the CRC in walsender assumed to be too high?

Sorry if this question has already been discussed; I failed to find it in the archives.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


