We are running postgres 9.6, with 10+TB size. Backups have been taken using a homegrown tool "pgrsync", which uses S3 as the repository. Both the backup files and WAL archives are stored on S3.
Problem: While trying to restore, the WAL archives restore randomly fails for some backups with the following message in logs: 2020-11-12 06:33:32 UTC [10037]: [27988-1] user=,db=LOG: redo done at 5493D/2EFFF568 2020-11-12 06:33:32 UTC [10037]: [27989-1] user=,db=LOG: last completed transaction was at log time 2020-11-06 12:31:27.796805+00 2020-11-12 06:33:34 UTC [10037]: [27990-1] user=,db=LOG: restored log file "000000020005493D0000002E" from archive 2020-11-12 06:33:34 UTC [10037]: [27991-1] user=,db=FATAL: WAL ends before end of online backup 2020-11-12 06:33:34 UTC [10037]: [27992-1] user=,db=HINT: All WAL generated while online backup was taken must be available at recovery. 2020-11-12 06:33:36 UTC [10033]: [3-1] user=,db=LOG: startup process (PID 10037) exited with exit code 1 2020-11-12 06:33:36 UTC [10033]: [4-1] user=,db=LOG: terminating any other active server processes 2020-11-12 06:33:48 UTC [10033]: [5-1] user=,db=LOG: database system is shut down In this case, the backup start location is 00000002000544C60000006B and stop location is 00000002000545210000008D, (based on pg_stop_backup() output) but it stops in between at 005493D and terminates the restore. If I redo the restore again, it stops exactly at the same point. Similar results from couple of more backups, while other backups successfully restores. It is possibly an indication of some specific WAL files got corrupted during the backup/restore process. Is that the correct interpretation? Questions: 1. Are there any ways of identifying if the WAL file is corrupted? 2. Is there a way to move ahead without losing data? (I am wary of using pg_resetxlog)