On Thu, Feb 21, 2013 at 4:04 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> I'd like to see the contents of the WAL, starting from the last
> checkpoint, up to the point where failover happened. In particular, any
> actions on the relation base/16385/16430, which caused the error.
> pg_controldata output on the base backup would also interesting, as well as
> the contents of backup_label file.
>
> How long did the standby run between the base backup and the failover? How
> many WAL segments?
>
> One more thing you could try to narrow down the error: restore from the
> base backup, and let it run up to the point of failover, but shut it down
> just before the failover with "pg_ctl stop -m fast". That should create a
> restartpoint, at the latest checkpoint record. Then restart, and perform
> failover. If it still throws the same error, we know that the WAL record
> that touched the page that doesn't exist was after the last checkpoint.

Unfortunately, it looks like we lost the bad WAL segments and the necessary base backup due to our archiving mechanism. We don't yet have a principled way of saving systems for forensics. I thought I had manually accounted for everything to keep this "on ice", but I missed a step and the system was archived. I apologize. I'll see if I can add something for us to better support this.

For what it's worth, the failover was done at 2013-02-14 23:55:44 +0000 and the base backup used was dated 2013-02-15 00:49:22 +0000.

I'll follow up in case we run into this again.