On Fri, Feb 17, 2012 at 7:36 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> On 17.02.2012 07:27, Fujii Masao wrote:
>>
>> Got another problem: when I ran pg_stop_backup to take an online backup,
>> it got stuck until I had generated new WAL record. This happens because,
>> in the patch, when pg_stop_backup forces a switch to new WAL file, old
>> WAL file is not marked as archivable until next new WAL record has been
>> inserted, but pg_stop_backup keeps waiting for that WAL file to be
>> archived.
>> OTOH, without the patch, WAL file is marked as archivable as soon as WAL
>> file switch occurs.
>>
>> So, in short, the patch seems to handle the WAL file switch logic
>> incorrectly.
>
>
> Yep. For a WAL-switch record, XLogInsert returns the location of the end of
> the record, not the end of the empty padding space. So when the caller
> flushed up to that point, it didn't flush the empty space and therefore
> didn't notify the archiver.
>
> Attached is a new version, fixing that, and off-by-one bug you pointed out
> in the slot wraparound handling. I also moved code around a bit, I think
> this new division of labor between the XLogInsert subroutines is more
> readable.
>
> Thanks for the testing!

Hi Heikki,

Sorry for the week-long radio silence; I haven't been able to find much
time during the week.

I'll try to extract my test case from its quite messy testing harness
and get a self-contained version, but it will probably take a week or
two to do it. I can probably refactor it to rely just on Perl and the
modules DBI, DBD::Pg, IO::Pipe and Storable. Some of those are not core
Perl modules, but they are all common ones. Would that be a good option?

I've tested your v9 patch. I no longer see any inconsistencies or lost
transactions in the recovered database. But occasionally I get databases
that fail to recover at all. It has always been with the exact same
failed assertion, at xlog.c line 2154.

I've only seen this 4 times out of 2202 cycles of crash and recovery, so
it must be some rather obscure situation.

LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/180001B0
LOG: unexpected pageaddr 0/15084000 in log file 0, segment 25, offset 540672
LOG: redo done at 0/19083FD0
LOG: last completed transaction was at log time 2012-02-17 11:13:50.369488-08
LOG: checkpoint starting: end-of-recovery immediate
TRAP: FailedAssertion("!(((((((uint64) (NewPageEndPtr).xlogid * (uint64) (((uint32) 0xffffffff) / ((uint32) (16 * 1024 * 1024))) * ((uint32) (16 * 1024 * 1024))) + (NewPageEndPtr).xrecoff - 1)) / 8192) % (XLogCtl->XLogCacheBlck + 1)) == nextidx)", File: "xlog.c", Line: 2154)
LOG: startup process (PID 5390) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure

Cheers,

Jeff
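
For anyone unpacking the failed assertion: the macro-expanded condition appears to turn NewPageEndPtr into a linear byte position, step back one byte so the position falls inside the page it ends, and map that page number onto a slot in the WAL buffer ring. Below is a minimal sketch of that arithmetic in C, assuming the 16 MB segments and 8192-byte pages visible in the expanded expression. The names RecPtr and page_end_to_buf_idx are hypothetical (they are not taken from xlog.c), and the buffer count is a parameter because the value of XLogCacheBlck is not shown in the log.

```c
#include <assert.h>
#include <stdint.h>

/* Constants as they appear in the expanded assertion text. */
#define SEG_SIZE        ((uint32_t) (16 * 1024 * 1024))        /* 16 MB segment */
#define SEGS_PER_LOGID  (UINT32_C(0xffffffff) / SEG_SIZE)      /* 255 */
#define PAGE_SIZE       8192                                   /* WAL page size */

/* Hypothetical stand-in for the two-field WAL pointer used in the assertion. */
typedef struct
{
    uint32_t xlogid;    /* log file number */
    uint32_t xrecoff;   /* byte offset within that log file */
} RecPtr;

/*
 * Map a page *end* pointer to a buffer slot, mirroring the assertion:
 *   ((xlogid * 255 * 16MB + xrecoff - 1) / 8192) % nbuffers
 * nbuffers corresponds to XLogCacheBlck + 1 (value assumed by the caller).
 */
static uint64_t
page_end_to_buf_idx(RecPtr page_end, uint32_t nbuffers)
{
    uint64_t bytepos;

    assert(nbuffers > 0);

    /* Linear byte position of the last byte on the page. */
    bytepos = (uint64_t) page_end.xlogid * SEGS_PER_LOGID * SEG_SIZE
              + page_end.xrecoff - 1;

    /* Page number, wrapped onto the buffer ring. */
    return (bytepos / PAGE_SIZE) % nbuffers;
}
```

As an illustration only, take the page boundary 0/19084000 (log file 0, segment 25, offset 540672, the location named in the log above; whether this was the actual NewPageEndPtr is not known) and a hypothetical ring of 8 buffers: page_end_to_buf_idx((RecPtr){0, 0x19084000}, 8) yields slot 1. The assertion fires when the slot computed this way differs from nextidx.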