Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> On 05/30/2011 10:29 PM, Mathew Samuel wrote:
>
>> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too
>> frequently (10 seconds apart)
>> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement
>> timeout
>> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze
>> _zamboni.sl_log_1
>> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275,
>> it was already committed
>> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was
>> terminated by signal 6
>
> Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by
> statement_timeout, couldn't be aborted (assuming it was in fact
> 1827110275), and then the backend crashed with a signal 6 (SIGABRT).
> SIGABRT can be caused by an assertion failure, certain fatal aborts in
> the C library caused by memory allocation errors, etc.
>
> Alas, while PostgreSQL may have dumped a core file, I doubt there's any
> debug information in your build. If you do find a core file for that
> process ID, it might be worth checking for a debuginfo rpm just in case.
>
>> In fact those last 3 lines are repeated over and over again until
>> "UTC4115FATAL: the database system is in recovery mode" is logged for
>> 4 hours. At some point, 4 hours later of course, it appears that the
>> system recovers.
>
> Wow. Four hours of recovery with default checkpoint settings.
>
> Is it possible that the server was completely overloaded and was
> swapping heavily? That could explain why VACUUM timed out in the first
> place, and would explain why it took such a long time to recover. Check
> your system logs around the same time for other indications of
> excessive load, and check your monitoring history if you have
> monitoring like Cacti or the like active.
>
> See if there's anything interesting in the kernel logs too.
>
> Just for completeness, can you send all non-commented-out, non-blank
> lines in your postgresql.conf ?
>
> $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1
>
> --
> Craig Ringer
Thanks for your help. Here are the results of running the command you
provided:

$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1
listen_addresses='localhost'
port=5432
max_connections=200
ssl=off
shared_buffers = 24MB
max_fsm_pages = 153600
log_line_prefix='%t%p'
statement_timeout=60000
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

And as you anticipated, no core file was dumped on that system.

It is quite possible that the system was under heavy load at the time,
as it is a heavily used server that goes through periods of intense
activity. No monitoring (Cacti or otherwise) was active as far as I can
gather.

Not that this is any sort of solution, but would one approach be to
increase statement_timeout to something greater than 1 minute? I realize
that this may just hide the symptom, but if the customer does hit this
issue again I'm just looking for an option I can change to help prevent
a recurrence (a rough sketch of what I mean is below).

Cheers,
Matt
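
P.S. To make that concrete, here is a rough sketch of the kind of change
I have in mind. The specific values, and the idea of overriding the
timeout only for the maintenance session rather than globally, are just
illustrative -- I haven't tested any of this on the affected system:

  -- In whatever session issues the vacuum, lift the global 60-second
  -- limit for that connection only (0 disables the timeout entirely):
  SET statement_timeout = 0;
  VACUUM FULL ANALYZE _zamboni.sl_log_1;

  -- Or, per the HINT in the log, space checkpoints further apart by
  -- raising this in postgresql.conf (takes effect on reload; 16 is just
  -- an example value):
  --   checkpoint_segments = 16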