Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> On 05/30/2011 10:29 PM, Mathew Samuel wrote:
> 
>> 2011-03-28 10:44:28 UTC3609HINT: Consider increasing the configuration 
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:38 UTC3609LOG: checkpoints are occurring too 
>> frequently (10 seconds apart)
>> 2011-03-28 10:44:38 UTC3609HINT: Consider increasing the configuration 
>> parameter "checkpoint_segments".
>> 2011-03-28 10:44:42 UTC3932ERROR: canceling statement due to statement 
>> timeout
>> 2011-03-28 10:44:42 UTC3932STATEMENT: vacuum full analyze 
>> _zamboni.sl_log_1
>> 2011-03-28 10:44:42 UTC3932PANIC: cannot abort transaction 1827110275, 
>> it was already committed
>> 2011-03-28 10:44:42 UTC3566LOG: server process (PID 3932) was 
>> terminated by signal 6
>
> Interesting. It almost looks like a VACUUM FULL ANALYZE was cancelled by 
> statement_timeout, couldn't be aborted (assuming it was in fact
> 1827110275) and then the backend crashed with a signal 6 (SIGABRT). 
> SIGABRT can be caused by an assertion failure, certain fatal aborts in the C 
> library caused by memory allocation errors, etc.
>
> Alas, while PostgreSQL may have dumped a core file I doubt there's any debug 
> information in your build. If you do find a core file for that process ID, it 
> might be worth checking for a debuginfo rpm just in case.
>
>> In fact those last 3 lines are repeated over and over until 
>> "UTC4115FATAL: the database system is in recovery mode" is logged, and 
>> that continues for 4 hours. At some point, roughly 4 hours later, the 
>> system appears to recover.
>
> Wow. Four hours recovery with default checkpoint settings.
>
> Is it possible that the server was completely overloaded and was swapping 
> heavily? That could explain why VACUUM timed out in the first place, and 
> would explain why it took such a long time to recover. Check your system logs 
> around the same time for other indications of excessive load, and check your 
> monitoring history if you have something like Cacti active.
>
> See if there's anything interesting in the kernel logs too.
>
> Just for completeness, can you send all non-commented-out, non-blank lines in 
> your postgresql.conf?
>
> $ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1
>
> --
> Craig Ringer

Thanks for your help. Here are the results from running the command you provided:
$ egrep '^[^#[:space:]]' postgresql.conf |cut -d '#' -f 1

listen_addresses='localhost'
port=5432
max_connections=200
ssl=off
shared_buffers = 24MB
max_fsm_pages = 153600
log_line_prefix='%t%p'
statement_timeout=60000
datestyle = 'iso, mdy'
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'
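
I notice checkpoint_segments isn't set in there at all, so it must still be at 
the default of 3. I assume the hint in the log means bumping it to something 
like the following (the values below are just a guess on my part, not anything 
we've tested):

checkpoint_segments = 16	# default is 3; WAL segments between checkpoints
checkpoint_timeout = 5min	# unchanged default, shown only for context

followed by a pg_ctl reload (I believe that parameter doesn't need a full 
restart). Does that look like a sensible starting point?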

And as you anticipated, no core file was dumped on that system.
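
(For future reference, if a core file does turn up after another crash, I 
assume the way to get something useful out of it - with the debuginfo package 
installed - would be along these lines, where the paths are just placeholders:

$ gdb /usr/bin/postgres /path/to/core.3932
(gdb) bt

and then send you the backtrace.)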

It is quite possible that the system was under heavy load at the time, as it is 
a heavily used system prone to periods of high activity. No monitoring (Cacti 
or otherwise) was active, as far as I can gather.

Not that this is any sort of solution, but would one approach be to increase 
statement_timeout to something greater than 1 minute? I realize that may just 
hide the symptom, but if the customer hits this issue again I'd like to know 
whether there is a setting I can change to help prevent a recurrence.
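
Or, rather than raising it globally, would it make more sense to exempt just 
the user that runs that vacuum from the timeout? Something along these lines 
(the role name "slony" here is only a guess at whatever user the replication 
jobs connect as):

$ psql -c "ALTER ROLE slony SET statement_timeout = 0;"

My understanding is that 0 disables the timeout for that role only, leaving the 
1-minute limit in place for everything else.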

Cheers,
Matt

