[BUGS] simultaneous autovacuum and postmaster crash

lsq Wed, 28 Dec 2011 05:53:50 -0800

I got a simultaneous autovacuum and postmaster crash, then another 
postmaster crash a few seconds later on restart.  After a third 
restart the system was stable.  I haven't found anything obviously 
matching this defect on-line.


I'm aware there's not an enormous amount of info to go on here - I 
cannot reproduce the problem - but I'm hoping someone will 
recognize the defect, and if it's fixed somewhere please tell me 
where.  I couldn't find a match in the bug lists.

Here are the details I have...

Two crashes happened simultaneously (postmaster + autovacuum 
process for the same postmaster instance), the third about 5 
seconds later when restarting the postmaster.  The second attempt 
to restart the postmaster, several hours later by hand, succeeded.  
We have three postmasters running, on different ports, serving 
different databases. Only one, 5432 seems to have been affected. 

This particular system is a deployment test VM (Virtualbox 4.0.8), 
which has been stable. It is possible the VM is actually at fault, 
not PostgreSQL.  However, since I don't have a pattern of crashes 
and nothing is obvious there either, I can neither rule it in nor 
out.

I have not been able to reproduce the problem.  After the two 
crashes, the system recovered, and has not failed again in the last 
two weeks.  I have tried tinkering with the amount of shared 
memory, loading the system down, etc., but have not reproduced it.  
I have a couple dozen other systems with identical software 
configurations but using non-virtualized hardware, all x86-64 of 
various sorts with 2 to 8 cores, running Centos 6, PostgreSQL 
8.4.7.  I also have a number of other virtual machines running the 
same loads.  None have reproduced this problem.  

The virtual machine was lightly loaded. It's running Centos 6, 
2.6.32-131, 64 bit, PostgreSQL 8.4.7. libc rpm version is glibc-
2.12-1.7.el6_0.5.x86_64.  It has 2 GB of memory. The underlying box 
(the one running the VM) is an i5 (e.g. amd64 architecture), 2.8 
GHz, and has 4 GB of memory. It’s running Ubuntu 10.04, not Centos. 

I am aware that I am running 8.4.7, not 8.4.10 or 9.x.

The postgresql packages came from CentOS:
postgresql-contrib-8.4.7-1.el6_0.1.x86_64
postgresql-server-8.4.7-1.el6_0.1.x86_64
postgresql-libs-8.4.7-1.el6_0.1.x86_64
postgresql-8.4.7-1.el6_0.1.x86_64

I don't know what specific queries were running at the time of the 
failure, since nothing useful was logged either by postmaster or by 
the OS.  Neither logged anything within roughly two hours, other 
than occasional pgstat wait timeouts and the crash.

2011-12-03 01:42:00.913 UTC,,,20562,,4ed97e63.5052,1,,2011-12-03 
01:41:55 UTC,1/0,0,WARNING,01000,"pgstat wait timeout",,,,,,,,
2011-12-03 05:37:56.506 UTC,,,23937,,4ed94d35.5d81,2,,2011-12-02 
22:12:05 UTC,,0,LOG,00000,"server process (PID 12604) was 
terminated by signal 7: Bus error",,,,,,,,
2011-12-03 05:37:56.520 UTC,,,23937,,4ed94d35.5d81,3,,2011-12-02 
22:12:05 UTC,,0,LOG,00000,"terminating any other active server 
processes",,,,,,,,
2011-12-03 05:37:56.575 UTC,,,23937,,4ed94d35.5d81,4,,2011-12-02 
22:12:05 UTC,,0,LOG,00000,"all server processes terminated; 
reinitializing",,,,,,,,

Note the first attempt to restart the postmaster, which also 
crashed, did not log anything.  

It appears the system had sufficient memory, normal amounts of 
shared memory available, etc, but sar samples are not that frequent 
and these crashes were several minutes after the immediately 
preceding sample, so if something happened between samples to make 
shared memory or some other system resource unavailable I'd have 
trouble finding it out posthumously.   

Non-default configuration changes:
max_connections=200
listen_addresses='localhost,192.168.99.33'
shared_buffers=4000
wal_buffers=1024
log_destination='csvlog,syslog'
logging_collector='on'
log_directory='pg_log'
log_filename='postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size=0
syslog_facility='LOCAL0'
syslog_ident='postgres'
log_min_duration_statement=2147483
log_line_prefix='%m:%h:%u:%d\ '
log_autovacuum_min_duration=1000
autovacuum_naptime='2min'
datestyle = 'iso, mdy'

Here are the two concurrent stack traces, one for the postmaster 
and one for the autovacuum worker. I don't know which one of the 
two truly crashed first. fstat tells me the creation times are 
identical.

This is core 12604, crashed at the same time as core 12603
Core was generated by `postgres: autovacuum worker process        '.
Program terminated with signal 7, Bus error.
#0 guc_var_compare (a=0x7fff5fee3658, b=0xc54190) at guc.c:3180
3180 return guc_name_compare(confa->name, confb->name);

(gdb) bt
#0 guc_var_compare (a=0x7fff5fee3658, b=0xc54190) at guc.c:3180
#1 0x00007f62aa8dc300 in bsearch () from /lib64/libc.so.6
#2 0x00000000006ab49e in find_option (name=0x7bb622 
"zero_damaged_pages", create_placeholders=1 '\001', elevel=20) at 
guc.c:3134
#3 0x00000000006affd5 in set_config_option (name=0x7bb622 
"zero_damaged_pages", value=0x6d1879 "false", context=PGC_SUSET, 
source=PGC_S_OVERRIDE, action=GUC_ACTION_SET, changeVal=1 '\001') 
at guc.c:4537
#4 0x00000000005b4dee in AutoVacWorkerMain (argc=<value optimized 
out>, argv=<value optimized out>) at autovacuum.c:1554
#5 0x00000000005b503b in StartAutoVacWorker () at autovacuum.c:1438
#6 0x00000000005be465 in StartAutovacuumWorker 
(postgres_signal_arg=<value optimized out>) at postmaster.c:4404
#7 sigusr1_handler (postgres_signal_arg=<value optimized out>) at 
postmaster.c:4164
#8 <signal handler called>
#9 0x00007f62aa982423 in __select_nocancel () from /lib64/libc.so.6
#10 0x00000000005bc522 in ServerLoop () at postmaster.c:1347
#11 0x00000000005bf0ec in PostmasterMain (argc=<value optimized 
out>,argv=<value optimized out>) at postmaster.c:1040
#12 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188
(gdb)

This is 12603
Core was generated by `/usr/bin/postmaster -i -p 5432 -D 
/db/postgres'.
Program terminated with signal 7, Bus error.
#0 AllocSetAlloc (context=0xc4ff90, size=293) at aset.c:554
554 if (size > set->allocChunkLimit)
(gdb) bt
#0 AllocSetAlloc (context=0xc4ff90, size=293) at aset.c:554
#1 0x00000000006b4f61 in MemoryContextAllocZero (context=0xc4ff90, 
size=293) at mcxt.c:534
#2 0x00000000005bb511 in ProcessStartupPacket (port=0xc71ac0, 
SSLdone=0 '\000') at postmaster.c:1543
#3 0x00000000005bbef0 in BackendInitialize (port=0xc71ac0) at 
postmaster.c:3329
#4 0x00000000005bc936 in BackendStartup () at postmaster.c:3078
#5 ServerLoop () at postmaster.c:1387
#6 0x00000000005bf0ec in PostmasterMain (argc=<value optimized 
out>, argv=<value optimized out>) at postmaster.c:1040
#7 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188
(gdb)

Then 5 seconds later, on postmaster restart:
Core was generated by `/usr/bin/postmaster -i -p 5432 -D 
/db/postgres'
Program terminated with signal 7, Bus error.
#0 MemoryContextDeleteChildren (context=0xc501b0) at mcxt.c:214
214 while (context->firstchild != NULL)

(gdb) bt
#0 MemoryContextDeleteChildren (context=0xc501b0) at mcxt.c:214
#1  0x00000000006b4c79 in MemoryContextDelete (context=0xc501b0) at 
mcxt.c:169
#2 0x00000000005de838 in InitLocks () at lock.c:343
#3 0x00000000005d7995 in CreateSharedMemoryAndSemaphores 
(makePrivate=0 '\000', port=5432) at ipci.c:192
#4 0x00000000005bd63b in reset_shared () at postmaster.c:2029
#5 PostmasterStateMachine () at postmaster.c:2933
#6 0x00000000005bda05 in reaper (postgres_signal_arg=<value 
optimized out>) at postmaster.c:2475
#7 <signal handler called>
#8 0x00007f62aa982423 in __select_nocancel () from /lib64/libc.so.6
#9 0x00000000005bc522 in ServerLoop () at postmaster.c:1347
#10 0x00000000005bf0ec in PostmasterMain (argc=<value optimized 
out>,argv=<value optimized out>) at postmaster.c:1040
#11 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188

Thanks


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

[BUGS] simultaneous autovacuum and postmaster crash

Reply via email to