I got a simultaneous autovacuum and postmaster crash, then another postmaster crash a few seconds later on restart. After a third restart the system was stable. I haven't found anything obviously matching this defect on-line.
I'm aware there's not an enormous amount of info to go on here - I cannot reproduce the problem - but I'm hoping someone will recognize the defect, and if it's fixed somewhere please tell me where. I couldn't find a match in the bug lists. Here are the details I have... Two crashes happened simultaneously (postmaster + autovacuum process for the same postmaster instance), the third about 5 seconds later when restarting the postmaster. The second attempt to restart the postmaster, several hours later by hand, succeeded. We have three postmasters running, on different ports, serving different databases. Only one, 5432 seems to have been affected. This particular system is a deployment test VM (Virtualbox 4.0.8), which has been stable. It is possible the VM is actually at fault, not PostgreSQL. However, since I don't have a pattern of crashes and nothing is obvious there either, I can neither rule it in nor out. I have not been able to reproduce the problem. After the two crashes, the system recovered, and has not failed again in the last two weeks. I have tried tinkering with the amount of shared memory, loading the system down, etc., but have not reproduced it. I have a couple dozen other systems with identical software configurations but using non-virtualized hardware, all x86-64 of various sorts with 2 to 8 cores, running Centos 6, PostgreSQL 8.4.7. I also have a number of other virtual machines running the same loads. None have reproduced this problem. The virtual machine was lightly loaded. It's running Centos 6, 2.6.32-131, 64 bit, PostgreSQL 8.4.7. libc rpm version is glibc- 2.12-1.7.el6_0.5.x86_64. It has 2 GB of memory. The underlying box (the one running the VM) is an i5 (e.g. amd64 architecture), 2.8 GHz, and has 4 GB of memory. It’s running Ubuntu 10.04, not Centos. I am aware that I am running 8.4.7, not 8.4.10 or 9.x. The postgresql packages came from CentOS: postgresql-contrib-8.4.7-1.el6_0.1.x86_64 postgresql-server-8.4.7-1.el6_0.1.x86_64 postgresql-libs-8.4.7-1.el6_0.1.x86_64 postgresql-8.4.7-1.el6_0.1.x86_64 I don't know what specific queries were running at the time of the failure, since nothing useful was logged either by postmaster or by the OS. Neither logged anything within roughly two hours, other than occasional pgstat wait timeouts and the crash. 2011-12-03 01:42:00.913 UTC,,,20562,,4ed97e63.5052,1,,2011-12-03 01:41:55 UTC,1/0,0,WARNING,01000,"pgstat wait timeout",,,,,,,, 2011-12-03 05:37:56.506 UTC,,,23937,,4ed94d35.5d81,2,,2011-12-02 22:12:05 UTC,,0,LOG,00000,"server process (PID 12604) was terminated by signal 7: Bus error",,,,,,,, 2011-12-03 05:37:56.520 UTC,,,23937,,4ed94d35.5d81,3,,2011-12-02 22:12:05 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,, 2011-12-03 05:37:56.575 UTC,,,23937,,4ed94d35.5d81,4,,2011-12-02 22:12:05 UTC,,0,LOG,00000,"all server processes terminated; reinitializing",,,,,,,, Note the first attempt to restart the postmaster, which also crashed, did not log anything. It appears the system had sufficient memory, normal amounts of shared memory available, etc, but sar samples are not that frequent and these crashes were several minutes after the immediately preceding sample, so if something happened between samples to make shared memory or some other system resource unavailable I'd have trouble finding it out posthumously. Non-default configuration changes: max_connections=200 listen_addresses='localhost,192.168.99.33' shared_buffers=4000 wal_buffers=1024 log_destination='csvlog,syslog' logging_collector='on' log_directory='pg_log' log_filename='postgresql-%a.log' log_truncate_on_rotation = on log_rotation_age = 1d log_rotation_size=0 syslog_facility='LOCAL0' syslog_ident='postgres' log_min_duration_statement=2147483 log_line_prefix='%m:%h:%u:%d\ ' log_autovacuum_min_duration=1000 autovacuum_naptime='2min' datestyle = 'iso, mdy' Here are the two concurrent stack traces, one for the postmaster and one for the autovacuum worker. I don't know which one of the two truly crashed first. fstat tells me the creation times are identical. This is core 12604, crashed at the same time as core 12603 Core was generated by `postgres: autovacuum worker process '. Program terminated with signal 7, Bus error. #0 guc_var_compare (a=0x7fff5fee3658, b=0xc54190) at guc.c:3180 3180 return guc_name_compare(confa->name, confb->name); (gdb) bt #0 guc_var_compare (a=0x7fff5fee3658, b=0xc54190) at guc.c:3180 #1 0x00007f62aa8dc300 in bsearch () from /lib64/libc.so.6 #2 0x00000000006ab49e in find_option (name=0x7bb622 "zero_damaged_pages", create_placeholders=1 '\001', elevel=20) at guc.c:3134 #3 0x00000000006affd5 in set_config_option (name=0x7bb622 "zero_damaged_pages", value=0x6d1879 "false", context=PGC_SUSET, source=PGC_S_OVERRIDE, action=GUC_ACTION_SET, changeVal=1 '\001') at guc.c:4537 #4 0x00000000005b4dee in AutoVacWorkerMain (argc=<value optimized out>, argv=<value optimized out>) at autovacuum.c:1554 #5 0x00000000005b503b in StartAutoVacWorker () at autovacuum.c:1438 #6 0x00000000005be465 in StartAutovacuumWorker (postgres_signal_arg=<value optimized out>) at postmaster.c:4404 #7 sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:4164 #8 <signal handler called> #9 0x00007f62aa982423 in __select_nocancel () from /lib64/libc.so.6 #10 0x00000000005bc522 in ServerLoop () at postmaster.c:1347 #11 0x00000000005bf0ec in PostmasterMain (argc=<value optimized out>,argv=<value optimized out>) at postmaster.c:1040 #12 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188 (gdb) This is 12603 Core was generated by `/usr/bin/postmaster -i -p 5432 -D /db/postgres'. Program terminated with signal 7, Bus error. #0 AllocSetAlloc (context=0xc4ff90, size=293) at aset.c:554 554 if (size > set->allocChunkLimit) (gdb) bt #0 AllocSetAlloc (context=0xc4ff90, size=293) at aset.c:554 #1 0x00000000006b4f61 in MemoryContextAllocZero (context=0xc4ff90, size=293) at mcxt.c:534 #2 0x00000000005bb511 in ProcessStartupPacket (port=0xc71ac0, SSLdone=0 '\000') at postmaster.c:1543 #3 0x00000000005bbef0 in BackendInitialize (port=0xc71ac0) at postmaster.c:3329 #4 0x00000000005bc936 in BackendStartup () at postmaster.c:3078 #5 ServerLoop () at postmaster.c:1387 #6 0x00000000005bf0ec in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1040 #7 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188 (gdb) Then 5 seconds later, on postmaster restart: Core was generated by `/usr/bin/postmaster -i -p 5432 -D /db/postgres' Program terminated with signal 7, Bus error. #0 MemoryContextDeleteChildren (context=0xc501b0) at mcxt.c:214 214 while (context->firstchild != NULL) (gdb) bt #0 MemoryContextDeleteChildren (context=0xc501b0) at mcxt.c:214 #1 0x00000000006b4c79 in MemoryContextDelete (context=0xc501b0) at mcxt.c:169 #2 0x00000000005de838 in InitLocks () at lock.c:343 #3 0x00000000005d7995 in CreateSharedMemoryAndSemaphores (makePrivate=0 '\000', port=5432) at ipci.c:192 #4 0x00000000005bd63b in reset_shared () at postmaster.c:2029 #5 PostmasterStateMachine () at postmaster.c:2933 #6 0x00000000005bda05 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2475 #7 <signal handler called> #8 0x00007f62aa982423 in __select_nocancel () from /lib64/libc.so.6 #9 0x00000000005bc522 in ServerLoop () at postmaster.c:1347 #10 0x00000000005bf0ec in PostmasterMain (argc=<value optimized out>,argv=<value optimized out>) at postmaster.c:1040 #11 0x000000000056a4c0 in main (argc=6, argv=0xc4d3d0) at main.c:188 Thanks -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs