Hello,

Thanks for the response.

No, it's nothing to do with mail configuration; 100% sure of that. (I know people say that all the time, but, seriously, it's the director). And by alerts, I do mean "Messages" in the bacula vernacular.

The first time this crash happened, we received 120,000 Messages in the form of emails to our administrative account. The messages were identical both to each other and to the content of the $JOB.mail file in our bacula working directory (which is never removed automatically after one of these crashes - perhaps that causes the endless cycle). The same Message also appears to be written to our bacula log file each time an email is generated (or vice versa).

It seems to me like it's possible for the director to get stuck in a loop and send the contents of that mail file again and again, infinitely. Both times we've had the SD crash (both have happened since upgrading to 5.0.1), the only thing that stopped the Message generation was stopping the director itself.

Of course, that's the annoying symptom. The more serious problem is the crash of our SD. Any pointers to getting "ptrace" working with the automatic scripts?

thanks!
Stephen

On 04/15/2010 12:40 PM, Kern Sibbald wrote:
> On Thursday 15 April 2010 19:36:51 Stephen Thompson wrote:
>> Additionally, seems like the SD was possibly reading a new
>> freshly-labeled tape when it crashed... Last items in bacula log
>> besides alerts already mentioned:
>
> In Bacula "alerts" refer to tape drive information stored concerning tape
> problems, so I am assuming you mean messages.
>
>>
>>
>> 15-Apr 09:31 server-sd JobId 100000: Writing spooled data to Volume.
>> Despooling 35,000,185,219 bytes ...
>> 15-Apr 09:51 server-sd JobId 100000: End of Volume "FB0568" at 888:1414
>> on device "SL500-Drive-1" (/dev/nst0). Write of 262144 bytes got -1.
>> 15-Apr 09:51 server-sd JobId 100000: Re-read of last block succeeded.
>> 15-Apr 09:51 server-sd JobId 100000: End of medium on Volume "FB0568"
>> Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
>> 15-Apr 09:51 server-sd JobId 100000: 3307 Issuing autochanger "unload
>> slot 38, drive 1" command.
>> 15-Apr 09:52 server-sd JobId 100000: 3301 Issuing autochanger "loaded?
>> drive 1" command.
>> 15-Apr 09:52 server-sd JobId 100000: 3302 Autochanger "loaded? drive 1",
>> result: nothing loaded.
>> 15-Apr 09:52 server-sd JobId 100000: 3304 Issuing autochanger "load slot
>> 39, drive 1" command.
>> 15-Apr 09:52 server-sd JobId 100000: 3305 Autochanger "load slot 39,
>> drive 1", status is OK.
>> 15-Apr 09:52 server-sd JobId 100000: Volume "FB0569" previously written,
>> moving to end of data.
>>
>> Nothing but thousands of 'repetitive' alerts after that...
>
> What exactly is repeated?
>
> There was a Bacula bug #1480 in message delivery that may be the same that you
> are experiencing, it was triggered by a misconfigured SMTP server or by a
> reference in Bacula to a non-existent SMTP server - and the simple solution
> is to make sure Bacula points to a valid functional SMTP server. This
> problem was not particular to version 5.0.1, but I think it was fixed after
> the release of 5.0.1. Please see the bugs database for more details.
>
> Kern
>
>>
>> thanks again,
>> Stephen
>>
>> On 04/15/2010 10:25 AM, Stephen Thompson wrote:
>>> Hello,
>>>
>>> I have just now experienced a possible new bug with bacula 5.0.1.
>>>
>>> The symptoms are this:
>>>
>>> bacula-sd crashes
>>> bacula-dir continues to run
>>> bacula-dir then spews out identical "Intervention needed" emails until
>>> manually restarted
>>>
>>> The first time this happened over a weekend and upon returning I found
>>> my inbox has about 120,000 bacula emails, all the SAME and of this type:
>>>
>>> "15-Apr 10:02 client-fd JobId 100001: Fatal error: backup.c:1048 Network
>>> send error to SD.
ERR=Broken pipe"
>>>
>>> It happened again just now (second time since upgrading from 3.0.3 to
>>> 5.0.1) and I managed to stop the director with only a few thousand
>>> emails going out.
>>>
>>> So there are really 2 issues here:
>>>
>>> 1)
>>> Why does the director apparently get stuck in an infinite loop of
>>> sending the same email message? Is this a known bug?
>>>
>>> 2)
>>> Regarding the SD, I received one alert of this type, the rest like the
>>> above:
>>>
>>> "15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT:
>>> dev->blocked()"
>>>
>>> A traceback like:
>>> --
>>> ptrace: Operation not permitted.
>>> /var/bacula/work/29091: No such file or directory.
>>> $1 = 0
>>> /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command
>>> file: No symbol "exename" in current context.
>>> --
>>>
>>> And a bactrace like:
>>> --
>>> Attempt to dump current JCRs
>>> JCR=0x19a24888 JobId=100000 name=client_1.2010-04-14_18.02.33_41
>>> JobStatus=l use_count=1
>>> JobType=B JobLevel=F
>>> sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
>>> end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
>>> db=(nil) db_batch=(nil) batch_started=0
>>> JCR=0x1981b248 JobId=100001 name=client_10.2010-04-14_20.00.15_04
>>> JobStatus=R
>>> use_count=1
>>> JobType=B JobLevel=I
>>> sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
>>> end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
>>> db=(nil) db_batch=(nil) batch_started=0
>>> Attempt to dump plugins. Hook count=0
>>> --
>>>
>>> Both clients and server seem healthy, except for the SD crash.
>>> Any ideas?
>>>
>>>
>>> thanks!
>>> Stephen
>>>
>>>
>>> -------------------------------------------------------------------------
>>> ------------ Further info:
>>>
>>> My catalog...
>>>
>>> mysql-5.0.77 (64bit) MyISAM
>>> 210Gb in size
>>> 1,412,297,215 records in File table
>>> note: database built with bacula 2x scripts,
>>> upgraded with 3x scripts, then again with 5x scripts
>>> (i.e.
nothing customized along the way)
>>>
>>> My OS & hardware for bacula DIR+SD server...
>>>
>>> Centos 5.4 (fully patched)
>>> 8Gb RAM
>>> 2Gb Swap
>>> 1Tb EXT3 filesystem on external fiber RAID5 array
>>> (dedicated to database, incl. temp files)
>>> 2 dual-core [AMD Opteron(tm) Processor 2220] CPUs
>>> StorageTek SL500 Library with 2 LTO3 Drives
>>>
>>>
>>> -------------------------------------------------------------------------
>>> ----- Download Intel® Parallel Studio Eval
>>> Try the new software tools for yourself. Speed compiling, find bugs
>>> proactively, and fine-tune applications for parallel performance.
>>> See why Intel Parallel Studio got high marks during beta.
>>> http://p.sf.net/sfu/intel-sw-dev
>>> _______________________________________________
>>> Bacula-devel mailing list
>>> bacula-de...@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/bacula-devel
>

--
Stephen Thompson               Berkeley Seismological Laboratory
step...@seismo.berkeley.edu    215 McCone Hall # 4760
404.538.7077 (phone)           University of California, Berkeley
510.643.5811 (fax)             Berkeley, CA 94720-4760

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users