Just to follow up on this in case others run into the same issue: I was able to rebuild Bacula with the -g compiler option to get some debugging information. The scenario that causes the SD to crash with a segmentation fault is not consistently reproducible, which makes me suspect some kind of race condition. In any event, I was finally able to get a backtrace in gdb, and the crash occurs in the same spot that others have reported in the URLs referenced below, namely in the zlib deflate routine being called from OpenSSL.

The solution, I'm hoping, if you're using TLS, is to turn TLS off for communication between the Director and the Storage Daemon. To do this, comment out all of the TLS options in any Storage definitions in the Director configuration, and in the SD configuration comment out the TLS options in the Director definition only.

In addition, I was also able to set up the Director so that, if the SD does die, it takes care of restarting it and any failed jobs are re-queued (using the Reschedule On Error options).
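For anyone wanting to try the same thing, here is a rough sketch of what the changes might look like. All resource names, addresses, passwords, and certificate paths are made-up placeholders, the retry interval and count are arbitrary, and required directives not relevant here are left out, so treat it as an illustration rather than a drop-in configuration.

```
# --- bacula-dir.conf (Director) ---
# Storage resource: TLS directives commented out so the Director talks
# to the SD without TLS. All values below are placeholders.
Storage {
  Name = ExampleStorage
  Address = sd.example.com
  SDPort = 9103
  Password = "example-sd-password"
  Device = ExampleDevice
  Media Type = File
  # TLS Enable = yes
  # TLS Require = yes
  # TLS CA Certificate File = /path/to/ca.pem
  # TLS Certificate = /path/to/director-cert.pem
  # TLS Key = /path/to/director-key.pem
}

# Job resource: re-queue a job that ends in error.
# Interval and retry count are example values only; the other
# required Job directives (Type, Client, FileSet, etc.) are omitted.
Job {
  Name = "ExampleBackup"
  Reschedule On Error = yes
  Reschedule Interval = 30 minutes
  Reschedule Times = 3
}

# --- bacula-sd.conf (Storage Daemon) ---
# Only the Director resource is changed; any TLS options elsewhere
# in the SD configuration are left as they are.
Director {
  Name = example-dir
  Password = "example-sd-password"
  # TLS Enable = yes
  # TLS Require = yes
  # TLS CA Certificate File = /path/to/ca.pem
  # TLS Certificate = /path/to/sd-cert.pem
  # TLS Key = /path/to/sd-key.pem
}
```

Both daemons need to be restarted (or the Director reloaded) after the change, and the TLS settings have to match on both sides or the Director will not be able to connect to the SD.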
thanks again,
--tom

> Hi,
>>
>> We've been seeing our Bacula Storage Daemon die with a segmentation
>> fault when a client can't be reached for backup. We have two servers
>> and have observed this behavior on both of them. Some searching has
>> revealed that others seem to have (or had) this same issue.
>>
>> https://bugs.launchpad.net/ubuntu/+source/bacula/+bug/622742
>
> That looks similar to some existing bacula bug reports:
>
> http://bugs.bacula.org/view.php?id=1568
> http://bugs.bacula.org/view.php?id=1343
>
>
>> The behavior is not consistent i.e. sometimes it continues on working
>> normally if a client can't be contacted but eventually it'll snag on one
>> and die. In addition, I've now had one of our storage daemons running
>> in the foreground with debugging set to the max and of course, that one
>> has now gone two days without seg faulting even though there have been
>> half a dozen non-responsive clients.
>>
>> We're currently running 5.0.3 built from source for both clients and
>> servers. I'm wondering if anyone else here has experienced this problem
>> and/or has any pointers to a work around. While things can be set up to
>> automatically restart the storage daemon if it dies, the main problem is
>> that any backups Bacula was in the middle of doing end with an error and
>> have to be manually rescheduled/run or just wait until the next time
>> their job comes up to run.

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users