On Tue, 11 Oct 2005, Arno Lehmann wrote:

Hello,

On 11.10.2005 08:35, Luke Dean wrote:


Hello, I just subscribed to the list, though I've been happily using Bacula for about a year now. Last month I ran into my first serious problem, and I'm not sure how to troubleshoot it.

I'd been using version 1.36.2 on an SMP machine running FreeBSD 5.4 (i386 platform) backing up several different machines on a network to a hardware RAID array. It worked great.

Then I decided to put the backup responsibilities on a different machine.
... upgrade to 1.36.3 on single-CPU FreeBSD 5.4 machine

Then the problems started.

Often (nearly always) when I attempted a full backup, the director daemon would (a) silently terminate, (b) cause the system to hang, or (c) reboot the system. There was never anything in the Bacula log, syslog, or the console message log. It doesn't matter if the job starts automatically or manually from bconsole. Likelihood of a problem seems directly proportional to the size of the fileset.

I'll remove the rest of your description - looks like you tried to rule out problems not related to Bacula.

My first impression was that the cause is OS- or hardware-related. After all, a reboot without log entries etc. usually indicates that. Anyway, what you experience might prove hard to analyze.

Concerning Bacula - I understand you are using file storage only and your backups are running rather unreliably right now. I'd suggest upgrading to the current development version (1.37.40) to see if that fixes your problems (I guess it will not, but you never know). There were, as far as I remember, some deadlock problems in the 1.36 versions which should be fixed in 1.37.

An upgrade to 1.37.40 will require a catalog database change, but the configuration can remain (mostly) unchanged. Personally, I consider 1.37 stable since 1.37.3something, although it is not tested as thoroughly as a release version, of course. Anyway, even if this doesn't fix anything for you, you will not lose much considering the current situation :-)

Then it would seem useful to analyze the server crashes, reboots, and hangs.

The first step I'd take is to set up system logging to another host - that can sometimes catch the last log messages before or during a crash.

Then I'd suggest removing the new disk controller - that seems to be the only new hardware that can physically reset or hold your machine. Pull it out and use a test-setup for your backups. For example, set up disk volumes with very short retention times and limited size. Have them automatically recycled, and let some big jobs run on them. Of course, you will not be able to use these backups - they will overwrite their own data - but as far as I know you can (still) do this and it allows testing with limited disk space.
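A test pool along these lines might look roughly like the sketch below, for the Director configuration. The pool name and all values are assumptions chosen for illustration, and the directive names are given as I recall them from the Bacula manual, so check them against the manual for the installed version:

```
# Hypothetical throwaway pool for crash testing (names and values are assumptions)
Pool {
  Name = CrashTest                # hypothetical pool name
  Pool Type = Backup
  Recycle = yes                   # allow purged volumes to be reused
  AutoPrune = yes                 # prune expired jobs/files automatically
  Volume Retention = 2 hours      # very short, so volumes expire quickly
  Maximum Volume Bytes = 500mb    # small volumes, so they fill up and get marked Full
  Maximum Volumes = 4             # cap on total disk space used by the test
}
```

With settings like these a big job should cycle through the few small volumes repeatedly, so the data is not restorable; the only purpose is to generate load on limited disk space.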

Then run Bacula with debug output enabled and capture the output files, which, in case of a crash, might be difficult. An NFS mount with synchronous writes could be one solution for the logging directory. See if you can determine whether Bacula is always doing the same thing when the server crashes.

And, of course, observe the temperature in your server and of your disks. I have an old machine I use as file server, and during normal operation without many accesses the disks report temperatures of more than 50 degrees (Celsius, of course). I wouldn't try to use that setup for high throughput applications...

Arno

I haven't tried 1.37 yet, but I did try several of your other suggestions.
I eventually figured out how to run the director inside the debugger, get some debugging information, and watch the machine lock up. What I saw was that the director tends to run in a loop where it talks to the other daemons, and occasionally gets interrupted with a scheduling routine. The system freeze would happen whenever the director tried to talk to file daemons on multiple machines at the same time.

On a whim, I changed the "Maximum Concurrent Jobs" setting in my director configuration from 4 back down to the default of 1. Sometime in 2004 on the machine I used to run bacula on, I experimented with concurrent jobs and had great success with it, so I didn't think anything about keeping the same configuration on this new machine and new version of bacula.
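For anyone finding this in the archives, the change amounts to something like the following in the Director resource of bacula-dir.conf. This is only a sketch: the resource name is hypothetical and the other directives are left out.

```
Director {
  Name = backup-dir               # hypothetical name
  # other directives unchanged
  Maximum Concurrent Jobs = 1     # previously 4; 1 is the default
}
```

The director has to be restarted, or its configuration reloaded, for the new value to take effect.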

Since I've made that change, I've queued up a lot of full backup jobs, and bacula has been chewing through them just great for the last five hours now.

Admittedly the whole server could crash any minute now, and it's likely waiting until I send this email just to spite me, but right now I'm thinking that something changed between 1.36.2 and 1.36.3 that keeps me from being able to run more than one job at a time now, or my hardware just can't handle it. Either way, if the system keeps running like this, I'm happy. I'll probably just need to reorder my jobs and cut down the retry time for those clients that sometimes get turned off at night so I don't get stuck waiting on them.

Thanks for the tips


