Hello,

On 12.10.2005 20:27, Luke Dean wrote:

...
I haven't tried 1.37 yet, but I did try several of your other suggestions.
I eventually figured out how to run the director inside the debugger, get some debugging information, and watch the machine lock up. What I saw was that the director tends to run in a loop where it talks to the other daemons, occasionally interrupted by a scheduling routine. The system freeze would happen whenever the director tried to talk to file daemons on multiple machines at the same time.
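(In case it helps anyone else trying this: "running the director inside the debugger" amounted to roughly the following. The binary path is just an example from my setup, adjust it to yours.

  gdb /usr/local/sbin/bacula-dir `pidof bacula-dir`    <- attach to the running director
  (gdb) continue
  ... reproduce the hang, then interrupt gdb with Ctrl-C ...
  (gdb) thread apply all bt                            <- backtrace of every director thread
  (gdb) detach
)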

If this is the case, you should seriously consider upgrading to 1.37.40, if not for production use, then at least for testing purposes. Kern claims to have fixed many problems with deadlocks and stuck processes, and I think he'd like to hear that he fixed your problem, too :-)

On a whim, I changed the "Maximum Concurrent Jobs" setting in my director configuration from 4 back down to the default of 1. Sometime in 2004, on the machine I used to run bacula on, I experimented with concurrent jobs and had great success with it, so I didn't think twice about keeping the same configuration on this new machine and new version of bacula.
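For reference, the change was just this one directive in the Director resource of bacula-dir.conf -- the resource name here is made up, and if I remember right the same directive also exists in the Job, Client and Storage resources:

  Director {
    Name = backup-dir            # made-up name
    ...
    Maximum Concurrent Jobs = 1  # was 4 before
  }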

Since making that change, I've queued up a lot of full backup jobs, and bacula has been chewing through them just fine for the last five hours.

Admittedly the whole server could crash any minute now, and it's probably waiting until I send this email just to spite me :-) but right now I'm thinking that either something changed between 1.36.2 and 1.36.3 that keeps me from running more than one job at a time, or my hardware just can't handle it.

Hmm. Well, I know of people who used these versions with multiple concurrent jobs without serious problems, and I did, too. And if *my* hardware manages that, yours should, too. (iP200MMX, 128MB)

Either way, if the system keeps running like this, I'm happy. I'll probably just need to reorder my jobs and cut down the retry time for those clients that sometimes get turned off at night so I don't get stuck waiting on them.

That might help, too. To limit the number of jobs started at the same instant, you could use a run-before script that first pings the host and fails if the host is not up (good in combination with "rerun failed jobs", or whatever that option is called) and then waits a random number of seconds before returning to the director.
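Here's a minimal sketch of what I mean, assuming Python is available where the director runs -- the host name and maximum delay are just placeholders, and as far as I remember the director treats a non-zero exit status from a run-before command as a job failure:

  #!/usr/bin/env python
  # Hypothetical run-before script: fail fast if the client is down,
  # otherwise wait a random number of seconds so that jobs scheduled
  # together don't all start at the same instant.
  import random
  import subprocess
  import sys
  import time

  HOST = "client.example.com"   # placeholder: the client to back up
  MAX_WAIT = 30                 # placeholder: maximum random delay in seconds

  # One ICMP echo request; adjust the flags for non-Linux ping.
  if subprocess.call(["ping", "-c", "1", "-w", "5", HOST]) != 0:
      sys.exit(1)               # non-zero exit: let the director fail the job

  time.sleep(random.randint(0, MAX_WAIT))
  sys.exit(0)

You'd point the Job's run-before directive at that script, one copy per client or with the host name passed as an argument.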

Arno

Thanks for the tips


--
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de


