Hello,
On 12.10.2005 20:27, Luke Dean wrote:
...
> I haven't tried 1.37 yet, but I did try several of your other suggestions.
> I eventually figured out how to run the director inside the debugger,
> get some debugging information, and watch the machine lock up. What I
> saw was that the director tends to run in a loop where it talks to the
> other daemons, and occasionally gets interrupted with a scheduling
> routine. The system freeze would happen whenever the director tried to
> talk to file daemons on multiple machines at the same time.
If this is the case, you should seriously consider upgrading to 1.37.40,
if not for production use, then at least for testing purposes. Kern claims
to have fixed many problems with deadlocks and stuck processes, and I
think he'd like to hear that he fixed your problem, too :-)
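
Regardless of the version, it would help to get a backtrace the next time
it freezes. Here is a rough sketch of running the director under gdb; the
binary and configuration file paths are assumptions, so adjust them to
your installation:

    # Stop the normally started director first, then run it in the
    # foreground under gdb (paths below are assumptions):
    gdb /usr/sbin/bacula-dir
    (gdb) run -f -c /etc/bacula/bacula-dir.conf -d 100
    # ... when it locks up, hit Ctrl-C to drop back into gdb and dump
    # every thread's stack:
    (gdb) thread apply all bt

A backtrace of all threads, taken while the director is actually stuck,
is usually the most useful thing you can attach to a bug report.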
> On a whim, I changed the "Maximum Concurrent Jobs" setting in my
> director configuration from 4 back down to the default of 1. Sometime
> in 2004 on the machine I used to run bacula on, I experimented with
> concurrent jobs and had great success with it, so I didn't think
> anything about keeping the same configuration on this new machine and
> new version of bacula.
> Since I've made that change, I've queued up a lot of full backup jobs,
> and bacula has been chewing through them just great for the last five
> hours now.
> Admittedly the whole server could crash any minute now, and it's likely
> waiting until I send this email just to spite me :-) but right now I'm
> thinking that something changed between 1.36.2 and 1.36.3 that keeps me
> from being able to run more than one job at a time now, or my hardware
> just can't handle it.
Hmm. Well, I know of people who used these versions with multiple
concurrent jobs without serious problems, and I did, too. And if *my*
hardware manages that, yours should, too. (An Intel Pentium 200 MMX with
128 MB RAM.)
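
Just for reference, the directive you mention lives in the Director
resource of bacula-dir.conf. The names and values below are made-up
examples, not recommended settings:

    # Sketch of a Director resource in bacula-dir.conf.
    # Names, paths and numbers are examples only.
    Director {
      Name = backup-dir                  # assumed name
      QueryFile = "/etc/bacula/query.sql"
      WorkingDirectory = "/var/bacula/working"
      PidDirectory = "/var/run"
      Maximum Concurrent Jobs = 1        # your current, stable setting
      Password = "console-password"
      Messages = Daemon
    }

Keep in mind that, if I remember correctly, there are separate Maximum
Concurrent Jobs settings in the Job, Client and Storage resources and in
the SD and FD configurations, and the smallest limit along the path is
the one that counts.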
> Either way, if the system keeps running like
> this, I'm happy. I'll probably just need to reorder my jobs and cut
> down the retry time for those clients that sometimes get turned off at
> night so I don't get stuck waiting on them.
That might help, too. To limit the number of jobs started at the same
instant, you could use a run-before script that first pings the host and
fails if the host is not up (good in combination with "rerun failed
jobs", or whatever that's called), and then waits a random number of
seconds before returning control to the director.
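
Something along these lines would do, as a sketch only - the host name is
a placeholder and the ping options may need adjusting for your platform:

    #!/bin/bash
    # Sketch of a run-before script: fail early when the client is down,
    # otherwise wait a random 0-29 seconds so jobs scheduled for the same
    # minute don't all hit the director at the same instant.
    HOST=client1.example.com    # placeholder, put the real client name here

    # One probe only; ping exits nonzero when the host does not answer.
    if ! ping -c 1 "$HOST" > /dev/null 2>&1; then
        echo "$HOST is not reachable, failing the job early"
        exit 1
    fi

    # $RANDOM is a bash/ksh feature.
    sleep $(( RANDOM % 30 ))
    exit 0

Pointed to from a RunBeforeJob directive in the Job resource, a failed
ping makes the job fail right away instead of sitting in the connection
timeout, and the random sleep spreads out the start times.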
> Thanks for the tips

Arno
--
IT-Service Lehmann [EMAIL PROTECTED]
Arno Lehmann http://www.its-lehmann.de