[Bacula-users] Stuck jobs and Max Run Time

Mark Hazen Tue, 18 Sep 2007 07:27:48 -0700

Hi folks-

We use bacula to back up a dozen servers and about twice that many 
workstations here. The servers are all running 2.2.3; the backup 
server hosts both the director and the storage daemon, on RHEL4. I'm 
using RPMs built from the bacula SRPM, without the GUI tools (which 
won't currently compile as noted in the SRPM, but that doesn't affect us 
here).


Last night one of our workstations started its backup, and just sat 
there. This morning (11 hours later) I could contact the client from 
bconsole, and it stated that it was running the job, but the file/byte 
counts were at zero.

The last entry in the server-side log is the completion of the 
previous job. It should be noted that the client's FD was version 1.38, 
but I am (perhaps mistakenly) under the impression that this should not 
be an issue, unless I were trying to use some of the 2.x only features 
(which I wasn't).

I was a little concerned that a job was 'stuck' for so long with no 
progress, but I can understand why the server didn't consider it 'dead'; 
it was still responding cheerfully, stating that it had a job in 
progress, which never progressed. Chalk it up to perhaps a flaky XP 
client machine in need of a restart.

Upon cancelling the job, however, the pending jobs were stuck with the 
infamous "waiting on max Storage jobs" notice:

Running Jobs:
JobId Level   Name                       Status
======================================================================
  104 Increme  job.xxx.backup.2007-09-17_19.05.15 has been canceled
  105 Increme  job.yyy.2007-09-17_19.05.16 is waiting on max Storage jobs
  106 Increme  job.zzz.2007-09-17_19.05.17 is waiting on max Storage jobs
  ... and so on.

Sometimes in the past, explicitly requesting the storage daemon to 
remount its devices has caused cancelled jobs 'stuck' in this manner to 
release, but not this time. In this case, I received contradictory 
messages from bacula:

*unmount
The defined Storage resources are:
      1: storage.servers
      2: storage.desktops
      3: storage.rescue
Select Storage resource (1-3): 1
3901 Device "device.servers" (/bacula/pools/server) is already unmounted.
*mount
The defined Storage resources are:
      1: storage.servers
      2: storage.desktops
      3: storage.rescue
Select Storage resource (1-3): 1
3906 File device "device.servers" (/bacula/pools/server) is always mounted.
*q

I'm not sure if this qualifies as an issue, but it was a bit of a 
head-scratcher for me. Restarting the daemons cleared the problem, but 
also dropped all of the uncompleted jobs, which I wish it hadn't.

So, I'm adding "Max Run Time" entries to the desktop backup 
configuration, in the JobDefs resource for desktops, but the question 
remains: does this stop the job at the client level or at the server 
level? I'm thinking that stopping it at the client level won't help (as 
far as I can see) with zombie clients, so I just wanted to make sure 
this would indeed resolve our issues when a client goes loopy.
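
For concreteness, the kind of entry in question would look roughly like 
the sketch below. The resource name and the four-hour limit are 
placeholders for illustration, not our actual values, and the rest of 
the resource is omitted:

```
JobDefs {
  Name = "jobdefs.desktops"     # hypothetical name, for illustration only
  Type = Backup
  Max Run Time = 4 hours        # placeholder limit; pick what suits the site
  # ... other directives (Client, FileSet, Storage, Pool, etc.) unchanged
}
```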

Thanks,
-mh.
-- 
Mark Hazen
Systems Support Specialist
The University of Georgia
