Hi,

18.09.2007 16:25, Mark Hazen wrote:
> Hi folks-
> 
> We use bacula to backup a dozen servers and about twice that many 
> workstations here. The servers are all running the 2.2.3,

Time for another upgrade - 2.2.4 fixes a serious bug.

...
> Last night one of our workstations started its backup, and just sat 
> there. This morning (11 hours later) I could contact the client from 
> bconsole, it stated that it was running the job, but the file/byte 
> counts were at zero.
> 
> The last thing the server side log has listed is the completion of the 
> previous job. It should be noted that the client's FD was version 1.38, 
> but I am (perhaps mistakenly) under the impression that this should not 
> be an issue, unless I were trying to use some of the 2.x only features 
> (which I wasn't).

Right... although nobody can give you a guarantee on that, this 
version mismatch should not matter.

> I was a little concerned that a job was 'stuck' for so long with no 
> progress, but I can understand why the server didn't consider it 'dead'; 
> it was still responding cheerfully, stating that it had a job in 
> progress, which never progressed. Chalk it up to perhaps a flaky XP 
> client machine in need of a restart.
> 
> Upon cancelling the job however, the pending jobs were stuck with the 
> infamous "waiting on max storage jobs" notice:
> 
> Running Jobs:
> JobId Level   Name                       Status
> ======================================================================
>   104 Increme  job.xxx.backup.2007-09-17_19.05.15 has been canceled
>   105 Increme  job.yyy.2007-09-17_19.05.16 is waiting on max Storage jobs
>   106 Increme  job.zzz.2007-09-17_19.05.17 is waiting on max Storage jobs
>   ... and so on.
> 
> Sometimes in the past, explicitly requesting the storage daemon to 
> remount its devices has caused cancelled jobs 'stuck' in this manner to 
> release, but not this time. In this case, I received contradictory 
> messages from bacula:
> 
> *unmount
> The defined Storage resources are:
>       1: storage.servers
>       2: storage.desktops
>       3: storage.rescue
> Select Storage resource (1-3): 1
> 3901 Device "device.servers" (/bacula/pools/server) is already unmounted.
> *mount
> The defined Storage resources are:
>       1: storage.servers
>       2: storage.desktops
>       3: storage.rescue
> Select Storage resource (1-3): 1
> 3906 File device "device.servers" (/bacula/pools/server) is always mounted.
> *q
> 
> I'm not sure if this qualifies as an issue, but it was a bit of a 
> headscratch for me. Restarting the daemons cleared the problem, but also 
> dropped all of the uncompleted jobs, which I wish it hadn't.

This can happen if the FD has lost its connection to the SD. The SD 
still sees an open connection on its side and can't do much except 
wait for the FD.

Even when a job is cancelled, the FD has to contact the SD at job 
termination.

> So, I'm adding "Max Run Time" entries to the desktop backup 
> configuration, in the JobDef block for desktops, but the question 
> exists, does this stop the job at the client level or at the server 
> level?

In your terminology, still at the server level. The DIR does all that 
bookkeeping, so it tells the FD to stop its currently running job.
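
For reference, a minimal sketch of what that could look like in 
bacula-dir.conf. The resource name and the four-hour limit are only 
assumptions for illustration; pick a value that fits your longest 
legitimate desktop backup:

    JobDefs {
      Name = "desktops-defaults"    # hypothetical name, use your own
      # ... your existing directives stay as they are ...
      Max Run Time = 4 hours        # assumed limit, enforced by the DIR
    }

Every Job that refers to this JobDefs resource inherits the limit 
unless it sets its own.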

> I'm thinking that stopping it at the client level won't help (as 
> far as I can see) with zombie clients, so I just wanted to make sure 
> this would indeed resolve our issues when a client goes loopy.

Hmm... difficult question. I'm not sure. It might help to add a 
heartbeat interval to the configuration, so the SD can notice that the 
job is stuck, but I assume the FD might still happily send (or reply 
to) heartbeats even after the job is cancelled.
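
A rough sketch of the heartbeat settings I have in mind. The 
one-minute interval is an assumption, not a recommendation, and you 
should check that your 1.38 FD accepts the directive before relying 
on it:

    # bacula-fd.conf on the client
    FileDaemon {
      # ... existing directives ...
      Heartbeat Interval = 1 minute    # assumed value
    }

    # bacula-sd.conf on the storage server
    Storage {
      # ... existing directives ...
      Heartbeat Interval = 1 minute    # assumed value
    }

Both daemons have to be restarted to pick up the change.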

I guess you'll have to give it a try, which might take some time if 
your original problem can't be reproduced.

Arno

> Thanks,
> -mh.

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
