Re: [Bacula-users] False "Intervention needed" flood

Martin Simmons Fri, 27 Sep 2019 05:04:08 -0700

Maybe you can add a RunBeforeJob script that checks if the workstation is
available?  If the script returns a non-zero status then the job will fail and
will not contact the SD at all.
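
A minimal sketch of such a check, assuming the File Daemon listens on the
default port 9102 and that a plain TCP connect is a good enough test of
"available".  The script name, the 5-second timeout and the function names
are made up for illustration, not part of Bacula:

```python
#!/usr/bin/env python3
"""Pre-job reachability check (sketch) intended for use with RunBeforeJob."""
import socket
import sys

FD_PORT = 9102          # default Bacula File Daemon port; adjust if yours differs
TIMEOUT_SECONDS = 5.0   # assumed value, far shorter than the ~10 minute job timeout


def client_reachable(host, port=FD_PORT, timeout=TIMEOUT_SECONDS):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def main(argv):
    """Return 0 if the client answers, 1 otherwise (non-zero fails the job)."""
    host = argv[0]
    if client_reachable(host):
        return 0
    print(f"{host}: File Daemon port {FD_PORT} not reachable", file=sys.stderr)
    return 1


# Run the check only when a host name was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

It could be hooked in with something like
RunBeforeJob = "/usr/local/bin/check_fd.py %h" in the Job or JobDefs resource
(the path is hypothetical; as far as I know %h expands to the client address,
but check the manual for your version).  RunBeforeJob runs on the Director,
so the check tests reachability from the Director's side of the network.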


__Martin


>>>>> On Thu, 26 Sep 2019 09:22:52 -0700, David Brodbeck said:
> 
> Part of the problem is it takes upwards of ten minutes for a job to fail
> when a workstation isn't available -- which is entirely correct, since the
> network connection has to time out. However, the SD reservation is made
> *before* it tries to contact the FD, so I end up with resource starvation
> where jobs that are waiting to time out tie up resources that could be used
> by other jobs. I'm guessing the assumption is that clients will always be
> available, but the SD might be maxed out, so the code assumes it's more
> efficient not to contact a client until the director knows it has the
> resources to actually run the job.
> 
> One option would be to stagger the start times of my jobs so only the
> maximum the SD can handle get launched in any given 10 minute window, but
> that adds a lot of complexity to my configuration, since I currently can
> just allow JobDefs to pull in the schedule for all clients. I'd have to
> define start times individually, and maintain those in order to keep them
> balanced as I add/remove clients. Adding enough disks for the worst case
> isn't going to be possible. (I'm assuming one client per spindle is optimal
> for disk arrays -- maybe that's too conservative?)
> 
> I've just been putting up with the error messages rather than deal with the
> added maintenance of that approach. The extra alert emails can be dealt
> with by filtering my incoming email.
> 
> 
> On Thu, Sep 26, 2019 at 1:28 AM Kern Sibbald <k...@sibbald.com> wrote:
> 
> > Hello,
> >
> > Bacula does already attempt to acquire the needed devices in the SD and
> > then backs them out if all the needed resources cannot be obtained.
> > This works quite nicely.   Consequently, while the job is waiting the
> > resources are released in the SD.
> >
> > The problem occurs because the SD realizes that the resources are not
> > available, so it will wait a short period of time trying again to
> > acquire the resources, which is what one wants for virtually all jobs.
> > When it cannot acquire the resources the SD will fail the job.  The
> > problem occurs because the user is overcommitting the SD resources.
> > The solution is to get more drives or modify how you run jobs.
> >
> > As I understand it, the user in this case has a large number of jobs
> > that regularly fail, and so explicitly overcommits the resources.  The
> > consequence is that Bacula works as it should, but the user gets lots
> > of messages about the SD not being able to get resources.
> >
> > Bacula was designed in a way where it expects to have the needed
> > resources available (i.e. the configuration should be optimized for the
> > available resources).  It also handles the case where you overload the
> > SD (too many jobs for available resources), but in that case it will
> > warn you, which is exactly what 99% of all users want.
> >
> > One possible solution would be to add a new directive that suppresses
> > the reservation failure message.  However there is very likely a better
> > solution with the existing Bacula, I just do not know what it is at this
> > time.  This is the first time in 19 years that this problem has come up,
> > so before changing anything in the code, it has to be very clearly
> > understood, which is not the case (at least for me).
> >
> > Another solution is for the user to modify the source code and remove
> > the warning message.
> >
> > Best regards,
> > Kern
> >
> > On 9/25/19 10:50 AM, Andrea Venturoli wrote:
> > > On 2019-09-25 10:19, Radosław Korzeniewski wrote:
> > >> Hello,
> > >>
> > >> On Sat, 21 Sep 2019 at 00:52, David Brodbeck <brodb...@math.ucsb.edu
> > >> <mailto:brodb...@math.ucsb.edu>> wrote:
> > >>
> > >>     I think this is a somewhat unfortunate design decision, to be
> > >>     honest. (...)
> > >>
> > >>
> > >> So what should be the best design in this case which should solve the
> > >> problem?
> > >
> > > I don't know the code well enough to say for sure.
> > > Maybe rescheduling should release the SD once the job first fails and
> > > reserve again when it starts the next time?
> > >
> > >  bye & Thanks
> > >     av.
> > >
> >
> >
> 
> -- 
> David Brodbeck
> System Administrator, Department of Mathematics
> University of California, Santa Barbara
> 


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
