Part of the problem is that it takes upwards of ten minutes for a job to
fail when a workstation isn't available, which is entirely correct, since
the network connection has to time out. However, the SD reservation is made
*before* Bacula tries to contact the FD, so I end up with resource
starvation: jobs that are waiting to time out tie up resources that other
jobs could be using. I'm guessing the design assumes that clients will
always be available but the SD might be maxed out, in which case it's more
efficient not to contact a client until the director knows it has the
resources to actually run the job.
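
To make the starvation concrete, here is a toy model in Python. It is not
Bacula's code: the function names, the two-slot SD, and the job lengths are
all made up for illustration, and it only counts the minutes during which
SD slots are held.

```python
import heapq

def minutes_until_done(slot_minutes, slots):
    """Start jobs in order on `slots` parallel slots; return when the last one ends."""
    free_at = [0] * slots              # minute at which each slot next becomes free
    heapq.heapify(free_at)
    last_end = 0
    for minutes in slot_minutes:
        start = heapq.heappop(free_at)  # earliest slot to free up
        end = start + minutes
        heapq.heappush(free_at, end)
        last_end = max(last_end, end)
    return last_end

def reserve_first(jobs, slots, timeout=10):
    # Every job takes a slot; an unreachable client holds it until the timeout.
    held = [run if reachable else timeout for reachable, run in jobs]
    return minutes_until_done(held, slots)

def contact_first(jobs, slots):
    # Hypothetical alternative: unreachable clients never take a slot.
    held = [run for reachable, run in jobs if reachable]
    return minutes_until_done(held, slots)

# (reachable, run_minutes): two dead workstations queued ahead of two live ones.
jobs = [(False, 0), (False, 0), (True, 5), (True, 5)]
print(reserve_first(jobs, slots=2))   # 15
print(contact_first(jobs, slots=2))   # 5
```

With two slots, the two dead workstations hold both of them for the full
ten minutes, so the two five-minute jobs don't finish until minute 15
instead of minute 5.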

One option would be to stagger the start times of my jobs so that no more
jobs than the SD can handle get launched in any given 10-minute window, but
that adds a lot of complexity to my configuration, since I can currently
just let JobDefs pull in the schedule for all clients. I'd have to define
start times individually, and maintain them to keep things balanced as I
add and remove clients. Adding enough disks for the worst case isn't going
to be possible. (I'm assuming one client per spindle is optimal for disk
arrays -- maybe that's too conservative?)
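
For what it's worth, the staggering itself is easy to compute; it's keeping
the result in sync with the configuration that's the chore. A rough sketch
in Python, where the function name, the client names, and the 23:00 first
start are placeholders of mine and nothing here is Bacula syntax:

```python
from datetime import datetime, timedelta

def stagger_start_times(clients, max_concurrent, first_start="23:00",
                        window_minutes=10):
    """Give each client a start time so that at most `max_concurrent`
    jobs are launched in any one window."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    base = datetime.strptime(first_start, "%H:%M")
    schedule = {}
    for index, client in enumerate(clients):
        slot = index // max_concurrent   # which window this client lands in
        start = base + timedelta(minutes=slot * window_minutes)
        schedule[client] = start.strftime("%H:%M")
    return schedule

# Placeholder client names; max_concurrent stands for what the SD can handle.
for client, start in stagger_start_times(
        ["ws01", "ws02", "ws03", "ws04", "ws05"], max_concurrent=2).items():
    print(client, start)
```

That puts ws01 and ws02 at 23:00, ws03 and ws04 at 23:10, and ws05 at
23:20. Every client added or removed shifts the ones after it, which is
exactly the maintenance I'd like to avoid.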

I've just been putting up with the error messages rather than deal with the
added maintenance of that approach. The extra alert emails can be dealt
with by filtering my incoming email.


On Thu, Sep 26, 2019 at 1:28 AM Kern Sibbald <k...@sibbald.com> wrote:

> Hello,
>
> Bacula does already attempt to acquire the needed devices in the SD and
> then backs them out if all the needed resources cannot be obtained.
> This works quite nicely.  Consequently, while the job is waiting the
> resources are released in the SD.
>
> The problem occurs because the SD realizes that the resources are not
> available, so it will wait a short period of time trying again to
> acquire the resources, which is what one wants for virtually all jobs.
> When it cannot acquire the resources the SD will fail the job.  The
> problem occurs because the user is overcommitting the SD resources.
> The solution is to get more drives or modify how you run jobs.
>
> From what I understand, in this case the user has a large number
> of jobs that regularly fail and thus the user explicitly overcommits
> the resources.  The consequence is that Bacula works as it should but the
> user gets lots of messages about the SD not being able to get resources.
>
> Bacula was designed in a way where it expects to have the needed
> resources available (i.e. the configuration should be optimized for the
> available resources).  It also handles the case where you overload the
> SD (too many jobs for available resources), but in that case it will
> warn you, which is exactly what 99% of all users want.
>
> One possible solution would be to add a new directive that suppresses
> the reservation failure message.  However, there is very likely a better
> solution with the existing Bacula; I just do not know what it is at this
> time.  This is the first time in 19 years that this problem has come up,
> so before changing anything in the code, it has to be very clearly
> understood, which is not the case (at least for me).
>
> Another solution is for the user to modify the source code and remove
> the warning message.
>
> Best regards,
> Kern
>
> On 9/25/19 10:50 AM, Andrea Venturoli wrote:
> > On 2019-09-25 10:19, Radosław Korzeniewski wrote:
> >> Hello,
> >>
> >> On Sat, 21 Sep 2019 at 00:52, David Brodbeck <brodb...@math.ucsb.edu
> >> <mailto:brodb...@math.ucsb.edu>> wrote:
> >>
> >>     I think this is a somewhat unfortunate design decision, to be
> >>     honest. (...)
> >>
> >>
> >> So what should be the best design in this case which should solve the
> >> problem?
> >
> > I'm not familiar enough with the code to tell for sure.
> > Maybe rescheduling should release the SD once the job first fails and
> > reserve again when it starts the next time?
> >
> >  bye & Thanks
> >     av.
> >
>
>

-- 
David Brodbeck
System Administrator, Department of Mathematics
University of California, Santa Barbara
