Maybe you can add a RunBeforeJob script that checks if the workstation is available? If the script returns a non-zero status then the job will fail and will not contact the SD at all.
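A rough sketch of what such a script could look like, in Python. It only tests whether a TCP connection to the client's file daemon port can be opened within a few seconds, so an unavailable workstation is detected long before the usual network timeout. The file name check_fd.py, the function name fd_reachable and the 5 second timeout are my own choices for illustration; 9102 is the default FD port, so adjust it if your clients listen elsewhere.

```python
#!/usr/bin/env python3
"""Exit 0 if the client's file daemon port accepts a TCP connection, else 1."""
import socket
import sys

FD_PORT = 9102   # default Bacula FD port; change if your clients differ
TIMEOUT = 5.0    # seconds; arbitrary value picked for this sketch


def fd_reachable(host, port=FD_PORT, timeout=TIMEOUT):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def main(argv):
    """argv is [script, host] or [script, host, port]; returns the exit status."""
    host = argv[1]
    port = int(argv[2]) if len(argv) > 2 else FD_PORT
    return 0 if fd_reachable(host, port) else 1


# Only act when a host was given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

In the Job or JobDefs resource this could be wired up with something like RunBeforeJob = "/usr/local/bin/check_fd.py %h". The path is hypothetical, and I am assuming %h expands to the client address in your Bacula version; if it does not, pass the host name explicitly.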
__Martin


>>>>> On Thu, 26 Sep 2019 09:22:52 -0700, David Brodbeck said:
>
> Part of the problem is it takes upwards of ten minutes for a job to fail
> when a workstation isn't available -- which is entirely correct, since the
> network connection has to time out. However, the SD reservation is made
> *before* it tries to contact the FD, so I end up with resource starvation
> where jobs that are waiting to time out tie up resources that could be used
> by other jobs. I'm guessing the assumption is that clients will always be
> available, but the SD might be maxed out, so the code assumes it's more
> efficient not to contact a client until the director knows it has the
> resources to actually run the job.
>
> One option would be to stagger the start times of my jobs so only the
> maximum the SD can handle get launched in any given 10 minute window, but
> that adds a lot of complexity to my configuration, since I currently can
> just allow JobDefs to pull in the schedule for all clients. I'd have to
> define start times individually, and maintain those in order to keep them
> balanced as I add/remove clients. Adding enough disks for the worst case
> isn't going to be possible. (I'm assuming one client per spindle is optimal
> for disk arrays -- maybe that's too conservative?)
>
> I've just been putting up with the error messages rather than deal with the
> added maintenance of that approach. The extra alert emails can be dealt
> with by filtering my incoming email.
>
>
> On Thu, Sep 26, 2019 at 1:28 AM Kern Sibbald <k...@sibbald.com> wrote:
>
> > Hello,
> >
> > Bacula does already attempt to acquire the needed devices in the SD and
> > then backs them out if all the needed resources cannot be obtained.
> > This works quite nicely. Consequently, while the job is waiting the
> > resources are released in the SD.
> >
> > The problem occurs because the SD realizes that the resources are not
> > available, so it will wait a short period of time trying again to
> > acquire the resources, which is what one wants for virtually all jobs.
> > When it cannot acquire the resources the SD will fail the job. The
> > problem occurs because the user is overcommitting the SD resources.
> > The solution is to get more drives or modify how you run jobs.
> >
> > From what I understand, in this case the user has a large number
> > of jobs that regularly fail and thus the user explicitly overcommits
> > the resources. The consequence is that Bacula works as it should but the
> > user gets lots of messages about the SD not being able to get resources.
> >
> > Bacula was designed in a way where it expects to have the needed
> > resources available (i.e. the configuration should be optimized for the
> > available resources). It also handles the case where you overload the
> > SD (too many jobs for available resources), but in that case it will
> > warn you, which is exactly what 99% of all users want.
> >
> > One possible solution would be to add a new directive that suppresses
> > the reservation failure message. However there is very likely a better
> > solution with the existing Bacula, I just do not know what it is at this
> > time. This is the first time in 19 years that this problem has come up,
> > so before changing anything in the code, it has to be very clearly
> > understood, which is not the case (at least for me).
> >
> > Another solution is for the user to modify the source code and remove
> > the warning message.
> >
> > Best regards,
> > Kern
> >
> > On 9/25/19 10:50 AM, Andrea Venturoli wrote:
> > > On 2019-09-25 10:19, Radosław Korzeniewski wrote:
> > >> Hello,
> > >>
> > >> Sat, 21 Sep 2019 at 00:52 David Brodbeck <brodb...@math.ucsb.edu>
> > >> wrote:
> > >>
> > >>     I think this is a somewhat unfortunate design decision, to be
> > >>     honest. (...)
> > >>
> > >>
> > >> So what should be the best design in this case which should solve the
> > >> problem?
> > >
> > > I'm not so into the code to tell for sure.
> > > Maybe rescheduling should release the SD once the job first fails and
> > > reserve again when it starts the next time?
> > >
> > > bye & Thanks
> > > av.
> > >
> >
> >
>
> --
> David Brodbeck
> System Administrator, Department of Mathematics
> University of California, Santa Barbara