Am I correct in assuming that the watchdog timer is killing my jobs?

I have to constrain my backup jobs to less than 15 MBps with traffic 
shaping in this environment to avoid bandwidth contention, so even small 
jobs take a long time for bacula-fd to copy over.

Assuming I am correct that the watchdog timer is to blame, what's the 
workaround?  The traffic shaping is a hard client requirement.


23-Dec 14:02 bacula-dir JobId 60: Using Device "Drive0" to write.
23-Dec 14:02 Scalar-i40 JobId 60: Spooling data ...
29-Dec 14:02 bacula-dir JobId 60: Error: Watchdog sending kill after 518417 secs to thread stalled reading Storage daemon.
29-Dec 14:02 bacula-dir JobId 60: Error: Watchdog sending kill after 518417 secs to thread stalled reading File daemon.
29-Dec 14:02 bacula-dir JobId 60: Fatal error: Network error with FD during Backup: ERR=Interrupted system call
29-Dec 14:02 bacula-dir JobId 60: Error: Director's comm line to SD dropped.
29-Dec 14:02 bacula-dir JobId 60: Fatal error: No Job status returned from FD.
29-Dec 14:02 bacula-dir JobId 60: Error: Bacula bacula-dir 5.2.13 (19Jan13):
   Build OS:               x86_64-redhat-linux-gnu redhat (Core)
   JobId:                  60
   Job:                    GCC_archive-everyone-else.2019-12-23_14.02.07_03
   Backup Level:           Full
   Client:                 "bock" 5.2.13 (19Jan13) x86_64-redhat-linux-gnu,redhat,(Core)
   FileSet:                "GCC_archive-everyone-else-fileset" 2019-12-23 13:59:50
   Pool:                   "lto6-pool" (From Job resource)
   Catalog:                "GACCatalog" (From Client resource)
   Storage:                "Scalar-i40" (From Pool resource)
   Scheduled time:         23-Dec-2019 14:02:05
   Start time:             23-Dec-2019 14:02:09
   End time:               29-Dec-2019 14:02:26
   Elapsed time:           6 days 17 secs
   Priority:               12
   FD Files Written:       0
   SD Files Written:       0
   FD Bytes Written:       0 (0 B)
   SD Bytes Written:       0 (0 B)
   Rate:                   0.0 KB/s
   Software Compression:   None
   VSS:                    no
   Encryption:             no
   Accurate:               no
   Volume name(s):
   Volume Session Id:      6
   Volume Session Time:    1577127434
   Last Volume Bytes:      387,072 (387.0 KB)
   Non-fatal FD errors:    3
   SD Errors:              0
   FD termination status:  Error
   SD termination status:  Error
   Termination:            *** Backup Error ***



[root@bock Job]# cat GCC_archive-everyone-else-job.conf
#JN job file for Bock
#----------------------------------
Job {
   Name = GCC_archive-everyone-else
   Type = Backup
   Client = bock
   Schedule = ManualOnly
   Messages = Daemon
   FileSet = GCC_archive-everyone-else-fileset
   Level = Full
   Pool = lto6-pool
   Priority = 12
   Max Run Time = 8035200 # default limit is 6 days, 518400sec. bumped 3x just in case
   Spool Data = yes
   Spool Attributes = yes  # spools catalog entries to disk until after file is backed up.  If a job fails, catalogue remains clean

##JN backup Bacula DB when job is done.
RunScript {
         RunsWhen = After
         FailJobOnError = No
         Command = "/usr/local/sbin/backup-bacula-db.sh"
    }
}
#----------------------------------



https://www.bacula.org/5.2.x-manuals/en/main/main/Configuring_Director.html

Max Run Time = time
     The time specifies the maximum allowed time that a job may run, 
counted from when the job starts, (not necessarily the same as when the 
job was scheduled).
     By default, the watchdog thread will kill any Job that has run 
more than 6 days. The maximum watchdog timeout is independent of 
MaxRunTime and cannot be changed.
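
As a quick sanity check on the numbers above, here is a minimal Python sketch. The variable names are my own; the values are copied from the log, the job file, and the manual excerpt. It only does arithmetic and makes no claim about how Bacula computes its timeout internally.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from this post (names are illustrative only).
watchdog_kill_secs = 518417    # "Watchdog sending kill after 518417 secs"
max_run_time_secs = 8035200    # Max Run Time in the Job resource
documented_limit_days = 6      # default watchdog limit quoted from the manual


def to_days(seconds):
    """Convert a duration in seconds to (fractional) days."""
    return seconds / SECONDS_PER_DAY


# The documented 6-day limit expressed in seconds.
limit_secs = documented_limit_days * SECONDS_PER_DAY

# How far past the 6-day mark the kill was logged.
overshoot_secs = watchdog_kill_secs - limit_secs

# How the configured Max Run Time compares with the 6-day limit.
max_run_time_days = to_days(max_run_time_secs)
ratio_to_limit = max_run_time_secs / limit_secs

print(f"6-day limit:        {limit_secs} s")
print(f"kill overshoot:     {overshoot_secs} s")
print(f"Max Run Time:       {max_run_time_days} days")
print(f"Max Run Time/limit: {ratio_to_limit}x")
```

This gives 518400 seconds for the documented 6-day limit, so the kill at 518417 seconds came 17 seconds after it, which matches the "6 days 17 secs" elapsed time in the job report. The configured Max Run Time works out to 93 days, about 15.5 times the 6-day limit, so the job was killed long before Max Run Time could have applied.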

-- 
Thanks,

John H. Nyhuis
Desk: (206)-685-8334
jnyh...@uw.edu
Box 359461, 15th floor, 106
