Hello,

we are running Bacula 3.0.1 on several Debian "Lenny" 5.0 servers. The Debian packages are self-compiled, backporting the sources from the Debian unstable distribution to the stable one.

In one environment with a tape drive, since the upgrade we have been experiencing problems when the customer forgets to change the tape: jobs are no longer cancelled.

I configured Bacula so that when the last backup (the catalog job) ends, the tape is ejected. This is the device in bacula-sd.conf:

Device {
  Name = LTO2
  Media Type = LTO2
  Archive Device = /dev/nst0
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = yes;
  RemovableMedia = yes;
  RandomAccess = no;
  Offline On Unmount = yes;           # eject tape on unmount
  Maximum Spool Size = 9900000000;
  Spool Directory = /var/spool/bacula;
}

I used "Offline On Unmount = yes;", and when the catalog job ends it unmounts the tape. So usually, if the customer forgets to change the tape, this means that no tape is inserted.

In bacula-dir.conf we have configured the jobs this way:

JobDefs {
  Name = "ServerJob"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"
  Storage = LTO-HENRY
  Messages = Standard
  Pool = Daily
  Priority = 10
  # After 10 1/2 hours from scheduled time cancel job if volume not mounted
  Max Wait Time = 37800
  # After 10 1/2 hours from scheduled time cancel job if not started
  Max Start Delay = 37800
  SpoolData = yes
}

# Define jobs
Job {
  Name = "HenryBackup"
  JobDefs = "ServerJob"
  Client = henry-fd
  FileSet = "Full Set Linux"
  Priority = 10
  Write Bootstrap = "/var/lib/bacula/henry.bsr"
  # mount tape before first backup
  Run Before Job = "/usr/local/bin/bacula_mount_tape"
}

Job {
  Name = "FalstaffBackup"
  JobDefs = "ServerJob"
  Client = falstaff-fd
  FileSet = "Full Set Linux"
  Priority = 30
  Write Bootstrap = "/var/lib/bacula/falstaff.bsr"
}

... (other servers' jobs)

# Backup the catalog database (after the nightly save)
Job {
  Name = "HenryCatalog"
  JobDefs = "ServerJob"
  Client = henry-fd
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/local/bin/backup_bacula_catalog"
  # This unmount the tape after last backup of the day or after failure/cancel
  RunScript {
    Command = "/usr/local/bin/bacula_unmount_tape"
    RunsWhen = After
    RunsOnFailure = yes
    RunsOnClient = no
    RunsOnSuccess = yes
  }
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 100                   # run after main backup
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if volume not mounted
  # this must be 20 minute less than Max Wait Time for backup jobs
  Max Wait Time = 36600
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if not started
  # this must be 20 minute less than Max Start Delay for backup jobs
  Max Start Delay = 36600
}

/usr/local/bin/bacula_mount_tape contains the following command:

#!/bin/sh
/usr/bin/bconsole < /usr/local/share/bacula/mount.txt

/usr/local/share/bacula/mount.txt contains:

mount
quit

/usr/local/bin/bacula_unmount_tape contains the following command:

#!/bin/sh
/usr/bin/bconsole < /usr/local/share/bacula/unmount.txt

/usr/local/share/bacula/unmount.txt contains:

unmount
quit

Before the upgrade (Bacula version 2.4.4), "Max Wait Time = 37800" was enough to get the job cancelled when no tape was inserted.

Note: this didn't work until we put "Max Start Delay = 36600" in the catalog job, which is 20 minutes less than the value for the first job.
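
The limits in the configuration are given in seconds, so a quick conversion to wall-clock times can help when checking them. The following Python sketch is only an illustration: the helper name `deadline` and the date are made up, the schedule times 22:00 and 22:10 are the ones used in this environment, and it assumes each limit simply counts from the moment passed in, which is the reading taken in this post and not a statement of how Bacula implements these directives.

```python
from datetime import datetime, timedelta


def deadline(start, limit_seconds):
    """Return the moment a limit of limit_seconds expires, counted from start."""
    return start + timedelta(seconds=limit_seconds)


# Illustrative date only; 22:00 and 22:10 are the schedule times in this setup.
first_scheduled = datetime(2009, 7, 3, 22, 0)
catalog_scheduled = datetime(2009, 7, 3, 22, 10)

# "Max Wait Time = 37800" on the first job, which starts waiting at 22:00.
first_cancelled = deadline(first_scheduled, 37800)

# "Max Start Delay = 36600" on the catalog job, counted from its scheduled time.
catalog_cancelled = deadline(catalog_scheduled, 36600)

# Assumed alternative: with no shorter Max Start Delay, the catalog job only
# begins waiting once the first job is cancelled, then waits another 37800 s.
catalog_cancelled_late = deadline(first_cancelled, 37800)

for label, when in (
    ("first job cancelled (Max Wait Time)", first_cancelled),
    ("catalog job cancelled (Max Start Delay)", catalog_cancelled),
    ("catalog job cancelled without it", catalog_cancelled_late),
):
    print(f"{label:<42} {when:%Y-%m-%d %H:%M}")
```

Under these assumptions the catalog job's limit expires ten minutes before the first job's, which is the reason the catalog values are set 20 minutes lower than the backup values.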
The first job is scheduled at 22:00; the catalog job is scheduled at 22:10.

I think that if both jobs have "Max Wait Time = 37800" this is what happens:

- 22:00 first backup starts, no tape is inserted, the job starts waiting
- 22:10 catalog job is queued
- 8:30 first job is cancelled -> catalog job starts waiting on the tape mount; it would be cancelled 10:30 hours after this time

If "Max Start Delay = 36600" is used for the catalog job:

- 22:00 first backup starts, no tape is inserted, the job starts waiting
- 22:10 catalog job is queued
- 8:20 catalog job has not started, it is cancelled
- 8:30 first job is cancelled

Could this be correct? As I said, it did work with 2.4.4.

After the upgrade (Bacula version 3.0.1) the jobs are no longer cancelled, even with this configuration. A "status director" shows the following:

*s dir
henry-dir Version: 3.0.1 (30 April 2009) x86_64-pc-linux-gnu debian 5.0.1
Daemon started 30-Jun-09 07:17, 18 Jobs run since started.
 Heap: heap=5,373,952 smbytes=226,039 max_bytes=226,616 bufs=743 max_bufs=748

Scheduled Jobs:
Level          Type     Pri  Scheduled          Name               Volume
===================================================================================
Full           Backup    10  06-Jul-09 22:00    HenryBackup        *unknown*
Full           Backup    20  06-Jul-09 22:00    KatherineBackup    *unknown*
Full           Backup    30  06-Jul-09 22:00    FalstaffBackup     *unknown*
Full           Backup    40  06-Jul-09 22:00    Server2003Backup   *unknown*
Full           Backup    50  06-Jul-09 22:00    MortimerBackup     *unknown*
Full           Backup   100  06-Jul-09 22:10    HenryCatalog       *unknown*
====

Running Jobs:
Console connected at 04-Jul-09 15:30
Console connected at 05-Jul-09 15:30
Console connected at 06-Jul-09 15:14
Console connected at 06-Jul-09 15:24
 JobId Level   Name                                      Status
======================================================================
  2363 Full    HenryBackup.2009-07-03_22.00.00_35        is waiting for a mount request
  2364 Full    KatherineBackup.2009-07-03_22.00.00_36    is waiting execution
  2365 Full    FalstaffBackup.2009-07-03_22.00.00_37     is waiting execution
  2366 Full    Server2003Backup.2009-07-03_22.00.00_38   is waiting execution
  2367 Full    MortimerBackup.2009-07-03_22.00.00_39     is waiting execution
  2368 Full    HenryCatalog.2009-07-03_22.10.00_41       is waiting execution
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name
====================================================================
  2353  Full    399,299    43.32 G  OK       02-Jul-09 05:03 FalstaffBackup
  2354  Full     36,717    9.304 G  OK       02-Jul-09 05:48 Server2003Backup
  2355  Full     31,812    662.1 M  OK       02-Jul-09 05:58 MortimerBackup
  2356  Full          1    286.9 M  OK       02-Jul-09 06:01 HenryCatalog
  2357  Full    277,949    80.08 G  OK       03-Jul-09 01:35 HenryBackup
  2358  Full     90,621    9.272 G  OK       03-Jul-09 02:32 KatherineBackup
  2359  Full    398,727    43.73 G  OK       03-Jul-09 05:06 FalstaffBackup
  2360  Full     36,722    9.184 G  OK       03-Jul-09 05:51 Server2003Backup
  2361  Full     31,812    663.7 M  OK       03-Jul-09 06:04 MortimerBackup
  2362  Full          1    287.1 M  OK       03-Jul-09 06:07 HenryCatalog

A "status storage" hangs in this condition. If we try to stop the storage daemon with the init script, it does not stop: it is necessary to kill -9 the process and then restart it.

After killing the storage daemon and restarting it we got these messages (we are using TLS to encrypt all communications):

06-Jul 15:35 henry-dir JobId 0: Fatal error: authenticate.c:120 Director unable to authenticate with Storage daemon at "henry.cooporso.it:9103". Possible causes:
Passwords or names not the same or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).
Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000 for help.
06-Jul 15:35 henry-dir JobId 0: Failed to connect to Storage daemon.
04-Jul 13:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
05-Jul 05:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
06-Jul 05:00 henry-sd JobId 2363: Please mount Volume "Venerdi1" or label a new one for:
    Job:          HenryBackup.2009-07-03_22.00.00_35
    Storage:      "LTO2" (/dev/nst0)
    Pool:         PoolVenerdi1
    Media type:   LTO2
06-Jul 15:34 henry-sd JobId 2363: Fatal error: Job 2363 canceled.
06-Jul 15:34 henry-sd JobId 2363: Fatal error: fd_cmds.c:170 Command error with FD, hanging up. Append data error.
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
06-Jul 15:35 henry-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
06-Jul 15:35 henry-dir JobId 0: Error: bsock.c:346 Wrote 19 bytes to client:192.168.10.5:36131, but only 0 accepted.
06-Jul 15:35 henry-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
06-Jul 15:35 henry-dir JobId 0: Fatal error: bsock.c:135 Unable to connect to Storage daemon on henry.cooporso.it:9103. ERR=Connection refused
04-Jul 08:30 henry-dir JobId 2363: Fatal error: Max wait time exceeded. Job canceled.
06-Jul 15:35 henry-dir JobId 2363: Error: message.c:730 Operator mail program terminated in error.
CMD=/usr/lib/bacula/bsmtp -h localhost -f "(Bacula henry.cooporso.it) backupad...@localhost" -s "Bacula: Intervention needed forHenryBackup.2009-07-03_22.00.00_35" backupad...@localhost
ERR=Child died from signal 15: Termination

After experiencing the problem the first time, I modified the configuration this way:

JobDefs {
  Name = "ServerJob"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"
  Storage = LTO-HENRY
  Messages = Standard
  Pool = Daily
  Priority = 10
  # After 10 1/2 hours from scheduled time cancel job if volume not mounted
  Max Wait Time = 37800
  # After 10 1/2 hours from scheduled time cancel job if not started
  Max Start Delay = 37800
  # After 20 hours from scheduled time cancel job if not completed
  Max Run Sched Time = 72000
  SpoolData = yes
}

# Backup the catalog database (after the nightly save)
Job {
  Name = "HenryCatalog"
  JobDefs = "ServerJob"
  Client = henry-fd
  Level = Full
  FileSet = "Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/usr/local/bin/backup_bacula_catalog"
  # This unmount the tape after last backup of the day or after failure/cancel
  RunScript {
    Command = "/usr/local/bin/bacula_unmount_tape"
    RunsWhen = After
    RunsOnFailure = yes
    RunsOnClient = no
    RunsOnSuccess = yes
  }
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 100                   # run after main backup
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if volume not mounted
  # this must be 20 minute less than Max Wait Time for backup jobs
  Max Wait Time = 36600
  # After (10 1/2 hours - 20 minutes) from scheduled time cancel job if not started
  # this must be 20 minute less than Max Start Delay for backup jobs
  Max Start Delay = 36600
  # After 20 hours from scheduled time cancel job if not completed
  # this must be 20 minute less than Max Run Sched Time for backup jobs
  Max Run Sched Time = 70800
}

I added "Max Run Sched Time = 72000", but the jobs got stuck anyway (they are not running).
It seems that "Max Wait Time" is either ignored or causes the storage daemon to stop responding.

Tell me if you need more info or the complete config files.

Best regards
Massimiliano Ferrero

-- 
Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax +39-0117768576
e-mail: m.ferr...@midhgard.it
web site: http://www.midhgard.it