**Problem Description**

I'm trying to get the following configuration to work (the real one is more complex, but this is a minimal example, although it's still quite long):
Imagine three hosts, all Linux:

- Host Alice: runs bacula-fd, sending local data to Bob.
- Host Bob: runs bacula-fd, bacula-sd, and bacula-dir; its storage points to a NAS.
- Host Carl: runs bacula-sd, pointing to a local ZFS filesystem.

I'm trying to get the backups that are made daily from host Alice to host Bob copied, so that they are also stored on host Carl for redundancy. Bacula seems to have a [copy job](https://www.bacula.org/11.0.x-manuals/en/main/Migration_Copy.html) made for exactly this, but I cannot get it to work: instead of copying files, the copy jobs cause the job (and everything that depends on it) to deadlock.

**Configuration**

The definitions of the backup job and the copy jobs, stored on host Bob at `Bob:/etc/bacula/jobdefs/alice.conf`, look like the following:

```
Job {
  Name = "alice-job"
  Client = "alice-fd"
  FileSet = alice-fileset
  Pool = Default
  Full Backup Pool = alice-full-pool
  Incremental Backup Pool = alice-inc-pool
  Differential Backup Pool = alice-diff-pool
  Enabled = Yes
  Type = Backup
  Schedule = WeeklyCycle
  Messages = Standard
  Storage = sdef-bob
  RunScript {
    RunsOnClient = No
    Command = "/etc/bacula/runCopyJob.sh -n %n -i %i -p %p -l %l"
    RunsWhen = After
  }
  Maximum Concurrent Jobs = 63   ## Testing various values here.
}

# Supposed to copy the Full backups originally from Alice, from Bob to Carl
# Not working yet...
Job {
  Name = "alice-copy-full"
  Client = "bob-fd"
  FileSet = alice-fileset
  Type = Copy
  Pool = alice-full-pool
  # Copies all jobs that have not yet been copied.
  Selection Type = PoolUncopiedJobs
  Messages = Standard
  Enabled = Yes
  # Note: This is the source, not the target storage
  Storage = bob-storage
  Maximum Concurrent Jobs = 19   ## Trying out Pool jobs x 2 + 1
}

Job {
  Name = "alice-copy-inc"
  Client = "bob-fd"
  FileSet = alice-fileset
  Type = Copy
  Pool = alice-inc-pool
  Selection Type = PoolUncopiedJobs
  Messages = Standard
  Enabled = Yes
  Storage = bob-storage
  Maximum Concurrent Jobs = 63
}

Job {
  Name = "alice-copy-diff"
  Client = "bob-fd"
  FileSet = alice-fileset
  Type = Copy
  Pool = alice-diff-pool
  Selection Type = PoolUncopiedJobs
  Messages = Standard
  Enabled = Yes
  Storage = bob-storage   # Note: This is the source, not the target storage
  Maximum Concurrent Jobs = 63
}
```

The `WeeklyCycle` schedule (shown further down) does a Full backup on the first Sunday of each month, a Differential backup on the other Sundays, and Incremental backups daily on the remaining days.

Side note on the value 63: `63 = 31 * 2 + 1`. There are 31 days in the longest month, each copy task spawns two jobs, and the main task is one more job. Thus at most 63 jobs can be queued on the 31st if the copy-job functionality is enabled for a client and backups run at most daily.

I've been experimenting with the Maximum Concurrent Jobs value, but to no avail. It can be set in many places, and I'm not sure which overrides which; trial and error over the endless possible combinations is taking too long.

I use a little bash script, `runCopyJob.sh`, that calls the copy job after the regular job completes. This does the same as running the job manually via the console (supposing the job is #23 in the list, you would run `bconsole`, type `run`, then select `23`). A sketch of what the script does follows below.
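The real script has extra logging and error handling; this is only a minimal sketch of what it does. The level-to-copy-job mapping and variable names here are illustrative assumptions, but `run job=... yes` is the standard non-interactive bconsole command:

```bash
#!/bin/bash
# Minimal sketch of runCopyJob.sh -- illustrative, not the real script.
# Bacula invokes it as: runCopyJob.sh -n %n -i %i -p %p -l %l
#   %n = job name, %i = job id, %p = pool, %l = job level
while getopts "n:i:p:l:" opt; do
  case "$opt" in
    n) JOBNAME="$OPTARG" ;;   # unused in this sketch
    i) JOBID="$OPTARG" ;;     # unused in this sketch
    p) POOL="$OPTARG" ;;      # unused in this sketch
    l) LEVEL="$OPTARG" ;;
  esac
done

# Map the finished job's level to the matching copy job defined above.
case "$LEVEL" in
  Full)         COPYJOB="alice-copy-full" ;;
  Incremental)  COPYJOB="alice-copy-inc" ;;
  Differential) COPYJOB="alice-copy-diff" ;;
  *)            exit 0 ;;
esac

# Queue the copy job non-interactively -- same effect as typing "run"
# in bconsole and picking the job from the list.
echo "run job=${COPYJOB} yes" | bconsole
```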
Next, there are also six backup pools defined for Alice. These are defined on Bob at `Bob:/etc/bacula/pooldefs/`:

```
Pool {
  Name = alice-full-pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 6 months
  Maximum Volume Jobs = 1
  Maximum Volumes = 9
  Label Format = alice-full-
  Next Pool = alice-full-pool-carl
}

Pool {
  Name = alice-inc-pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 20 days
  Maximum Volume Jobs = 6
  Maximum Volumes = 7
  Label Format = alice-inc-
  Next Pool = alice-inc-pool-carl
}

Pool {
  Name = alice-diff-pool
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 40 days
  Maximum Volume Jobs = 1
  Maximum Volumes = 10
  Label Format = alice-diff-
  Next Pool = alice-diff-pool-carl
}

Pool {
  Name = alice-full-pool-carl
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 6 months
  Maximum Volume Jobs = 1
  Maximum Volumes = 9
  Label Format = alice-full-
  Storage = sdef-carl
}

Pool {
  Name = alice-inc-pool-carl
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 20 days
  Maximum Volume Jobs = 6
  Maximum Volumes = 7
  Label Format = alice-inc-
  Storage = sdef-carl
}

Pool {
  Name = alice-diff-pool-carl
  Pool Type = Backup
  Recycle = yes
  AutoPrune = yes
  Volume Retention = 40 days
  Maximum Volume Jobs = 1
  Maximum Volumes = 10
  Label Format = alice-diff-
  Storage = sdef-carl
}
```

The director (on Bob) is configured via `Bob:/etc/bacula/bacula-dir.conf` like so:

```
Director {
  Name = bob-dir
  DIRport = 9101
  QueryFile = "/etc/bacula/query.sql"
  WorkingDirectory = "/opt/bacula/working"
  PidDirectory = "/var/run"
  Maximum Concurrent Jobs = 63   ## Trying out higher values here...
  Password = "dir-bob-password"
  Messages = Daemon
  TLS Certificate = /etc/bacula/certs/Bob.example.com.crt
  TLS Key = /etc/bacula/certs/Bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Enable = yes
  TLS Require = yes
  TLS Verify Peer = yes
  TLS Allowed CN = bob.example.com
}

Schedule {
  Name = "WeeklyCycle"
  Run = Full 1st sun at 23:05
  Run = Differential 2nd-5th sun at 23:05
  Run = Incremental mon-sat at 23:05
}
```

The storage daemon on Bob is configured via `Bob:/etc/bacula/bacula-sd.conf`:

```
Storage {
  Name = bob-sd
  SDPort = 9103                         # Director's port
  WorkingDirectory = "/opt/bacula/working"
  Pid Directory = "/var/run"
  Maximum Concurrent Jobs = 40
  SDAddress = bob.example.com
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Enable = yes
  TLS Require = yes
  TLS Verify Peer = yes
  TLS Allowed CN = bob.example.com, carl.example.com
}

Director {
  Name = bob-dir
  Password = "sd-bob-dir-bob-pw"
}

Device {
  Name = bob-storage
  Media Type = File
  Archive Device = /FileStorage01       # Path where the NAS is mounted.
  LabelMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;
  RemovableMedia = no;
  AlwaysOpen = no;
  Maximum Concurrent Jobs = 20
}
```

The storage daemon on Carl is configured with the same file on that host:

```
Storage {                               # definition of myself
  Name = carl-sd
  SDPort = 9103                         # Director's port
  WorkingDirectory = "/var/lib/bacula"
  Pid Directory = "/run/bacula"
  Plugin Directory = "/usr/lib/bacula"
  Maximum Concurrent Jobs = 63
  SDAddress = 0.0.0.0
  TLS Enable = Yes
  TLS Require = Yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/carl.example.com.crt
  TLS Key = /etc/bacula/certs/carl.example.com.key
  TLS Verify Peer = Yes
  TLS Allowed CN = bob.example.com
}

#
# List Directors who are permitted to contact Storage daemon
#
Director {
  Name = bob-dir
  Password = "sd-carl-dir-bob-pw"
  TLS Enable = Yes
  TLS Require = Yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/carl.example.com.crt
  TLS Key = /etc/bacula/certs/carl.example.com.key
}

Device {
  Name = carl-storage
  Media Type = File
  Archive Device = /zfs1/external/bob/backups
  LabelMedia = yes;                     # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;                 # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
  Maximum Concurrent Jobs = 20
}
```

Both storage daemons are declared in a file `/etc/bacula/storagedefs/file.conf` so the director over at Bob can find them and orchestrate the file transfer (a quick way to check that the director can reach them is sketched right after this section):

```
Storage {
  Name = sdef-bob
  # Do not use "localhost" here
  # Note: Use a fully qualified name here
  Address = bob.example.com
  SDPort = 9103
  Password = "sd-bob-dir-bob-pw"
  Device = bob-storage
  Media Type = File
  Maximum Concurrent Jobs = 63
}

Storage {
  Name = sdef-carl
  Address = carl.example.com
  SDPort = 9103
  Password = "sd-carl-dir-bob-pw"
  Device = carl-storage
  Media Type = File
  Maximum Concurrent Jobs = 63
  TLS Enable = yes
  TLS Require = yes
  TLS Certificate = /etc/bacula/certs/carl.example.com.crt
  TLS Key = /etc/bacula/certs/carl.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
}
```

The original backup and the copy job are done by the file daemons.
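As referenced above, whether the director can reach both storage daemons at all can be verified with standard bconsole status queries; a minimal sketch (plain shell, nothing Bacula-specific beyond `status storage=...`):

```bash
# Run on Bob, where the director lives. Each command should print the
# corresponding SD's version banner, running jobs, and device status.
echo "status storage=sdef-bob"  | bconsole
echo "status storage=sdef-carl" | bconsole
```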
The configuration for Bob's file daemon is at `bob:/etc/bacula/bacula-fd.conf`:

```
#
# List Directors who are permitted to contact this File daemon
#
Director {
  Name = bacula01-dir
  Password = "fd-bob-dir-bob-pw"
  Address = bob.example.com
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Enable = yes
  TLS Require = yes
  # TLS Allowed CN = "bob.example.com"
}

#
# Restricted Director, used by tray-monitor to get the
# status of the file daemon
#
Director {
  Name = bacula01-mon
  Password = ""
  Monitor = yes
}

#
# "Global" File daemon configuration specifications
#
FileDaemon {                            # this is me
  Name = bob-fd
  FDport = 9102                         # where we listen for the director
  WorkingDirectory = /opt/bacula/working
  Pid Directory = /var/run
  Maximum Concurrent Jobs = 20
  FDAddress = bob.example.com
  TLS Enable = yes
  TLS Require = yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
}
```

And Alice's file daemon is configured at `alice:/etc/bacula/bacula-fd.conf`:

```
#
# List Directors who are permitted to contact this File daemon
#
Director {
  Name = bob-dir
  Password = "alice-fd-bob-dir-pw"      # same pass as on bob /etc/bacula/clientdefs/HOSTNAME-fd.conf
  TLS Enable = yes
  TLS Require = yes
  TLS Verify Peer = yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/alice.example.com.crt
  TLS Key = /etc/bacula/certs/alice.example.com.key
}

#
# Restricted Director, used by tray-monitor to get the
# status of the file daemon
#
Director {
  Name = bacula01-mon
  Password = "RandomSecretDataHere"
  Monitor = yes
}

#
# "Global" File daemon configuration specifications
#
FileDaemon {                            # this is me
  Name = "alice-fd"
  FDport = 9102                         # where we listen for the director
  WorkingDirectory = /opt/bacula/working
  Pid Directory = /var/run
  Maximum Concurrent Jobs = 20
  # Plugin Directory = /usr/lib
  FDAddress = alice.example.com
  TLS Enable = yes
  TLS Require = yes
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
}

# Send all messages except skipped files back to Director
Messages {
  Name = Standard
  director = bob-dir = all, !skipped, !restored
}
```

The director knows about Alice's file daemon because of what's in `bob:/etc/bacula/clientdefs/alice.example.com-fd.conf`:

```
Client {
  Name = "alice-fd"
  Address = alice.example.com
  FDPort = 9102
  Catalog = MyCatalog
  Password = "alice-fd-bob-dir-pw"
  File Retention = 30 days
  Job Retention = 6 months
  AutoPrune = yes
  TLS Enable = yes
  TLS Require = yes
  TLS Certificate = /etc/bacula/certs/alice.example.com.crt
  TLS Key = /etc/bacula/certs/alice.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
}
```

Similarly for Bob's own file daemon at `/etc/bacula/clientdefs/bob.example.com-fd.conf`:

```
Client {
  Name = "bob-fd"
  Address = bob.example.com
  FDPort = 9102
  Catalog = MyCatalog
  Password = "bob-fd-bob-dir-pw"
  File Retention = 14 days
  Job Retention = 1 months
  AutoPrune = yes
  TLS Enable = yes
  TLS Require = yes
  TLS Certificate = /etc/bacula/certs/bob.example.com.crt
  TLS Key = /etc/bacula/certs/bob.example.com.key
  TLS CA Certificate File = /etc/bacula/certs/myca.crt
}
```

**Problem Symptoms**

When running this type of copy job, the job never completes. Running `status dir` within `bconsole` produces output akin to:

```
 JobId  Type  Level  Files  Bytes  Name            Status
======================================================================
 63642  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63643  Copy  Full       0      0  alice-copy-inc  is running
 63644  Back  Full       0      0  alice-job       is running
 63645  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63646  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63647  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63648  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63649  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63650  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63651  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63652  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63653  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63654  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63655  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63656  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63657  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63658  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
 63659  Copy  Full       0      0  alice-copy-inc  is waiting on Storage "sdef-bob"
 63660  Back  Full       0      0  alice-job       is waiting on Storage "sdef-carl"
```

The number of lines depends on the day of the month: two for every incremental backup, plus one (`2n + 1`). The exception is the first of the month, where the last backup is a Full backup; in that case it should give 7 lines. Once the first copy job has run and all the data has been transferred, each subsequent job should only ever produce 3 lines.

Noticing how everything is waiting on these storages, running `status storage` for, say, Carl's storage outputs:

```
carl-sd Version: 9.6.7 (10 December 2020) x86_64-pc-linux-gnu debian bullseye/sid
Daemon started 21-Jun-22 12:08. Jobs: run=0, running=2.
 Heap: heap=401,408 smbytes=722,480 max_bytes=722,674 bufs=674 max_bufs=676
 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 mode=0,0 newbsr=0
 Res: ndevices=2 nautochgr=0

Running Jobs:
Writing: Full Backup job alice-job JobId=63555 Volume=""
    pool="alice-Inc-Pool-carl" device="carl-storage" (/zfs1/external/bob/backups)
    spooling=0 despooling=0 despool_wait=0
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0
    FDReadSeqNo=5 in_msg=5 out_msg=4 fd=6
Writing: Full Backup job alice-job JobId=63644 Volume=""
    pool="alice-Inc-Pool-carl" device="carl-storage" (/zfs1/external/bob/backups)
    spooling=0 despooling=0 despool_wait=0
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0
    FDReadSeqNo=5 in_msg=5 out_msg=4 fd=15
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished         Name
===================================================================
 63350  Full         0          0  Created  20-Jun-22 14:21  alice-job
 63352  Full         0          0  Created  20-Jun-22 14:21  alice-job
 63354  Full         0          0  Created  20-Jun-22 14:21  alice-job
 63355  Full         0          0  Created  20-Jun-22 14:21  alice-job
 63358  Full         0          0  Cancel   20-Jun-22 14:37  alice-job
 63360  Full         0          0  Cancel   20-Jun-22 14:55  alice-job
 63384  Full         0          0  Cancel   20-Jun-22 15:27  alice-job
 63459  Full         0          0  Cancel   21-Jun-22 09:57  alice-job
 63483  Full         0          0  Cancel   21-Jun-22 09:58  alice-job
 63396  Full         0          0  Cancel   21-Jun-22 12:08  alice-job
====

Device status:
Device File: "carl-storage" (/zfs1/external/bob/backups) is not open.
    Device is BLOCKED waiting to create a volume for:
       Pool:        alice-Inc-Pool-carl
       Media type:  File
   Available Space=28.15 TB
==
====

Used Volume status:
====
====
```

No, manually creating a volume (which is supposed to create itself automatically) does not make the problem go away.

It's possible to stop the director and run it in debug mode with its output redirected. Executing `bacula-dir -d 201 -f > /var/log/bacula/run_debug_2.log 2>&1 &` gives a lot more information that can be followed along in the C code. Except... that results in hundreds of megabytes of logs, which won't fit in this post. I'd like to know what to look for.

One thing I found earlier is that at some point the code fails because a variable `rncj` is not high enough. I looked at the C file in question, and it contains the following snippet:

```c
bool inc_read_store(JCR *jcr)
{
   P(rstore_mutex);
   int num = jcr->rstore->getNumConcurrentJobs();
   int numread = jcr->rstore->getNumConcurrentReadJobs();
   int maxread = jcr->rstore->MaxConcurrentReadJobs;
   if (num < jcr->rstore->MaxConcurrentJobs &&
       (jcr->getJobType() == JT_RESTORE ||
        numread == 0 ||
        maxread == 0 ||          /* No limit set */
        numread < maxread))      /* Below the limit */
   {
      num++;
      numread++;
      jcr->rstore->setNumConcurrentReadJobs(numread);
      jcr->rstore->setNumConcurrentJobs(num);
      Dmsg1(200, "Inc rncj=%d\n", num);
      V(rstore_mutex);
      return true;
   }
   V(rstore_mutex);
   return false;
}
```

The line `Inc rncj` would be missing from the log, while the line before it (in the job scheduling process) would be printed; therefore the job isn't allowed to proceed because it could not reserve its read storage. So I tried increasing the Maximum Concurrent Jobs values, but the scheduler just keeps cycling and never actually starts the job, and `rncj` seems to increase without bound (there aren't even 63 jobs queued in total).
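Since the raw debug log is far too big to share, filtering for the reservation counter seems the most practical way in. A minimal sketch of how I've been narrowing the log down (the `rncj` pattern comes straight from the `Dmsg1` call in the snippet above; the rest is ordinary shell):

```bash
#!/bin/bash
# Pull only the read-storage reservation chatter out of the debug log.
LOG=/var/log/bacula/run_debug_2.log

# "Inc rncj" is the Dmsg1() output from inc_read_store(); grepping for
# "rncj" alone also catches any matching decrement messages.
grep -n 'rncj' "$LOG" | tail -n 50

# Follow the log live while re-triggering the copy job, to see whether
# the counter ever goes down again between scheduler passes.
tail -f "$LOG" | grep --line-buffered 'rncj'
```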