On 6/24/22 08:08, Helpdesk - Net Products wrote:
**Problem Description**

Hello Nico,

First, congratulations for winning* the contest for longest email on the Bacula 
Community mailing list. :)

*All results are 100% unofficial, and no prizes will be delivered. :)


OK, there is a lot to parse here, so I will just pick a few things that jumped
out at me to correct, plus some other hints to help you make this work
successfully.

These will be in no particular order.

First, yes you can do Copy/Migration jobs between SDs.

I noticed that your copy destination pools on Carl use the same 'LabelFormat'
as the Alice source pools the jobs are copied from. I would give these copy
destination pools on Carl a different LabelFormat for clarity.
ie: Just looking at a volume's name, you will know where it lives (on Bob or on
Carl)
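For example, a sketch of what the Carl-side destination pool might look like
(the pool name is taken from your carl-sd status output; the LabelFormat string
itself is just illustrative):

```conf
Pool {
  Name = alice-Inc-Pool-carl
  Pool Type = Backup
  Storage = carl-storage
  LabelFormat = "Carl-alice-Inc-"   # distinct prefix: the name tells you it lives on Carl
}
```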

Next, and this is *VERY* important:

You must have different 'MediaType' settings for the device(s) in bob-storage
and carl-storage, and they must match the MediaType setting in the
corresponding "Storage {}" bob-storage and carl-storage resources in the
Director's config that point to them. Currently you have these all set to
"MediaType = File", and this will not work.
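As a sketch of what I mean (the "CarlFile" MediaType name and the address are
just illustrative; any unique string works, as long as the SD device and the
Director's Storage resource agree):

```conf
# In carl's bacula-sd.conf:
Device {
  Name = carl-storage
  Media Type = CarlFile              # was "File"; must be unique to this SD
  Device Type = File
  Archive Device = /zfs1/external/bob/backups
}

# In the Director's bacula-dir.conf:
Storage {
  Name = carl-storage
  Media Type = CarlFile              # must match the SD device above
  Address = carl.example.com         # hypothetical address
  SDPort = 9103
  Device = carl-storage
}
```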


I have no idea what this sentence means:
----8<----
Both storage demons are in a file `/etc/bacula/storagedefs/file.conf` so the 
director over at Bob can find them and
orchestrate the file transfer.
----8<----


You have not included the 'runCopyJob.sh' script, but I understand from your 
description what it does.

There are many ways to run copy jobs, and there are several ways to tell Bacula 
to choose jobs to be copied.

Since you are using the PoolUncopiedJobs option, I would not trigger these
Copy jobs from this script. Instead, set them to run via a schedule with a
different Priority than the normal backup jobs, so that the Copy jobs will be
queued until the backup jobs complete.
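A sketch of what that could look like (names, time, and priority value are
illustrative; backup jobs at the default Priority = 10 will run first):

```conf
Schedule {
  Name = CopySchedule
  Run = daily at 23:05
}

Job {
  Name = CopyAliceToCarl             # hypothetical job name
  Type = Copy
  Selection Type = PoolUncopiedJobs
  Pool = alice-Inc-Pool              # hypothetical source pool
  Schedule = CopySchedule
  Priority = 15                      # higher number = lower priority; queues behind backups
  # plus Client, FileSet, Messages, etc. as required by your setup
}
```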


You have not shown us any job logs to know why Bacula is waiting to create a 
volume. We only see this on the carl-sd status:
----8<----
Device File: "carl-storage" (/zfs1/external/bob/backups) is not open.
   Device is BLOCKED waiting to create a volume for:
       Pool:        alice-Inc-Pool-carl
       Media type:  File
   Available Space=28.15 TB
----8<----

My guess here is that you have the 'MaximumVolumes' in your pools set too low,
and somewhere in the job logs for jobids 63554 and 63555 writing to 'carl-sd'
there will be a message about Maximum Volumes in pool reached.

A 'list pools' output would show us the number of volumes in each pool, and the 
maximum volumes set.
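If the job logs do confirm a Maximum Volumes message, raising the limit in the
destination pool would look something like this (the value is illustrative):

```conf
Pool {
  Name = alice-Inc-Pool-carl
  Maximum Volumes = 100              # illustrative; presumably set lower today
}
```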

Additionally, this issue could also be related to/caused by the MediaType = 
File everywhere.


I do not see the need to set "MaximumConcurrentJobs = 63" in these Copy jobs.
It seems pretty high, and pretty specific. I mean, sure, you can set it, but
using the PoolUncopiedJobs setting, and kicking these Copy jobs off daily (on
a schedule if you follow my recommendations), there will never be the 63
uncopied jobs that you calculated. :)

Not only that, but with MaximumConcurrentJobs = 63 in the carl-sd, and
MaximumConcurrentJobs = 20 in the carl-storage device, you will never get past
20 concurrent jobs. :)

And, rather than using one device in your storages, I would configure
Autochangers with a minimum of 10 devices in them (heck, do 20 or 30, they are
free), and set MaximumConcurrentJobs = 1 on each of them. This way, each
device can be reading/writing a job, and for the write jobs, there will never
be more than 1 job per volume. (see below about MaximumVolumeJobs in pools)
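A minimal sketch of such a virtual autochanger for carl-sd (the device names
and the "CarlFile" MediaType are illustrative; you would define 10+ Device{}
blocks like the one shown):

```conf
Autochanger {
  Name = carl-autochanger
  Device = carl-dev-1, carl-dev-2    # ...and so on, one entry per device
  Changer Device = /dev/null
  Changer Command = ""
}

Device {
  Name = carl-dev-1
  Media Type = CarlFile              # must match the Director's Storage resource
  Device Type = File
  Archive Device = /zfs1/external/bob/backups
  Autochanger = yes
  Maximum Concurrent Jobs = 1        # one job per device at a time
  Label Media = yes
  Random Access = yes
  Automatic Mount = yes
}
```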


Also, since you are using the "PoolUncopiedJobs" feature, I would add the
"MaximumSpawnedJobs" setting to these Copy jobs and set it to '1' until you
get everything ironed out. That way, if there are a lot of jobs in these pools
that have not been copied, you do not end up spawning many Copy control jobs
and then having to cancel them while you are configuring and testing.
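In the Copy job, that would be something like (the job name is hypothetical):

```conf
Job {
  Name = CopyAliceToCarl             # hypothetical copy job name
  Type = Copy
  Selection Type = PoolUncopiedJobs
  Maximum Spawned Jobs = 1           # raise or remove once everything works
}
```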

And I see you setting "MaximumVolumeJobs" in your pools. Personally, if I ever 
set this, I only ever set it to 1 so that I do
not have more than one job on any given volume. This makes things easier for 
cleanup when things go wrong. :)

While this setting may be useful on Tapes (maybe?), I see no reason to allow 
just a specific number of jobs on a volume
unless it is '1'. This is just a personal preference, ymmv.
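If you do set it, the pool directive is simply:

```conf
Pool {
  Name = alice-Inc-Pool-carl
  Maximum Volume Jobs = 1            # volume is marked Used after one job
}
```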


No, manually creating a volume (which is supposed to automatically create 
itself) does not make the problem go away.

We need evidence. :)  Job logs, list pools, etc. (but ONLY after you
reconfigure everything as described above - especially the MediaTypes)


----8<----
It's possible to stop the director and run it in debug mode by redirecting its
output. By executing `bacula-dir -d 201 -f > /var/log/bacula/run_debug_2.log 2>&1 &`
it's possible to view some more information alongside the C code.
----8<----

Now we are getting into the weeds unnecessarily.

Also, you can enable debugging to a *.trace file in /opt/bacula/working (the 
default working directory) by just doing in a
bconsole session:

* setdebug level=xxx trace=1 options=tc director   (xxx = 100 is usually enough)

Then to disable debugging:

* setdebug director level=0 trace=0


----8<----
Except... that results in hundreds of megabytes of logs which won't fit on
this page. I'd like to know more about what to look for...
----8<----

Exactly. And consider that some of the decisions/work is being done on the SDs, 
so you would want to enable debugging on them
the same way as above, substituting 'director' with 'storage=xxxx'
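For example, to trace carl-sd:

* setdebug level=100 trace=1 options=tc storage=carl-sd

and then disable it with:

* setdebug storage=carl-sd level=0 trace=0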


----8<----
One thing I found earlier was that at some point the code fails due to a
variable `rncj` not being high enough. I looked at the C file in question, and
it contains the following snippet:

bool inc_read_store(JCR *jcr)
{
   P(rstore_mutex);
----8<----
And now you have officially jumped into the deep end, which is not necessary 
for sure. :)


I see you are running 9.6.7, which is quite old at this point. I would strongly 
urge you to upgrade to a current 11 version.
There have been a lot of feature enhancements and fixes along the way.

There is a lot to digest, here, I know. Please take a careful look at the 
recommendations I made above and let us know if
this helps.

Remember, if you are still having issues, 'status director', 'status
storage=xxxx', 'list pools', and 'll joblog jobid=xxxx', among other things,
are very helpful for us to troubleshoot.

And finally, Bacula Systems has a Perl script that will collect your Bacula
configurations and a bunch of information from the Linux system it is running
on, which is very helpful for debugging issues.

You can download the script here:
https://www.baculasystems.com/ml/bsys_report/bsys_report.tar.gz

Then, you can just attach the "bsys report" rather than
pasting your configurations. The resulting file is just plain text,
so you can manually edit it, or use sed in-place editing to obfuscate things 
you'd rather keep private. ;)
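For example, a minimal redaction pass with GNU sed (the report filename and
the Password pattern are assumptions; adjust them to what the script actually
produces):

```shell
# Sketch: redact Password directives in the extracted report before attaching it.
# The filename is hypothetical; GNU sed's -i flag edits the file in place.
REPORT=bsys-report.txt
sed -i 's/\(Password[[:space:]]*=[[:space:]]*\).*/\1"XXXX"/' "$REPORT"
```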


Hope this helps!
Bill

--
Bill Arlofski
w...@protonmail.com


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
