Josh,

Good catch! I didn't notice that jobs 662 and 663 started at exactly the
same time.

Your theory sounds very persuasive. I have one doubt, though: *why did job
662 write files to both volumes 250 and 251?* bls shows that 662 wrote most
of its data to volume 250, and then wrote a bunch of smaller files to
volume 251. Volume 250 didn't have an 'End Job Session' record from job
662. That record was in the first part of volume 251, after around 1000
small files. Why might job 662 have written to two volumes?

Some thoughts on workarounds: I am not sure that I actually need to use the
'MaxVolumeJobs' option. Really, my goal is to separate jobs run around a
certain time from jobs run at other times, to make it easier to recycle
volumes. *Maybe putting 'VolumeUseDuration = 20 hours' in the relevant
pools could achieve the same thing,* with multiple jobs of the same
type/pool (Inc, Diff, Full) each going into their own volume(s) for the
time period specified? In other words, I wonder if I would have encountered
an error if the ideal volume for each of the two jobs in the race condition
was the same. I don't have deep knowledge of how this part of bacula works,
so *perhaps this would just create a different problem*.
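
For reference, the sort of pool change I have in mind would look something
like this (just a sketch; the pool name, label format, and other directives
here are placeholders, not my actual config):

```conf
Pool {
  Name = Synology-Local-Inc        # placeholder pool name
  Pool Type = Backup
  # Mark a volume Used once it has been in service this long, so the
  # next night's jobs get labeled a fresh volume:
  Volume Use Duration = 20 hours
  # Auto-label new volumes as they are needed:
  Label Format = "Synology-Local-Inc-"
}
```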

Indeed, *the manual suggests that this option could cause other problems if
jobs were still writing to the volume when the volume use duration expired.*
From the manual:
"Be careful about setting the duration to short periods such as 23 hours,
or you might experience problems of Bacula waiting for a tape over the
weekend only to complete the backups Monday morning when an operator mounts
a new tape. The use duration is checked and the Used status is set only at
the end of a job that writes to the particular volume, which means that
even though the use duration may have expired, the catalog entry will not
be updated until the next job that uses this volume is run. This directive
is not intended to be used to limit volume sizes and may not work as
expected (i.e. will fail jobs) if the use duration expires while multiple
simultaneous jobs are writing to the volume."

*I don't have a reason to limit the number of volume jobs OR the volume use
duration*, except that *I want to be able to recycle volumes promptly.*
When I first started this configuration I had not set up bacula cloud copy
jobs yet, so was considering things like running rsync jobs to copy my
volumes off the local storage to somewhere else. Now that I have bacula
cloud copy jobs properly set up, I have no need to limit volumes in this
way.

*I wonder if the simplest, least invasive way to work around this problem
might be* to follow advice I've seen elsewhere: don't try to micromanage
the bacula volumes, and let bacula take care of that for me. I'm guessing
that simply *limiting volumes to a certain reasonable size*
(*MaximumVolumeBytes = xxG*) should accomplish this for me (perhaps some
value that would be filled within a week, but wouldn't result in many
volumes per daily job?). Even when using MaximumVolumeBytes, I think it
could theoretically still be possible for the same sort of issue to occur,
but it's probably much less likely: not only would we need a race condition
like the one we think occurred between jobs 662 and 663, but one of the
racing jobs would ALSO have to fill a volume, changing its status to Full.
There could also be other code that deals with volumes being full, though
I'm not sure how that's handled or whether the result would be different.
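
Concretely, I'm picturing something like the following in each pool (the
size and retention values below are placeholder guesses for illustration,
not tested numbers):

```conf
Pool {
  Name = Synology-Local-Inc        # placeholder pool name
  Pool Type = Backup
  # Cap volume size so Bacula rolls over to a new volume on its own:
  Maximum Volume Bytes = 50G
  # Let Bacula prune and recycle volumes once retention expires,
  # instead of managing volume lifetimes by hand:
  AutoPrune = yes
  Recycle = yes
  Volume Retention = 14 days
}
```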

The number of jobs for this bacula instance isn't very high, so giving them
different priorities is a minor pain at most. I would think that should
definitely work around the problem, though ideally I would use a solution
that doesn't necessitate micromanaging things.
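
If I went that route, I believe it would just be a Priority directive in
each Job resource, something like the sketch below (the job names are from
my actual jobs; the priority values are arbitrary). As I understand it,
lower numbers run first, and by default jobs of different priorities don't
run concurrently, which is presumably what prevents the race:

```conf
Job {
  Name = "Backup-win11-base-fd-job"
  Priority = 10    # starts first
  # ... rest of job definition unchanged
}

Job {
  Name = "Backup-akita-job"
  Priority = 11    # waits until all priority-10 jobs have finished
  # ... rest of job definition unchanged
}
```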

Regards,
Robert Gerber
402-237-8692
r...@craeon.net


On Wed, Mar 26, 2025 at 9:04 AM Josh Fisher <jfis...@jaybus.com> wrote:

>
> On 3/25/25 14:35, Rob Gerber wrote:
>
> Josh,
>
> Here you go. Thank you!
>
> *My Synology-Local autochanger and associated devices from bacula-sd.conf
> file:*
>
> ...
>
>
> OK. That looks like the usual autochanger config.
>
> Looking at the log of the jobs starting, note that:
>
> *Joblogs from jobs 662 and 663 (copied directly out of bacula.log):*
> 20-Mar 23:05 td-bacula-dir JobId 662: Start Backup JobId 662,
> Job=Backup-win11-base-fd-job.2025-03-20_23.05.01_40
> ...
>
> 20-Mar 23:05 td-bacula-dir JobId 663: Start Backup JobId 663,
> Job=Backup-akita-job.2025-03-20_23.05.01_41
>
>
>
> those jobs started simultaneously.
>
> I believe it is a race condition. Each job, at startup, is assigned a
> device, in this case an autochanger drive. Then each job selects a volume
> to write on. The drive selection is handled atomically, and each volume
> selection is handled atomically, however, if two jobs start simultaneously,
> then one job wins and gets to select a volume first. So, after both jobs
> had selected a device, it went something like this:
>
> - Both job 662 and 663 are in queue waiting to select a volume
> - Job 662 wins and enters atomic volume selection. Since all volumes are
> used only once, it creates Synology-Local-Inc-250.
> - Job 662 leaves atomic volume selection and job 663 enters.
> - Job 663 now sees a new volume Synology-Local-Inc-250 ready to be written
> to and selects it
> - Job 662 mounts Synology-Local-Inc-250 in its device and changes the
> volume status to Used
> - Job 663 attempts to mount Synology-Local-Inc-250, but sees that it is
> Used, logs the error, then re-enters atomic volume selection and creates
> Synology-Local-Inc-251
>
> It happens seemingly randomly because it depends on the timing. Sometimes
> the first job already has the volume marked as used BEFORE the other job
> enters atomic volume selection, and then it works as expected.
>
> The easy fix (to the code) is likely to do the volume status change before
> leaving the atomic volume selection whenever the max volume jobs count is
> reached.
>
> The workaround is to give each job a different priority or else stagger
> the job start times in the Job definitions for each job. Of course, that is
> a pain if there are a lot of jobs.
>
>
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
