I've not tried it, but it looks like you can detect job pruning by including
"events" in the list of items handled in the Messages resource.  That will
make it record a "purge jobid=..." event when it happens.

https://www.bacula.org/15.0.x-manuals/en/main/New_Features_in_11_0_0.html#SECTION004500000000000000000
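
Untested, but as I read that page it means adding "events" to the catalog
destination in the Director's Messages resource, roughly like this (the
other destinations are only an example; keep whatever you already have):

    Messages {
      Name = Standard
      catalog = all, !skipped, !saved, events
      ...
    }

If I remember the feature correctly, the recorded events can then be shown
with "list events" in bconsole, but again, I've not tried it.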

__Martin


>>>>> On Thu, 18 Sep 2025 14:15:55 -0500, Rob Gerber said:
> 
> Martin,
> 
> I'm sure you're right. That was only my theory, because at the time when I
> discovered this mess, I found multiple
> 
> Something strange was definitely happening. I can't imagine why a job
> would be pruned from the cloud, necessitating another copy, when the local
> job was both older and (should have) had the same retention period.
> 
> Is there any way to tell when a job was pruned?
> 
> Robert Gerber
> 402-237-8692
> [email protected]
> 
> On Thu, Sep 18, 2025, 11:18 AM Martin Simmons <[email protected]> wrote:
> 
> > Something I don't understand in your description below is that you say
> > "The new copy job would run against the old volume", but you are using
> > PoolUncopiedJobs, which looks in the Job table when deciding what to copy.
> > It knows nothing about the volumes, so maybe the original jobs still
> > existed in the Job table?
> >
> > PoolUncopiedJobs does a query like this:
> >
> >     SELECT DISTINCT Job.JobId,Job.StartTime FROM Job,Pool
> >     WHERE Pool.Name = '%s' AND Pool.PoolId = Job.PoolId
> >     AND Job.Type = 'B' AND Job.JobStatus IN ('T','W')
> >     AND Job.JobBytes > 0
> >     AND Job.JobId NOT IN
> >     (SELECT PriorJobId FROM Job WHERE
> >     Type IN ('B','C') AND Job.JobStatus IN ('T','W')
> >     AND PriorJobId != 0)
> >     ORDER by Job.StartTime;
> >
> > where %s is the name of the pool.
> >
> > It should print a message like:
> >
> > "The following ... JobIds chosen to be copied:"
> >
> > followed by a list of JobIds.
> >
> > __Martin
> >
> >
> > >>>>> On Wed, 17 Sep 2025 22:36:31 -0500, Rob Gerber said:
> > >
> > > I have a Bacula 15.0.2 instance running in a Rocky Linux 9 VM on a
> > > Synology appliance. Backups are stored locally on the Synology
> > > appliance. Local volumes are copied offsite to a Backblaze B2
> > > S3-compatible endpoint using copy job selection type PoolUncopiedJobs.
> > > For ransomware protection, object locking is enabled on the B2 account.
> > > Because of the focus on cloud storage as a secondary backup
> > > destination, I have enforced 'jobs per volume = 1' both locally and in
> > > the cloud. I have 3 cloud pools / buckets, for Full, Diff, and Inc
> > > backups. Each has its own retention and object lock period.
> > > Backblaze's file lifecycle settings are configured to keep only the
> > > most recent copy of each file, with the retrospectively obvious
> > > exception that it will respect the object lock period before removing
> > > previous copies of a file. Remember that last fact, as it is going to
> > > be important for later. :(
> > >
> > > Recently, we've had some problems where old jobs were being copied
> > > into the cloud repeatedly. With object locking enabled, you can
> > > imagine how unpleasant that could become.
> > >
> > > Here's my understanding of what was happening, based on my review of
> > > bacula.log: as we reached the end of retention for full backups (for
> > > the first time), the older full backup volumes weren't being pruned.
> > > The old local job would be deleted, but when the copy jobs launched
> > > they would detect those old full jobs as jobs that hadn't yet been
> > > copied (because the old copy job was pruned). The new copy job would
> > > run against the old volume (I'm guessing this is how it worked, since
> > > the jobs were certainly pruned), and the local volume for these old
> > > fulls would be copied offsite... again. The recently copied job would
> > > be pruned again (because it inherited retention from its source job).
> > > Later, my copy jobs would run again. More copy jobs would spawn. My
> > > review of the bacula.log file showed that this had been occurring
> > > sporadically since August, with the issue rapidly escalating in
> > > September. For September, I believe it was basically copying expired
> > > fulls, uploading them over the course of a couple of days, pruning the
> > > recently uploaded expired fulls, then launching new copy jobs to begin
> > > the noxious cycle all over again that evening. Backblaze was allowing
> > > the deletion of truncated part files as the cloud volumes were reused
> > > in the cycle, but was keeping all past copies of the part files that
> > > hadn't yet reached the object lock duration.
> > >
> > > I previously ran a number of test full backups (in June, I believe), some
> > > of which might not have been deleted, and some of which may have been
> > > uploaded to the B2 cloud.
> > >
> > > I suspect that some of those jobs recently fell off, and the jobs were
> > > then pruned. The volumes weren't pruned and truncated, and because I
> > > had more volumes than
> > > numClients*numTimesExpectedToRunWithinRetentionPeriod, the excess
> > > volumes weren't promptly truncated by Bacula for use in newer jobs.
> > > I'm sure some of the excess volumes WERE truncated and reused, but not
> > > all of them. At least 1 or 2 of the original 'full' backup volumes did
> > > remain, both in Bacula's database and on disk. These jobs were
> > > subsequently re-copied up to B2 cloud storage multiple times.
> > >
> > > I've created scripts and admin jobs to prune all pools (prune allpools
> > > yes, then select '3' for volumes) with job priority 1 / schedule
> > > daily, and then truncate all volumes for each storage (truncate
> > > allpools storage=$value) with priority 2 / schedule daily. My goal is
> > > to ensure that no copy job can run against an expired volume. I also
> > > want to remove expired volumes from cloud storage as quickly as
> > > possible.
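> > >
> > > For illustration, the wrapper scripts boil down to something like this
> > > (a sketch, not the literal scripts; the bconsole path assumes the
> > > standard /opt/bacula layout, and the storage list should cover every
> > > storage, not just the two shown):
> > >
> > >     #!/bin/sh
> > >     # prune-allpools.sh (sketch): prune volumes in every pool.
> > >     # The "3" answers the prune prompt's item selection with Volumes,
> > >     # as described above.
> > >     printf 'prune allpools yes\n3\nquit\n' | /opt/bacula/bin/bconsole
> > >
> > >     #!/bin/sh
> > >     # truncate-volumes.sh (sketch): truncate purged volumes on each storage.
> > >     for st in Synology-Local B2-TD-Full; do
> > >         printf 'truncate allpools storage=%s\nquit\n' "$st" | /opt/bacula/bin/bconsole
> > >     done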
> > >
> > > I expect this has probably solved my problem, with the exception that
> > > I have now created a new problem: a task runs before any of my backups
> > > with the express purpose of destroying any expired backups. If backups
> > > haven't run successfully for some time (which would be a colossal
> > > failure of monitoring, to be sure), then it's possible for Bacula to
> > > merrily chew up all the past backups, leaving us with nothing. As it
> > > stands, 3 months of full backups should be extant before anything
> > > reaches its expiry period and is then purged. Still, I don't like
> > > having a hungry monster chewing up old backups at the end of my backup
> > > trail, at least not without some sort of context-sensitive monitoring.
> > > I don't know how a script could possibly monitor the overall health of
> > > Bacula's backups and know whether or not it is dangerous to proceed. I
> > > suspect it cannot.
> > >
> > > Further remediation / prevention:
> > > I have enabled data storage caps for our B2 cloud account. I can't set
> > > the caps where I want them because our current usage is higher than
> > > that level, but I have set them to prevent substantial further growth.
> > > I've set a calendar reminder in 3 months to lower the cap to 2x our
> > > usual usage.
> > > With Backblaze B2, I cannot override the object lock unless I delete
> > > the account. I would be prepared to do that if the usage were high
> > > enough, but I think it's probably better to just ride this out for the
> > > few months it will take for these backups to fall off. The alternative
> > > (deleting the account, recreating it, losing our cloud backups, and
> > > recopying all the local volumes to the cloud) would also mean that,
> > > crucially, all the object lock periods would be wrong: the object
> > > locks would be just getting started while the jobs were already
> > > partway through their retention period.
> > >
> > > My questions:
> > > What are the best practices around the use of prune and truncate
> > > scripts like those I've deployed?
> > > Does my logic about the cause of all my issues seem reasonable?
> > > Any ideas or tips to prevent annoying problems like this in the
> > > future?
> > > Is my daily prune / truncate script a horrible, horrible idea that is
> > > going to destroy everything I love?
> > > Any particular way to handle this differently, perhaps more safely? It
> > > does occur to me that I could delete every volume with status
> > > 'Purged', bringing my volume count much closer to the number of
> > > volumes I usually need within my retention period, thereby improving
> > > the odds that routine operation will truncate any previously expired
> > > volumes, without the need for a routine truncate script. I'm not sure
> > > if this would really make any substantial difference for my 'somehow
> > > we have failed to monitor Bacula and it has chewed its entire tail off
> > > while we weren't looking' scenario.
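> > >
> > > (If I went that route, I picture it as something like the following in
> > > bconsole; the volume name is only an example based on my label format,
> > > and delete will prompt for confirmation:)
> > >
> > >     *list volumes pool=Synology-Local-Full
> > >     *delete volume=Synology-Local-Full-0123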
> > >
> > > My bacula-dir.conf, selected portions:
> > > Job {
> > >   Name = "admin-prune-allpools-job"   # If we don't remove expired volumes, we will fall into an infinite copy loop
> > >   Type = admin                        # where copy jobs keep running against random expired local jobs, whose volumes should
> > >   Level = Full                        # have been pruned already.
> > >   Schedule = "Daily"                  # Step 1: prune, so volumes can be truncated.
> > >   Storage = "None"                    # Step 2: truncate, so volumes are reduced to their minimum size on disk.
> > >   Fileset = "None"
> > >   Pool = "None"
> > >   JobDefs = "Synology-Local"
> > >   Runscript {
> > >      RunsWhen = before
> > >      RunsOnClient = no  # Default yes; there is no client in an Admin job & Admin Job RunScripts *only* run on the Director :)
> > >      Command = "/opt/bacula/etc/prune-allpools.sh" # This can be `Console` if you wish to send console commands, but don't use this*
> > >   }
> > >   Priority = 1
> > > }
> > >
> > > Job {
> > >   Name = "admin-truncate-volumes-job"   # If we don't remove expired volumes, we will fall into an infinite copy loop
> > >   Type = admin                          # where copy jobs keep running against random expired local jobs, whose volumes should
> > >   Level = Full                          # have been pruned already.
> > >   Schedule = "Daily"                    # Step 1: prune, so volumes can be truncated.
> > >   Storage = "None"                      # Step 2: truncate, so volumes are reduced to their minimum size on disk.
> > >   Fileset = "None"
> > >   Pool = "None"
> > >   JobDefs = "Synology-Local"
> > >   Runscript {
> > >      RunsWhen = before
> > >      RunsOnClient = no  # Default yes; there is no client in an Admin job & Admin Job RunScripts *only* run on the Director :)
> > >      Command = "/opt/bacula/etc/truncate-volumes.sh" # This can be `Console` if you wish to send console commands, but don't use this*
> > >   }
> > >   Priority = 2
> > > }
> > >
> > >
> > > Job {
> > >   Name = "admin-copy-control-launcher-job"   # Launching copy control jobs from a script so the 'pooluncopiedjobs' lookup
> > >   Type = admin                               # will be fresh as of when the daily backups have completed. Otherwise the jobs
> > >   Level = Full                               # selected for copy will date from when the copy control job was queued.
> > >   Schedule = "Copy-End-Of-Day"
> > >   Storage = "None"
> > >   Fileset = "None"
> > >   Pool = "None"
> > >   JobDefs = "Synology-Local"
> > >   Runscript {
> > >      RunsWhen = before
> > >      RunsOnClient = no  # Default yes; there is no client in an Admin job & Admin Job RunScripts *only* run on the Director :)
> > >      Command = "/opt/bacula/etc/copy-control-launcher.sh" # This can be `Console` if you wish to send console commands, but don't use this*
> > >   }
> > >   Priority = 20
> > > }
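> > >
> > > For illustration, such a launcher can be as simple as this (a sketch,
> > > not my literal script; only the Full copy control job is shown in the
> > > config below, and the Diff/Inc job names here are placeholders for the
> > > parallel definitions):
> > >
> > >     #!/bin/sh
> > >     # copy-control-launcher.sh (sketch): start one copy control job per
> > >     # backup level now, so the PoolUncopiedJobs selection is evaluated
> > >     # after the day's backups instead of when the job was queued.
> > >     for j in Copy-Control-Full-job Copy-Control-Diff-job Copy-Control-Inc-job; do
> > >         printf 'run job=%s yes\n' "$j"
> > >     done | /opt/bacula/bin/bconsole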
> > >
> > > Job {
> > >   Name = "Copy-Control-Full-job"   # Launching a different copy control job for each backup level to prevent
> > >   Type = Copy                      # copy control jobs with different pools from being cancelled as duplicates.
> > >   Level = Full
> > >   Client = td-bacula-fd
> > >   Schedule = "Manual"
> > >   FileSet = "None"
> > >   Messages = Standard
> > >   Pool = Synology-Local-Full
> > >   Storage = "Synology-Local"
> > >   Maximum Concurrent Jobs = 4
> > >   Selection Type = PoolUncopiedJobs
> > >   Priority = 21
> > >   JobDefs = "Synology-Local"
> > > }
> > >
> > > Job {
> > >   Name = "Backup-delegates-cad1-job"
> > >   Level = "Incremental"
> > >   Client = "delegates-cad1-fd"
> > >   Fileset = "Windows-All-Drives-fs"
> > >   Storage = "Synology-Local"
> > >   Pool = "Synology-Local-Inc"
> > >   JobDefs = "Synology-Local"
> > >   Priority = 13
> > > }
> > >
> > > JobDefs {
> > >   Name = "Synology-Local"
> > >   Type = "Backup"
> > >   Level = "Incremental"
> > >   Messages = "Standard"
> > >   AllowDuplicateJobs = no # We don't want duplicate jobs. What action is taken is determined by the variables
> > >                           # below. See flowchart Figure 23.2 in the Bacula 15.x Main manual, probably page 245
> > >                           # in the PDF.
> > >   CancelLowerLevelDuplicates = yes # If a lower level job (example: inc) is running or queued and a higher
> > >                                    # level job (example: diff or full) is added to the queue, then the lower
> > >                                    # level job will be cancelled.
> > >   CancelQueuedDuplicates = yes # This will cancel any queued duplicate jobs.
> > >   Pool = "Synology-Local-Inc"
> > >   FullBackupPool = "Synology-Local-Full"
> > >   IncrementalBackupPool = "Synology-Local-Inc"
> > >   DifferentialBackupPool = "Synology-Local-Diff"
> > >   Client = "td-bacula-fd"
> > >   Fileset = "Windows-All-Drives-fs"
> > >   Schedule = "Daily"
> > >   WriteBootstrap = "/mnt/synology/bacula/BSR/%n.bsr"
> > >   MaxFullInterval = 30days
> > >   MaxDiffInterval = 7days
> > >   SpoolAttributes = yes
> > >   Priority = 10
> > >   ReRunFailedLevels = yes # If the previous full or diff failed, the current job will be upgraded to match the
> > >                           # failed job's level. A failed job is defined as one that has not terminated
> > >                           # normally, which includes any running job of the same name. Cannot allow duplicate
> > >                           # queued jobs. Will also trigger on fileset changes, regardless of whether you used
> > >                           # 'ignorefilesetchanges'.
> > >   RescheduleOnError = no
> > >   Accurate = yes
> > > }
> > >
> > >
> > > Schedule {
> > >   Name = "Daily"
> > >   Run = at 20:05
> > > }
> > > Schedule {
> > >   Name = "Copy-End-Of-Day"
> > >   Run = at 23:51
> > > }
> > >
> > > Pool {
> > >   Name = "Synology-Local-Full"
> > >   Description = "Synology-Local-Full"
> > >   PoolType = "Backup"
> > >   NextPool = "B2-TD-Full"
> > >   LabelFormat = "Synology-Local-Full-"
> > >   LabelType = "Bacula"
> > >   MaximumVolumeJobs = 1
> > > #  MaximumVolumeBytes = 50G
> > > #  MaximumVolumes = 100
> > >   Storage = "Synology-Local"
> > >   ActionOnPurge=Truncate
> > >   FileRetention = 95days
> > >   JobRetention = 95days
> > >   VolumeRetention = 95days
> > > }
> > >
> > > Pool {
> > >   Name = "B2-TD-Full"
> > >   Description = "B2-TD-Full"
> > >   PoolType = "Backup"
> > >   LabelFormat = "B2-TD-Full-"
> > >   LabelType = "Bacula"
> > >   MaximumVolumeJobs = 1
> > >   FileRetention = 95days
> > >   JobRetention = 95days
> > >   VolumeRetention = 95days
> > >   Storage = "B2-TD-Full"
> > >   CacheRetention = 1minute
> > >   ActionOnPurge=Truncate
> > > }
> > >
> > > Regards,
> > > Robert Gerber
> > > 402-237-8692
> > > [email protected]
> > >
> >
> 

