[Bacula-users] possible bug with very long Job (or File?) Retention

John Jorgensen Sat, 29 Nov 2008 21:49:26 -0800

I think I've discovered a minor bug in Bacula 2.4.3 (running on
Scientific Linux 4.5, but that's probably not relevant), so I'm
following the standing suggestion to ask about it here on
bacula-users before reporting it to the bug database.


The bug can most succinctly be described by quoting the
comment accompanying the change I just made to our Client
configurations to work around the problem:

  Client {
     .
     .
     .
    Job Retention = 38 years     # Keep Jobs until Volumes recycled; we use 
                                 # "38 years" to represent "forever" because
                                 # Bacula's pruning code (at least in release
                                 # 2.4.3) does not deal well with Job or File
                                 # Retentions long enough to predate the UNIX
                                 # epoch, Jan 1, 1970.

    Autoprune = yes              # Prune expired Jobs/Files
     .
     .
     .
  }

A more verbose description of my observations, in case I've
jumped to the wrong conclusion...

I recently decided we should keep Job records for as long as we
keep the associated Volume records (the consequence of some
tedious bscan'ing).  

We have a few tapes which we intend to keep indefinitely in an
"Archive Pool", and their Volume Retention is set to "1000 years"
(because I couldn't find an official way to say "forever" in the
manual).  So I set the Job Retention to "1000 years" as well,
thinking that the Job records would then stick around in the
catalog until their Volume is recycled, or forever, if it is
never recycled.

I don't know yet if that works as I intended, but within a day of
making the change, I started seeing some waiting jobs get aborted
as soon as they were ready to proceed, with lines like this one
showing up in the log:

  28-Nov 04:33 backup-dir JobId 17913: Warning: Error updating job record. 
sql_update.c:194 Update failed: affected_rows=0 for UPDATE Job SET 
JobStatus='f',EndTime='2008-11-28 
04:33:27',ClientId=14,JobBytes=0,JobFiles=0,JobErrors=0,VolSessionId=0,VolSessionTime=0,PoolId=9,FileSetId=71,JobTDate=1227867100,RealEndTime='2008-11-28
 04:11:40',PriorJobId=0 WHERE JobId=17913

I eventually realized that the aborted jobs were always waiting
for another job running on the same client, and hypothesized that
the first job was mistakenly deleting the Job record of the
waiting job during autopruning.

So I enabled statement logging on our postgresql server and,
during a test restore which triggered the problem, saw lines like
these.  Note that the SELECTs are testing against a JobTDate
millions of years into future.  Job 17990, the deletion of whose
Job record is recorded in the log, had yet to run:

 Nov 29 16:05:41 backup postgres[24538]: [1409-1] LOG:  statement: INSERT INTO 
DelCandidates SELECT JobId,PurgedFiles,FileSetId,JobFiles,JobStatus FROM Job 
WHERE Type='R' AND
 Nov 29 16:05:41 backup postgres[24538]: [1409-2]  
JobTDate<18446744043401547957 AND ClientId=1
 Nov 29 16:05:41 backup postgres[24538]: [1410-1] LOG:  statement: SELECT 
DISTINCT DelCandidates.JobId,DelCandidates.PurgedFiles FROM Job,DelCandidates 
WHERE
 Nov 29 16:05:41 backup postgres[24538]: [1410-2]  
(Job.JobId=DelCandidates.JobId AND DelCandidates.JobStatus!='T') OR 
(Job.JobTDate>18446744043401547957 AND Job.ClientId=1 AND
 Nov 29 16:05:41 backup postgres[24538]: [1410-3]  Job.Type='R')
 Nov 29 16:05:41 backup postgres[24538]: [1411-1] LOG:  statement: DELETE FROM 
File WHERE JobId IN (17990)
 Nov 29 16:05:41 backup postgres[24538]: [1412-1] LOG:  statement: UPDATE Job 
SET PurgedFiles=1 WHERE JobId IN (17990)
 Nov 29 16:05:41 backup postgres[24538]: [1413-1] LOG:  statement: DELETE FROM 
JobMedia WHERE JobId IN (17990)
 Nov 29 16:05:41 backup postgres[24538]: [1414-1] LOG:  statement: DELETE FROM 
Log WHERE JobId IN (17990)
 Nov 29 16:05:41 backup postgres[24538]: [1415-1] LOG:  statement: DELETE FROM 
Job WHERE JobId IN (17990)
 Nov 29 16:05:41 backup postgres[24538]: [1416-1] LOG:  statement: DROP TABLE 
DelCandidates
 Nov 29 16:05:41 backup postgres[24538]: [1417-1] LOG:  statement: DROP INDEX 
DelInx1

Happily, while the first SELECT is putting essentially all jobs
into the DelCandidates list (because they all occurred before the
future date), the second SELECT protects most of the Job records
against deletion, because it needs to find a job successfully
executed _after_ the future date, unless the potential victim has
a JobStatus that does not indicate succesful completion.  Jobs
which are waiting to run, though, have a JobStatus of 'C'), so
their records are deleted.

I suspect the root cause of the issue is the use of edit_uint64
instead of edit_int64 in this part of src/ua_prune.c, which
presumably loses the negative sign resulting from subtracting
"1000 years" from "Jan 1, 1970":

   /*
    * Select all files that are older than the JobRetention
   period
    *  and stuff them into the "DeletionCandidates" table.
    */
   edit_uint64(now - period, ed1);

While I haven't actually observed any problems with File Retention
(since we can't afford the space to set that to "forever"), I
suspect it could arise, since edit_uint64 is used in these lines
in ua_prune.c:

   /* Select Jobs -- for counting */
   Mmsg(query, count_select_job, edit_uint64(now - period, ed1), 
        edit_int64(cr.ClientId, ed2));
        .
        .
        .
   /* Now process same set but making a delete list */
   Mmsg(query, select_job, edit_uint64(now - period, ed1), 
        edit_int64(cr.ClientId, ed2));

I haven't found any difficulties with Volume Retention, but won't
claim to know everywhere to look.

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

[Bacula-users] possible bug with very long Job (or File?) Retention

Reply via email to