I think I've discovered a minor bug in Bacula 2.4.3 (running on Scientific Linux 4.5, but that's probably not relevant), so I'm following the standing suggestion to ask about it here on bacula-users before reporting it to the bug database.
The bug can most succinctly be described by quoting the comment accompanying the change I just made to our Client configurations to work around the problem: Client { . . . Job Retention = 38 years # Keep Jobs until Volumes recycled; we use # "38 years" to represent "forever" because # Bacula's pruning code (at least in release # 2.4.3) does not deal well with Job or File # Retentions long enough to predate the UNIX # epoch, Jan 1, 1970. Autoprune = yes # Prune expired Jobs/Files . . . } A more verbose description of my observations, in case I've jumped to the wrong conclusion... I recently decided we should keep Job records for as long as we keep the associated Volume records (the consequence of some tedious bscan'ing). We have a few tapes which we intend to keep indefinitely in an "Archive Pool", and their Volume Retention is set to "1000 years" (because I couldn't find an official way to say "forever" in the manual). So I set the Job Retention to "1000 years" as well, thinking that the Job records would then stick around in the catalog until their Volume is recycled, or forever, if it is never recycled. I don't know yet if that works as I intended, but within a day of making the change, I started seeing some waiting jobs get aborted as soon as they were ready to proceed, with lines like this one showing up in the log: 28-Nov 04:33 backup-dir JobId 17913: Warning: Error updating job record. sql_update.c:194 Update failed: affected_rows=0 for UPDATE Job SET JobStatus='f',EndTime='2008-11-28 04:33:27',ClientId=14,JobBytes=0,JobFiles=0,JobErrors=0,VolSessionId=0,VolSessionTime=0,PoolId=9,FileSetId=71,JobTDate=1227867100,RealEndTime='2008-11-28 04:11:40',PriorJobId=0 WHERE JobId=17913 I eventually realized that the aborted jobs were always waiting for another job running on the same client, and hypothesized that the first job was mistakenly deleting the Job record of the waiting job during autopruning. So I enabled statement logging on our postgresql server and, during a test restore which triggered the problem, saw lines like these. Note that the SELECTs are testing against a JobTDate millions of years into future. Job 17990, the deletion of whose Job record is recorded in the log, had yet to run: Nov 29 16:05:41 backup postgres[24538]: [1409-1] LOG: statement: INSERT INTO DelCandidates SELECT JobId,PurgedFiles,FileSetId,JobFiles,JobStatus FROM Job WHERE Type='R' AND Nov 29 16:05:41 backup postgres[24538]: [1409-2] JobTDate<18446744043401547957 AND ClientId=1 Nov 29 16:05:41 backup postgres[24538]: [1410-1] LOG: statement: SELECT DISTINCT DelCandidates.JobId,DelCandidates.PurgedFiles FROM Job,DelCandidates WHERE Nov 29 16:05:41 backup postgres[24538]: [1410-2] (Job.JobId=DelCandidates.JobId AND DelCandidates.JobStatus!='T') OR (Job.JobTDate>18446744043401547957 AND Job.ClientId=1 AND Nov 29 16:05:41 backup postgres[24538]: [1410-3] Job.Type='R') Nov 29 16:05:41 backup postgres[24538]: [1411-1] LOG: statement: DELETE FROM File WHERE JobId IN (17990) Nov 29 16:05:41 backup postgres[24538]: [1412-1] LOG: statement: UPDATE Job SET PurgedFiles=1 WHERE JobId IN (17990) Nov 29 16:05:41 backup postgres[24538]: [1413-1] LOG: statement: DELETE FROM JobMedia WHERE JobId IN (17990) Nov 29 16:05:41 backup postgres[24538]: [1414-1] LOG: statement: DELETE FROM Log WHERE JobId IN (17990) Nov 29 16:05:41 backup postgres[24538]: [1415-1] LOG: statement: DELETE FROM Job WHERE JobId IN (17990) Nov 29 16:05:41 backup postgres[24538]: [1416-1] LOG: statement: DROP TABLE DelCandidates Nov 29 16:05:41 backup postgres[24538]: [1417-1] LOG: statement: DROP INDEX DelInx1 Happily, while the first SELECT is putting essentially all jobs into the DelCandidates list (because they all occurred before the future date), the second SELECT protects most of the Job records against deletion, because it needs to find a job successfully executed _after_ the future date, unless the potential victim has a JobStatus that does not indicate succesful completion. Jobs which are waiting to run, though, have a JobStatus of 'C'), so their records are deleted. I suspect the root cause of the issue is the use of edit_uint64 instead of edit_int64 in this part of src/ua_prune.c, which presumably loses the negative sign resulting from subtracting "1000 years" from "Jan 1, 1970": /* * Select all files that are older than the JobRetention period * and stuff them into the "DeletionCandidates" table. */ edit_uint64(now - period, ed1); While I haven't actually observed any problems with File Retention (since we can't afford the space to set that to "forever"), I suspect it could arise, since edit_uint64 is used in these lines in ua_prune.c: /* Select Jobs -- for counting */ Mmsg(query, count_select_job, edit_uint64(now - period, ed1), edit_int64(cr.ClientId, ed2)); . . . /* Now process same set but making a delete list */ Mmsg(query, select_job, edit_uint64(now - period, ed1), edit_int64(cr.ClientId, ed2)); I haven't found any difficulties with Volume Retention, but won't claim to know everywhere to look. ------------------------------------------------------------------------- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/ _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users