In the message dated: Mon, 30 Jan 2012 19:12:11 GMT, The pithy ruminations from Martin Simmons on <Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu ps unreadable> were: => >>>>> On Wed, 25 Jan 2012 16:55:24 -0500, mark bergman said: => > => > The fact that the server rebooted between job 42856 and job 42879 makes => > me wonder if something in the fibre/SCSI initialization or bacula-sd => > startup causes tapes to be rewound but not ejected. => => The bacula-sd startup will rewind the tape, but Bacula should know about that. => => If there is a fibre/SCSI reset without Bacula knowing about it, then that => could cause it.
The problem happened again. Two tapes seem to be corrupted. (I say "seem to be" because bacula loads the correct tape, but then asks for another tape or to have that media labeled. I cannot check the suspect tapes as a backup is running, so 'btape' is unusable.) Here's a timetable of the events from the bacula-dir and mtx logs, with comments {in curly brackets}: {bacula-sd/bacula-dir server working successfully for ~8 days} {mtx log shows which volumes are loaded} 20120129-06:46:20 Doing mtx -f /dev/changer-ml6000 load 4 0 {VolumeTag=003231} 20120130-00:37:40 Doing mtx -f /dev/changer-ml6000 load 7 1 {VolumeTag=000312} 20120130-00:38:08 Device /dev/tape1-ml6000 - not ready, retrying... 20120130-00:38:09 Device /dev/tape1-ml6000 - not ready, retrying... 20120131-15:59:55 Doing mtx -f /dev/changer-ml6000 -- to get count of slots 20120131-15:59:55 Parms: /dev/changer-ml6000 slots 0 /dev/tape0-ml6000 0 20120131-16:00:05 Doing mtx -f /dev/changer-ml6000 -- to list volumes 20120131-16:00:05 Parms: /dev/changer-ml6000 list 0 /dev/tape0-ml6000 0 {no mtx log entries until 20120201-19:41:10} 01-Feb 06:45 sbia-infr-vbacula JobId 43324: Start Backup JobId 43324, Job=corfu-full.2012-02-01_06.45.00_00 01-Feb 06:45 sbia-infr-vbacula JobId 43324: Using Device "ml6000-drv1" 01-Feb 06:45 sbia-infr-vbacula JobId 43324: Committing spooled data to Volume "000312". Despooling 180,926,597 bytes ... {last successful backup to volume 000312, drive 1} 01-Feb 06:45 sbia-infr-vbacula JobId 43325: Start Backup JobId 43325, Job=archive.2012-02-01_06.45.00_01 01-Feb 06:45 sbia-infr-vbacula JobId 43325: Using Device "ml6000-drv0" 01-Feb 06:45 sbia-infr-vbacula JobId 43325: Committing spooled data to Volume "003231". Despooling 404 bytes ... {last backup before infrastructure1 shutdown} Feb 1 18:44 infrastructure1 server running bacula-sd & bacula-dir crashes Feb 1 18:47 bacula-sd starts on infrastructure 3 (HA-initiated startup) the /etc/init.d/bacula-sd script was previously modified to do the following before starting the daemon and after shutting down the daemon: unload all tapes release any SCSI reservations on each tape drive sg_prevent --allow --verbose /dev/tape0-ml6000 sg_prevent --allow --verbose /dev/tape1-ml6000 Feb 1 18:47 bacula-dir starts on infrastructure 3 (HA-initiated startup) Feb 1 18:49 infrastructure1 server starts Feb 1 18:52:25 bacula-dir stopped on infrastructure 3 (manual shutdown) Feb 1 18:52:25 bacula-sd stopped on infrastructure 3 (manual shutdown) Feb 1 18:53 bacula-sd daemon started on infrastructure1 (normal startup), PID 8083 {mtx logs show query of devices in changer} 20120201-18:53:47 Doing mtx -f /dev/changer-ml6000 0 -- to find what is loaded 20120201-18:53:47 Parms: /dev/changer-ml6000 loaded 0 /dev/tape0-ml6000 0 20120201-18:53:57 Doing mtx -f /dev/changer-ml6000 0 -- to find what is loaded 20120201-18:53:57 Parms: /dev/changer-ml6000 loaded 0 /dev/tape0-ml6000 0 Feb 1 18:54 bacula-dir daemon started on infrastructure1 (normal startup), PID 8118 {first backup begins on bacula-sd/bacula-dir server} 01-Feb 19:41 sbia-infr-vbacula JobId 43326: Start Backup JobId 43326, Job=agora-full.2012-02-01_19.41.00_02 01-Feb 19:41 sbia-infr-vbacula JobId 43326: Using Device "ml6000-drv0" {this should be empty from /etc/init.d/bacula-sd} 01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3301 Issuing autochanger "loaded? drive 0" command. 01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3302 Autochanger "loaded? drive 0", result: nothing loaded. 01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3304 Issuing autochanger "load slot 27, drive 0" command. 20120201-19:41:10 Doing mtx -f /dev/changer-ml6000 load 27 0 {there was no unload command issued by bacula...implying that the modified /etc/init.d/bacula-sd startup script unloaded the tape from slot 0} 01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3305 Autochanger "load slot 27, drive 0", status is OK. 01-Feb 19:41 sbia-infr-vbacula JobId 43326: Volume "000335" previously written, moving to end of data. 01-Feb 19:42 sbia-infr-vbacula JobId 43326: Ready to append to end of Volume "000335" at file=63. 01-Feb 19:42 sbia-infr-vbacula JobId 43326: Spooling data ... {backup to volume 000335 in slot 0 is OK} {next backup job starts, uses drive 1} 01-Feb 20:11 sbia-infr-vbacula JobId 43328: Start Backup JobId 43328, Job=home-2.2012-02-01_20.11.00_05 01-Feb 20:11 sbia-infr-vbacula JobId 43328: Using Device "ml6000-drv1" 01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3301 Issuing autochanger "loaded? drive 1" command. 01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3302 Autochanger "loaded? drive 1", result: nothing loaded. {as expected, nothing is loaded in drive 1} 01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3304 Issuing autochanger "load slot 7, drive 1" command. 20120201-20:11:10 Doing mtx -f /dev/changer-ml6000 load 7 1 {load required as the tape was unloaded by /etc/init.d/bacula-sd} 01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3305 Autochanger "load slot 7, drive 1", status is OK. 01-Feb 20:11 sbia-infr-vbacula JobId 43328: Please mount Volume "000312" or label a new one for: {bacula cannot read the label on tape 000312} {used bconsole to change status of volume 000312 to "Disabled", bacula requests, loads, and uses a different tape} 01-Feb 21:00 sbia-infr-vbacula JobId 43328: Recycled volume "000313" 01-Feb 21:00 sbia-infr-vbacula JobId 43328: Using Volume "000313" from 'Scratch' pool. 01-Feb 21:00 sbia-infr-vbacula JobId 43328: 3307 Issuing autochanger "unload slot 7, drive 1" command. 01-Feb 21:01 sbia-infr-vbacula JobId 43328: 3304 Issuing autochanger "load slot 14, drive 1" command. 01-Feb 21:01 sbia-infr-vbacula JobId 43328: 3305 Autochanger "load slot 14, drive 1", status is OK. 01-Feb 21:01 sbia-infr-vbacula JobId 43328: Recycled volume "000313" on device "ml6000-drv1" (/dev/tape1-ml6000), all previous data lost. 01-Feb 21:01 sbia-infr-vbacula JobId 43328: Spooling data ... {after several backups are written to tape 00313 in drive 1, bacula attempts to use 003231 for a job in a different pool} 02-Feb 06:45 sbia-infr-vbacula JobId 43347: Start Backup JobId 43347, Job=archive.2012-02-02_06.45.00_34 02-Feb 06:48 sbia-infr-vbacula JobId 43347: 3307 Issuing autochanger "unload slot 14, drive 1" command. 02-Feb 06:50 sbia-infr-vbacula JobId 43347: Using Device "ml6000-drv1" 02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3301 Issuing autochanger "loaded? drive 1" command. 02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3302 Autochanger "loaded? drive 1", result: nothing loaded. 02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3304 Issuing autochanger "load slot 4, drive 1" command. 02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3305 Autochanger "load slot 4, drive 1", status is OK. 02-Feb 06:51 sbia-infr-vbacula JobId 43347: Please mount Volume "003231" or label a new one for: Based on this, it looks like the corruption is somehow associated with some combination of the bacula-sd and bacula-dir stopping and starting or with the SCSI release command....but that was as close as I was to finding a solution earlier. I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are ejected and to timestamp the SCSI release commands. Is it possible that bacula flagged tapes 003231 and 000312 as being in the drives because they were loaded when the server crashed, even though they were later ejected (outside of bacula's control)? Could this cause bacula to believe that the tapes were at EOT when they do get loaded, and bacula then immediately begins writing (corrupting the label)? [Unlikely that bacula would try to write before reading the label, and would then read the label after corrupting the tapes.] When the current backup is finished, I'll extract the beginning data on each of 003231 and 000312. Is there anything you recommend in terms of checking the data on tape to determine whether the tape begins with random garbage (possibly caused by the shutdown, startup, scsi reset, etc.) or if it begins with valid bacula data that happened to overwrite the label instead of being appended? Does anyone have suggestions of how to troubleshoot this further, or how to make the daemon startup process more resistant to causing any corruption? Any help is greatly appreciated. Thanks, Mark => => => > Is there any other information that I can provide? Do you suggest running => > bacula-sd or bacula-dir at any particular debugging level to try to get => > more detail if this happens again? => => I suspect that are no debug messages that would show anything useful here, => because bacula-sd is probably unaware of the problem. => => If you have some way to query the actual tape position periodically then that => would be useful (I don't know if Linux has something similar, but FreeBSD's mt => status command reports this and it can be used on the control mode device even => if Bacula has the drive open). => => __Martin > ------------------------------------------------------------------------------ Try before you buy = See our experts in action! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-dev2 _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users