Hi All,

First, I want to say thanks to all the Bacula developers for doing such a great job. Second, this email is going to get a bit long as I'm trying to include all the relevant details ... sorry. The subject is an accurate executive summary if you need one :-)
Background: I upgraded from 2.4.3 to 3.0.1 for director and SD last week (on the same Scientific Linux 4.2 machine). Clients are at everything from 1.38.11 upwards (though the "easy-to-get-to" ones are now 3.0.0 or 3.0.1 :-). Tape is an Overland Neo 2000 with a single HP LTO3 drive.

Let me state now that, yes, I know I have a dodgy tape in my autoloader. Looking back at the logs I can now see that the last time this tape (MHB107L3) was full it had an error (DIR and SD were 2.4.3 at the time):

28-Mar 23:10 beunoeth-sd JobId 38837: Committing spooled data to Volume "MHB107L3". Despooling 46,841,852,967 bytes ...
28-Mar 23:32 beunoeth-sd JobId 38837: Error: block.c:568 Write error at 359:15499 on device "LTO3" (/dev/nst0). ERR=Input/output error.
28-Mar 23:33 beunoeth-sd JobId 38837: Error: Re-read of last block OK, but block numbers differ. Last block=471331 Current block=473270.
28-Mar 23:33 beunoeth-sd JobId 38837: End of medium on Volume "MHB107L3" Bytes=359,976,895,488 Blocks=5,579,998 at 28-Mar-2009 23:33.
28-Mar 23:33 beunoeth-sd JobId 38837: 3307 Issuing autochanger "unload slot 8, drive 0" command.

Of course, I didn't notice this error as we have approximately 60 jobs (and I'm far too lazy to carefully check all the emails) AND the tape was not marked as Error status in the catalog AND the job didn't fail.

This weekend that tape was recycled and reused, and the following log was produced for one of the jobs writing to it:

04-Jul 06:00 beunoeth-dir JobId 43745: Start Backup JobId 43745, Job=SkaroASTROWORLD.2009-07-04_06.00.00_37
04-Jul 06:02 beunoeth-dir JobId 43745: Using Device "LTO3"
04-Jul 06:04 beunoeth-sd JobId 43745: Spooling data ...
04-Jul 13:05 beunoeth-sd JobId 43745: User specified spool size reached.
04-Jul 13:05 beunoeth-sd JobId 43745: Writing spooled data to Volume. Despooling 50,000,033,355 bytes ...
04-Jul 15:35 beunoeth-sd JobId 43745: Despooling elapsed time = 01:14:39, Transfer rate = 11.16 M bytes/second
04-Jul 15:38 beunoeth-sd JobId 43745: Spooling data again ...
04-Jul 19:37 beunoeth-sd JobId 43745: User specified spool size reached.
04-Jul 19:37 beunoeth-sd JobId 43745: Writing spooled data to Volume. Despooling 50,000,033,735 bytes ...
04-Jul 21:03 beunoeth-sd JobId 43745: Despooling elapsed time = 01:25:21, Transfer rate = 9.763 M bytes/second
04-Jul 21:05 beunoeth-sd JobId 43745: Spooling data again ...
05-Jul 00:51 beunoeth-sd JobId 43745: User specified spool size reached.
05-Jul 00:51 beunoeth-sd JobId 43745: Writing spooled data to Volume. Despooling 50,000,033,476 bytes ...
05-Jul 04:13 beunoeth-sd JobId 43745: Error: block.c:568 Write error at 397:11999 on device "LTO3" (/dev/nst0). ERR=Input/output error.
05-Jul 04:14 beunoeth-sd JobId 43745: Fatal error: Re-read of last block: block numbers differ by more than one. Probable tape misconfiguration and data loss. Read block=288633 Want block=290548.
05-Jul 04:14 beunoeth-sd JobId 43745: Fatal error: spool.c:298 Fatal append error on device "LTO3" (/dev/nst0): ERR=dev.c:474 Unable to open device "LTO3" (/dev/nst0): ERR=No medium found
05-Jul 04:14 beunoeth-sd JobId 43745: Despooling elapsed time = 00:28:22, Transfer rate = 29.37 M bytes/second
05-Jul 04:15 beunoeth-sd JobId 43745: Fatal error: fd_cmds.c:181 FD command not found:
05-Jul 04:15 beunoeth-sd JobId 43745: Job write elapsed time = 22:11:39, Transfer rate = 1.875 M bytes/second
05-Jul 07:49 skaro-fd JobId 43745: Fatal error: backup.c:944 Network send error to SD. ERR=Broken pipe
05-Jul 07:49 beunoeth-sd JobId 43745: Fatal error: fd_cmds.c:170 Command error with FD, hanging up. Append data error.
05-Jul 07:49 beunoeth-dir JobId 43745: Error: Bacula beunoeth-dir 3.0.1 (30Apr09): 05-Jul-2009 07:49:46
  Build OS:               i686-pc-linux-gnu redhat 4.2
  JobId:                  43745
  Job:                    SkaroASTROWORLD.2009-07-04_06.00.00_37
  Backup Level:           Full
  Client:                 "skaro-fd" 3.0.0 (06Apr09) i386-portbld-freebsd7.2,freebsd,7.2-STABLE
  FileSet:                "SkaroASTROWORLD" 2006-11-08 18:53:33
  Pool:                   "MonthlyPool" (From Run pool override)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "Overland" (From Job resource)
  Scheduled time:         04-Jul-2009 06:00:00
  Start time:             04-Jul-2009 06:02:14
  End time:               05-Jul-2009 07:49:46
  Elapsed time:           1 day 1 hour 47 mins 32 secs
  Priority:               10
  FD Files Written:       156,357
  SD Files Written:       156,357
  FD Bytes Written:       149,802,789,209 (149.8 GB)
  SD Bytes Written:       149,844,100,532 (149.8 GB)
  Rate:                   1613.4 KB/s
  Software Compression:   26.1 %
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         MHB113L3|MHB107L3
  Volume Session Id:      208
  Volume Session Time:    1246395161
  Last Volume Bytes:      151,181,549,568 (151.1 GB)
  Non-fatal FD errors:    0
  SD Errors:              1
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***
05-Jul 07:49 beunoeth-dir JobId 43745: Rescheduled Job SkaroASTROWORLD.2009-07-04_06.00.00_37 at 05-Jul-2009 07:49 to re-run in 3000 seconds (05-Jul-2009 08:39).

Due to old / slow server hardware and a rubbish network, we use up to 6 concurrent jobs spooling 50 GB at a time to keep the tape heads moving as much as possible - so jobs on tapes are interleaved. Jobs vary in size from a few GB to 500+ GB, and according to the catalog there are parts of 11 jobs written to this tape. BTW: I'm using "Reschedule On Error" and "Rerun Failed Levels" (a rough sketch of the relevant spooling/reschedule config is in the P.S. below).

As you can see I've (potentially) lost a large part of this weekend's backups (but hey, I have a bad tape :-( what do I expect). I'm now waiting until the resubmitted job finishes before I can check the tape for read errors (the commands I have in mind are also in the P.S.).

Now to the questions:

* Back in March when the first error occurred ("Error: Re-read of last block OK, but block numbers differ ...") why did the tape not get changed to status "Error" in the catalog, and why did the job not fail? (surely there's potential data loss here) [this may have been fixed since 2.4.3].

* Why do errors that produce ominous-sounding log messages like "Probable tape misconfiguration and data loss" not mark the affected tape as Error in the catalog?

* Why did the error message not say which tape had the I/O error? (it took me a while to figure out which tape was actually loaded at "05-Jul 04:13").

* Why did the job finally fail with a "Network send" error from the FD three and a half hours after the I/O error from the SD? (something is not behaving well on hardware errors).

Finally, I want to stress that I'm not moaning - I paid nothing for this software - and I am really happy with the value for money (it's pulled me out of a number of large holes already :-) ... and I appreciate how complex the SD code must be and that this is a corner case, etc. ... and if I were a programmer I'd be submitting patches :-)

Cheers ...
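
P.S. For context, the interleaving above comes from the standard Bacula spooling and reschedule directives. Roughly what I have looks like the sketch below - trimmed down to the directives relevant here, with placeholder paths/sizes rather than my exact values, and with the other required Job and Device directives (Client, FileSet, Media Type, Autochanger, etc.) left out:

# bacula-dir.conf - Job (or JobDefs) resource
Job {
  Name = "SkaroASTROWORLD"
  Spool Data = yes                      # spool to disk first, then despool to tape
  Reschedule On Error = yes
  Reschedule Interval = 3000 seconds    # matches the "re-run in 3000 seconds" above
  Rerun Failed Levels = yes
}

# bacula-sd.conf - Device resource for the drive
Device {
  Name = LTO3
  Archive Device = /dev/nst0
  Spool Directory = /var/spool/bacula   # placeholder path
  Maximum Job Spool Size = 50GB         # the 50 GB "User specified spool size" chunks
}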
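
P.P.S. Until I understand why the volume doesn't get flagged automatically, I'm checking for accumulated errors and marking the dodgy volume by hand. In bconsole that's roughly (syntax from memory, so double-check against your 3.0.x):

  *list volumes pool=MonthlyPool
  *update volume=MHB107L3 volstatus=Error

or straight against the catalog (this assumes a MySQL catalog and the standard Media table):

  SELECT VolumeName, VolStatus, VolErrors, LastWritten
    FROM Media
   WHERE VolErrors > 0;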
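
P.P.P.S. Once the drive is free, my plan for checking MHB107L3 for read errors is along these lines (device name as defined in my bacula-sd.conf; bls options from memory, so check the man page before trusting this):

  # with the SD stopped (or the drive otherwise idle) and MHB107L3 loaded in the drive:
  bls -j -v -V MHB107L3 -c /etc/bacula/bacula-sd.conf LTO3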
--
Richard Frewin               | Email: richard.fre...@astro.cf.ac.uk
IT Systems Manager           |
PHYSX Computing Support Team | Tel: +44 (0)29 208 75282
Cardiff University           | Fax: +44 (0)29 208 74056
The Parade                   |
Cardiff CF24 3AA             | http://www.astro.cardiff.ac.uk/
______________________________________________________________________
"When the straw breaks the camel's back, blood stains the sand.." - Roaring Jelly