On Sunday 16 April 2006 20:23, Wolfgang Denk wrote:
> In message <[EMAIL PROTECTED]> you wrote:
> > Then it sounds to me more like a bacula issue rather than the SCSI tape
> > driver.
>
> I disagree. We get pretty clear SCSI error messages (unexpected
> disconnect). No matter what a user application does, the SCSI driver
> must never run into such a situation. This is a SCSI driver problem.
>
> > A problem in diagnosing it is that it is not reproducible. This could
> > indicate a
>
> The problem *is* reproducible. For me it happens pretty reliably. The
> problem is that it takes a loooooong time - typically hours. And I have
> to admit that I didn't find (or take) the time to really start
> debugging it. Probably raising debug levels for the SCSI system would
> be a good start, but I'm not convinced.
>
> BTW: I wrote before that this happens without spooling only; this was
> wrong. Scanning the logs I've seen cases of this problem when
> spooling was active, too.
>
> > timing issue as you've pointed out so if a trace is set up to catch the
> > villain the incident may not occur at all. What can we do?
>
> Let's summarize the observed symptoms again:
>
> * On user level we see error messages like these:
>
>   Error: block.c:538 Write error at 39:5706 on device "SLR100"
>   (/dev/nst0). ERR=Input/output error.
>   Error: Error writing final EOF to tape. This Volume may not be
>   readable.
>   dev.c:1536 ioctl MTWEOF error on "SLR100" (/dev/nst0).
>   ERR=Input/output error.
>
> * On system level we see error messages like these:
>
>   sym0: unexpected disconnect
>   st0: Error 700ff (sugg. bt 0x0, driver bt 0x0, host bt 0x7).
>   sym0: unexpected disconnect
>   st0: Error 700ff (sugg. bt 0x0, driver bt 0x0, host bt 0x7).
>   st0: Error with sense data: <6>st0: Current: sense key: Unit Attention
>   Additional sense: Power on, reset, or bus device reset occurred
>
> * It happens with different types of tape drives; for me with a SLR60
>   driver and 3 x SLR100 autoloaders.
>
> * It happens with different types of SCSI controllers; for me with:
>   - LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI
>   - Adaptec aic7899 Ultra160 SCSI adapter
>   - Adaptec AHA-2940UW Ultra SCSI adapter
>   - Dawicontrol DC-29160 Ultra160 SCSI adapter
>
> * It happens long before the tape is actually full.
>
> * I never had any other kinds of I/O errors, only this "Error writing
>   final EOF"; this boils down to a MTIOCTOP ioctl() with op=MTWEOF
>   and count=1 - and this is probably the major difference to all
>   other tape tests I've tried: none of the other tools I use to write
>   to a tape (like tar etc.) actually write an EOF themselves; they just
>   close the tape device at the end of the write operations.

All your reasoning is absolutely perfect up to this previous point.

Looking at the Bacula error messages that you list above, it is always
an I/O error writing a Bacula block that starts the problem. Once
Bacula gets an I/O error, it terminates the tape by writing an EOF, and
if it gets an error writing this EOF, it reports it along with the
message saying that the Volume may not be readable. If the EOF is
written correctly, it does not print the "This Volume may not be
readable" message.

IMO, the source of the problem is in writing the buffers (a write()
request) and not in the subsequent ioctl(WEOF).

Also, between the write() that fails and the ioctl(WEOF), Bacula issues
one other ioctl(), which varies according to the OS. On a Linux
machine, for example, it is ioctl() MTIOCTOP with mt_op=MTIOCLRERR. In
all cases, the purpose of this ioctl() between the write() and the
ioctl(WEOF) is to attempt to clear any error condition in the SCSI
driver so that a valid EOF can terminate the Volume. On Linux, this may
not be necessary, but on other OSes such as FreeBSD, the SCSI driver
locks out virtually all I/O operations after a serious error.

> Maybe I'm going to write some test code for such a scenario - write
> some buffers followed by an MTWEOF op...

My best guess is that the problem is some sort of kernel SCSI lock race
condition. As a consequence, I would recommend that you concentrate on
writing lots of buffers as fast as you can, but from multiple
processes, possibly to the same or different drives. In fact, you might
try firing off several hundred write processes, and possibly a few read
processes to another drive.

When the SCSI driver complains about an unexpected disconnect, it is
very likely because it either missed an interrupt, issued a command at
a bad time (i.e. a missing lock), or overran the SCSI command queue.
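
For what it is worth, a test along those lines could be sketched roughly
as below. This is only a sketch under assumptions, not Bacula code: it
assumes Linux with <sys/mtio.h>, the function names (tape_op,
write_then_weof, run_writers) are made up for illustration, and the
64512-byte block size is assumed rather than taken from your
configuration. Each writer sends blocks with write(), tries the
error-clearing ioctl only if the headers define MTIOCLRERR, and then
issues MTIOCTOP with mt_op=MTWEOF and mt_count=1. It forks one writer
per device path given, since the st driver will most likely refuse a
second open of a drive that is already in use.

```c
/* Sketch only, not Bacula code: write blocks to a tape, then MTWEOF. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mtio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCK_SIZE 64512  /* assumption: adjust to the block size in use */

/* Issue one MTIOCTOP operation; returns 0 on success, -1 with errno set. */
int tape_op(int fd, short op, int count)
{
    struct mtop mt;
    mt.mt_op = op;
    mt.mt_count = count;
    return ioctl(fd, MTIOCTOP, &mt);
}

/* Write nblocks blocks, then one filemark.
 * Returns 0 = ok, 1 = write() failed, 2 = only MTWEOF failed,
 * 3 = open() failed. */
int write_then_weof(const char *dev, long nblocks)
{
    static char buf[BLOCK_SIZE];
    int rc = 0;
    int fd = open(dev, O_WRONLY);

    if (fd < 0) {
        fprintf(stderr, "open %s: %s\n", dev, strerror(errno));
        return 3;
    }
    memset(buf, 0xA5, sizeof buf);
    for (long i = 0; i < nblocks; i++) {
        ssize_t n = write(fd, buf, sizeof buf);
        if (n != (ssize_t)sizeof buf) {
            fprintf(stderr, "%s: write failed at block %ld: %s\n", dev, i,
                    n < 0 ? strerror(errno) : "short write");
            rc = 1;
#ifdef MTIOCLRERR
            /* Try to clear the error state before the EOF; result ignored. */
            tape_op(fd, MTIOCLRERR, 1);
#endif
            break;
        }
    }
    /* The explicit EOF that tar and friends never issue themselves. */
    if (tape_op(fd, MTWEOF, 1) < 0) {
        fprintf(stderr, "%s: MTWEOF failed: %s\n", dev, strerror(errno));
        if (rc == 0)
            rc = 2;
    }
    close(fd);
    return rc;
}

/* Fork one writer per device; returns the number of writers that failed. */
int run_writers(char *const devs[], int ndevs, long nblocks)
{
    int failures = 0;
    int status;

    for (int i = 0; i < ndevs; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(write_then_weof(devs[i], nblocks));
        if (pid < 0) {
            perror("fork");
            failures++;
        }
    }
    while (wait(&status) > 0)
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    return failures;
}
```

A small main() that passes argv + 1 and argc - 1 to run_writers(), with
a large block count, would be enough to drive it. Be careful: it
overwrites whatever is on the tapes, so use scratch volumes only. Since
the failure takes hours to show up, it probably makes sense to run it
in a shell loop and keep an eye on the kernel log for the "unexpected
disconnect" messages.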

--
Best regards,

Kern

  (">
  /\
  V_V

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users