Re: [Bacula-users] Bacula marking tapes Full with only a few GB written

Kern Sibbald Sun, 16 Apr 2006 13:06:02 -0700

On Sunday 16 April 2006 20:23, Wolfgang Denk wrote:
> In message <[EMAIL PROTECTED]> you wrote:
> > Then it sounds to me more like a bacula issue rather than the SCSI tape
> > driver.
>
> I disagree. We get  pretty  clear  SCSI  error  messages  (unexpected
> disconnect).  No matter what a user application does, the SCSI driver
> must never run into such a situation. This is a SCSI driver problem.
>
> > A problem in diagnosing it is that it is not reproducible. This could
> > indicate a
>
> The problem *is* reproducible. For me it happens pretty reliably. The
> problem is that it takes a loooooong time - typically hours. And I have
> to admit that I didn't find (or take) the time to really start
> debugging it. Probably raising debug levels for the SCSI system would
> be a good start, but I'm not convinced.
>
> BTW: I wrote before that this happens without spooling only; this was
> wrong. Scanning the logs I've seen cases of this problem when
> spooling was active, too.
>
> > timing issue as you've pointed out so if a trace is set up to catch the
> > villain the incident may not occur at all. What can we do?
>
> Let's summarize the observed symptoms again:
>
> * On user level we see error messages like these:
>
>       Error: block.c:538 Write error at 39:5706 on device "SLR100" (/dev/nst0). ERR=Input/output error.
>       Error: Error writing final EOF to tape. This Volume may not be readable.
>       dev.c:1536 ioctl MTWEOF error on "SLR100" (/dev/nst0). ERR=Input/output error.
>
> * On system level we see error messages like these:
>
>       sym0: unexpected disconnect
>       st0: Error 700ff (sugg. bt 0x0, driver bt 0x0, host bt 0x7).
>       sym0: unexpected disconnect
>       st0: Error 700ff (sugg. bt 0x0, driver bt 0x0, host bt 0x7).
>       st0: Error with sense data: <6>st0: Current: sense key: Unit Attention
>           Additional sense: Power on, reset, or bus device reset occurred
>
> * It happens with different types of tape drives; for me with a SLR60
>   drive and 3 x SLR100 autoloaders.
>
> * It happens with different types of SCSI controllers; for me with:
>   - LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI
>   - Adaptec aic7899 Ultra160 SCSI adapter
>   - Adaptec AHA-2940UW Ultra SCSI adapter
>   - Dawicontrol DC-29160 Ultra160 SCSI adapter
>
> * It happens long before the tape is actually full.
>
> * I never had any other kinds of I/O errors, only this "Error writing
>   final EOF"; this boils down to a MTIOCTOP ioctl() with op=MTWEOF
>   and count=1 - and this is probably the major difference to all
>   other tape tests I've tried: none of the other tools I use to write
>   to a tape (like tar etc.) actually write an EOF themselves; they just
>   close the tape device at the end of the write operations.


Your reasoning is sound up to this last point.  Looking at the Bacula error 
messages that you list above, it is always an I/O error writing a Bacula block 
that starts the problem.  Once Bacula gets an I/O error, it terminates the tape 
by writing an EOF, and if it gets an error writing this EOF, it reports that 
too, along with the message saying that the Volume may not be readable.  If the 
EOF is written correctly, it does not print the "This Volume may not be 
readable" message.

IMO, the root problem occurs when writing the buffers (a write() request), 
not in the subsequent ioctl(MTWEOF).  Also, between the write() that fails and 
the ioctl(MTWEOF), Bacula issues another ioctl(), which varies according to 
the OS.  On a Linux machine, for example, this is ioctl() MTIOCTOP 
with mt_op=MTIOCLRERR.  In all cases, the purpose of this ioctl() is to attempt 
to clear any error condition in the SCSI driver so that a valid EOF can 
terminate the Volume.  On Linux, this may not be necessary, but on other OSes 
such as FreeBSD, the SCSI driver locks out virtually all I/O operations after 
a serious error.
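
To make the sequence concrete, here is a minimal sketch of it, assuming Linux 
and <sys/mtio.h>.  This is not the actual Bacula code: the function names 
(write_eof_mark, clear_tape_error, write_block_or_terminate) and the return 
codes are made up for illustration, and the error-clearing step is left as an 
empty placeholder because the real request differs from OS to OS.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mtio.h>

/* Write one EOF mark: an MTIOCTOP ioctl() with op=MTWEOF and count=1. */
int write_eof_mark(int fd)
{
    struct mtop op;

    op.mt_op = MTWEOF;
    op.mt_count = 1;
    return ioctl(fd, MTIOCTOP, &op);
}

/*
 * Placeholder (hypothetical): the OS-specific ioctl() that tries to clear
 * the driver's error state would go here. It is deliberately a no-op in
 * this sketch.
 */
void clear_tape_error(int fd)
{
    (void)fd;
}

/*
 * Write one block. On failure, try to terminate the tape with an EOF mark.
 * Returns  0 if the block was written,
 *         -1 if the write failed but the EOF mark was written,
 *         -2 if both the write and the EOF mark failed.
 */
int write_block_or_terminate(int fd, const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    ssize_t n = write(fd, buf, len);
    if (n == (ssize_t)len)
        return 0;

    /* This is the error that comes first in the log. */
    fprintf(stderr, "write error: %s\n",
            n < 0 ? strerror(errno) : "short write");

    clear_tape_error(fd);
    if (write_eof_mark(fd) < 0) {
        /* Only now would the "may not be readable" warning apply. */
        fprintf(stderr, "error writing final EOF: %s\n", strerror(errno));
        return -2;
    }
    return -1;
}
```

Calling write_block_or_terminate() in a loop on a descriptor opened on the 
non-rewinding device (e.g. /dev/nst0) and stopping at the first nonzero return 
shows which of the two operations fails first.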

>
>
> Maybe I'm going to write some test code for such a scenario - write
> some buffers followed by an MTWEOF op...

My best guess is that the problem is some sort of kernel SCSI lock race 
condition.  As a consequence, I would recommend that you concentrate on 
writing lots of buffers as fast as you can, but from multiple processes, 
possibly to the same or different drives.  In fact, you might try firing off 
several hundred write processes, and possibly a few read processes to another 
drive. 
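
A rough sketch of such a writer test follows, assuming Linux/POSIX.  The 
function name stress_writers and its parameters are hypothetical.  As far as I 
know the Linux st driver refuses a second open() of a tape device that is 
already in use, so the sketch opens the device once and lets the forked 
children inherit the descriptor; to load different drives, run it once per 
device.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Fork nprocs writers that each push nblocks blocks of blksz bytes to the
 * device at path as fast as they can, all sharing one inherited descriptor.
 * Returns the number of writers that failed (including failed forks),
 * or -1 if the device could not be opened.
 */
int stress_writers(const char *path, int nprocs, int nblocks, size_t blksz)
{
    assert(path != NULL && nprocs > 0 && nblocks > 0 && blksz > 0);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    int started = 0, failed = 0;
    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0) {              /* could not start this writer */
            failed++;
            continue;
        }
        if (pid == 0) {             /* child: write blocks back to back */
            char *buf = malloc(blksz);
            if (buf == NULL)
                _exit(2);
            memset(buf, 'A' + i % 26, blksz);
            for (int b = 0; b < nblocks; b++) {
                if (write(fd, buf, blksz) != (ssize_t)blksz)
                    _exit(1);       /* I/O error or short write */
            }
            _exit(0);
        }
        started++;
    }
    close(fd);                      /* children keep their own copies */

    /* Collect the writers; any nonzero or abnormal exit counts as failed. */
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            failed++;
    }
    return failed;
}
```

For example, stress_writers("/dev/nst0", 200, 100000, 64512) would start 200 
writers on the drive; the device name, process count, and block size here are 
only placeholders to adjust for your setup.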

When the SCSI driver complains about an unexpected disconnect, it is very 
likely because it either missed an interrupt or it issued a command at a bad 
time (i.e. a missing lock), or it overran the SCSI command queue.

-- 
Best regards,

Kern

  (">
  /\
  V_V

