On Monday 17 April 2006 01:02, Wolfgang Denk wrote:
> Dear Kern,
>
> in message <[EMAIL PROTECTED]> you wrote:
> > All your reasoning is absolutely perfect up to this previous point. In
> > looking at the Bacula error messages that you list above, it is always an
> > I/O error writing a Bacula block that produces the problem. Once Bacula
> > gets an
>
> Argh... Thanks for pointing this out. So I always misinterpreted the
> events.

Well, not necessarily. First, you are not as familiar with Bacula messages
as I am, and second, after more thought, I could be completely wrong (see
below), or, as I would prefer to say, "perhaps it is even more complicated".

>
> > IMO, the source problem is coming when writing the buffers (a write()
> > request) and not subsequent ioctl(WEOF). Also, between the write() that
> > fails and the ioctl(WEOF), Bacula will issue some other ioctl(), which
> > varies according to the OS. This ioctl() on a Linux machine, for
> > example, is ioctl() MTIOCTOP with mt_op=MTIOCLRERR. In all cases, the
> > purpose of this ioctl() between the write() and the ioctl(WEOF) is to
> > attempt to clear any error condition in the SCSI driver to permit a valid
> > EOF to terminate the Volume. On Linux, this may not be necessary, but on
> > other OSes such as FreeBSD, the SCSI driver locks out virtually all I/O
> > operations after a serious error.
>
> OK.
>
> > My best guess is that the problem is some sort of kernel SCSI lock race
> > condition. As a consequence, I would recommend that you concentrate on
> > writing lots of buffers as fast as you can, but from multiple processes,
> > possibly to the same or different drives. In fact, you might try firing
> > off several hundred write processes, and possibly a few read processes to
> > another drive.
>
> I will try that, but you just blowed my theory of why we see the
> problem only with bacula, but never (yet) with any other program
> writing to tape.

Bacula *does* use the sequence write(), ioctl(WEOF). However, this is done
only once every 1GB by default, which may be why you see the problem only
infrequently. If the problem is happening at that point, then you will not
see an I/O error message from the write(), but you will see one from the
ioctl(WEOF). Look carefully at the Bacula output.
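
For testing outside Bacula, the write(), ioctl(WEOF) pattern described above can be approximated with a small script. What follows is only a minimal sketch and not Bacula's actual code. The names `write_filemark` and `write_with_filemarks` are hypothetical, the 64512-byte block size is only illustrative, and the `MTIOCTOP` and `MTWEOF` values are assumptions taken from `<sys/mtio.h>` on Linux x86/x86-64 that should be checked on your platform. The result says which step failed, the write() or the WEOF.

```python
import fcntl
import os
import struct

# Assumed Linux x86/x86-64 values from <sys/mtio.h>; verify on your system.
MTIOCTOP = 0x40086D01  # _IOW('m', 1, struct mtop)
MTWEOF = 5             # mt_op code: write an EOF mark (filemark)


def write_filemark(fd, count=1):
    """Ask the tape driver to write `count` EOF marks on fd."""
    # struct mtop { short mt_op; int mt_count; } packed with native alignment
    fcntl.ioctl(fd, MTIOCTOP, struct.pack("hi", MTWEOF, count))


def write_with_filemarks(fd, total_bytes, block_size=64512,
                         max_file_size=1_000_000_000, mark=write_filemark):
    """Write zero-filled blocks to fd, calling mark(fd) every max_file_size bytes.

    Returns (blocks_written, marks_written, failure), where failure is None
    or a tuple ("write" | "weof", OSError) naming the step that failed.
    """
    block = b"\0" * block_size
    written = in_file = blocks = marks = 0
    while written < total_bytes:
        try:
            os.write(fd, block)
        except OSError as exc:
            return blocks, marks, ("write", exc)
        blocks += 1
        written += block_size
        in_file += block_size
        if in_file >= max_file_size:
            try:
                mark(fd)
            except OSError as exc:
                return blocks, marks, ("weof", exc)
            marks += 1
            in_file = 0
    return blocks, marks, None
```

Against a real drive one would open a no-rewind device (for example `os.open("/dev/nst0", os.O_WRONLY)`, a path that is itself an assumption) and could start several such writers from separate processes. The `mark` parameter exists so the loop can be exercised on an ordinary file without a tape drive.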

It is also possible that you are getting the error from the sequence:

   write()
   ioctl(WEOF)
   write()

which is the sequence when Bacula writes an EOF once every 1GB. Perhaps the
ioctl(WEOF) is causing the write() of the next block to fail. Bacula will
then do the ioctl(clear-error) and ioctl(WEOF) "recovery attempt" I mentioned
in my previous email.

All of the above could be tested by setting "Maximum File Size = 100 MB",
for example; in that case, Bacula will write a *lot* more EOF marks (10 times
as many as the default).

>
> > When the SCSI driver complains about an unexpected disconnect, it is very
> > likely because it either missed an interrupt or it issued a command at a
> > bad time (i.e. a missing lock), or it overran the SCSI command queue.
>
> I will try to run some tests...
>
> Best regards,
>
> Wolfgang Denk

-- 
Best regards,

Kern

  (">
  /\
  V_V