On Monday 30 July 2007 17:29, Ryan Novosielski wrote:
> Kern Sibbald wrote:
> > On Sunday 29 July 2007 19:28, Ryan Novosielski wrote:
> >> Hi all,
> >>
> >> Ever since I added the TapeAlert/smartmonctl command to my tape drive,
> >> it appears as if I get a fairly regular crash of that bacula-sd. I know
> >> there is a case where Bacula and the utility can go for the tape drive
> >> at the same time and cause problems, but I don't think Bacula should go
> >> KABOOM when this happens.
> >
> > The traceback, unfortunately, doesn't demangle the C++ subroutine names
> > nor provide source line numbers, but the best that I can tell is that the
> > heap has been corrupted, Bacula detects it, then does a Kaboom (self
> > inflicted seg fault).
> 
> I'm not too good with development tools -- is there a reason this would
> be that is on my end (stripped binaries or something like that)?

Yes, if they are stripped, that would at least explain the lack of line 
numbers and possibly the fact that the names were not demangled.
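
As an illustration only, here is a rough way to check whether a binary has
been stripped: look for a full symbol table section (type SHT_SYMTAB) in the
ELF section headers. This is a minimal sketch, not part of Bacula; the
function name has_symbol_table is hypothetical, and it assumes a well-formed
ELF image passed in as bytes. A missing symbol table only suggests stripping;
line numbers additionally depend on debug sections being present.

```python
import struct

SHT_SYMTAB = 2  # ELF section type for the full symbol table removed by strip


def has_symbol_table(data: bytes) -> bool:
    """Return True if the ELF image contains a SHT_SYMTAB section."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = data[4] == 2                    # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little, 2 = big
    if is_64:
        (shoff,) = struct.unpack_from(endian + "Q", data, 40)
        shentsize, shnum = struct.unpack_from(endian + "HH", data, 58)
    else:
        (shoff,) = struct.unpack_from(endian + "I", data, 32)
        shentsize, shnum = struct.unpack_from(endian + "HH", data, 46)
    for i in range(shnum):
        # sh_type is the second 32-bit word of every section header
        (sh_type,) = struct.unpack_from(endian + "I", data,
                                        shoff + i * shentsize + 4)
        if sh_type == SHT_SYMTAB:
            return True
    return False
```

To use it, read the daemon binary in binary mode and pass the bytes in; the
path to the binary depends on the installation.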

> 
> > Are you by any chance pointing the tapealert/smartmonctl at the tape drive
> > device rather than at the scsi control device?  If you are, I am not
> > surprised, and you should remove it as two different programs cannot
> > properly exist using the same tape device.
> 
> Yes. 

Well, that is almost surely the cause of the problem.

> And I actually will remove that, but near as I can tell, Solaris 
> does not have a way of addressing the tape drive as two different
> devices. 

Well, Solaris has 20 ways to do everything connected with devices.  In any 
case, it is not a question of addressing the tape drive as two different 
devices.  One addresses the tape drive through the standard tape driver, 
which is the /dev/rmt/xxxx stuff.

The other is a SCSI pass-through driver that allows SCSI commands to go 
directly to the SCSI controller, and if I remember right they are addressed 
with /dev/sg/xxxx.  That you will need to find out from someone else on the 
list or from your manuals -- I no longer have a Solaris machine.

> From what I've read, the reason for this is that Solaris 
> supposedly has the ability to do two actions on the device at once --
> I'm not sure where I read that though in order to confirm it. I went
> looking for information about using the 'sgen' Solaris driver in order
> to instead use the control interface, but it appears as if sgen is only
> used to pick up devices that don't already have a type elsewhere; in
> other words, I could stop using 'st' and start using 'sgen', but that
> really wouldn't get me anywhere. Perhaps someone else will read about
> this and give me a pointer.

I'll leave this to others.

> 
> > If you are pointing it at the scsi control device, I would be interested
> > to see what the normal output of the command gives back as there may be a
> > possible buffer overrun though that really should not happen.
> >
> > In any case, I recommend that you remove the tape alert for a time and see
> > if that eliminates the problem.
> 
> I suspect it will, as it only showed up when I added it, near as I can
> tell. A KABOOM seems like something that ought not happen either way,
> though, although I suppose if something is corrupting buffers, it can't
> be avoided. Curious, though, as the tapealert often returns "Device
> busy" which would seem to mean that there's no chance that the other
> thing using the device would actually have an error.

Well, when Bacula calls the tapealert command, it releases the drive, so it 
doesn't get a busy. 

Many OSes such as Linux and Solaris permit addressing the SCSI controller 
directly through the normal tape driver, but this is a very bad idea (it is 
apparently what you are doing), and from everything users have said, it 
causes lots of problems such as resetting the SCSI controller.  If you do 
that, all bets are off concerning Bacula correctly interfacing through the 
normal tape driver.  
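
As an illustration only, a site-local alert script could refuse to touch the
standard tape driver nodes at all. This is a minimal sketch under stated
assumptions: the names is_tape_driver_node and check_alert_device are
hypothetical, not Bacula functions, and the only path knowledge used is the
/dev/rmt/ location mentioned above. The pass-through device name is left for
the caller to supply.

```python
import posixpath

# Directory of the standard tape driver nodes (from the discussion above).
TAPE_DRIVER_DIR = "/dev/rmt"


def is_tape_driver_node(device: str) -> bool:
    """Return True if the path points into the standard tape driver tree."""
    norm = posixpath.normpath(device)
    return norm == TAPE_DRIVER_DIR or norm.startswith(TAPE_DRIVER_DIR + "/")


def check_alert_device(device: str) -> str:
    """Return the device unchanged, or raise if it is a tape driver node."""
    if is_tape_driver_node(device):
        raise ValueError(
            "refusing to query %s: it is the tape driver node that the "
            "storage daemon uses" % device
        )
    return device
```

A wrapper would call check_alert_device on its argument before running any
alert utility, so a misconfigured device path fails loudly instead of
competing with the storage daemon for the drive.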

If you would like to track it down, I think it would be good to eliminate the 
KABOOM if it is possible (it may well not be possible).  However, you *are* 
apparently doing something very non-standard, and that is where things go 
wrong.

> 
> >> This does not happen every day, but every once in a while... it occurs at
> >> the end of a set of concurrent backups to tape -- all incrementals, 7 in
> >> total. By the time my catalog backup runs 2 hours later, the -sd has
> >> died and there is no connection made.
> >>
> >> The host machine is running Solaris 9, and the binaries are from
> >> BlastWave (currently version 2.0.3 with 2.0.2 clients, but until the day
> >> before yesterday, the admin/server machine was running 2.0.2 with
> >> identical results). I have not tried 2.1.x, but I would not be allowed
> >> to run a production schedule on a beta -- perhaps an exact copy on the
> >> same machine but writing to disk might yield the same results, but I
> >> suspect that this is caused by the TapeAlert, so maybe not.
> >
> > For a problem with tape alert, it is very unlikely that upgrading to 2.1.x
> > will help.
> >
> >> Thanks for any insights you can provide -- I'd be happy to report a bug
> >> if it is needed.
> >
> > Until I see your response and think about it, I don't think this is worth
> > a bug report, at least not just yet.
> 
> OK, that is fine. If there's any easy way to try to get more information
> out of this thing, let me know. I actually had a fair amount of trouble
> getting this much in the first place -- if you run your bacula-dir as a
> non-root user as one really should, it then cannot run proper traces
> against daemons that run as root. I had to involve sudo; originally, I
> had no idea that this even ran by itself until I saw a number of empty
> traceback e-mails in root's box.
> 

Yes, well, system security measures often make debugging more difficult.  
Unfortunately, there is not much I can do about that except to say that 
either you need to understand the finer points of your OS's security system 
or run as root when attempting to debug these kinds of problems (I *never* 
debug as root here though).

My recommendation is to stop using tapealert until you figure out what the 
pass through SCSI driver is, then you could try using it.  If you can figure 
out how to run the debugger manually and dig deeper into the problem, that 
would be interesting too, but it is likely to take a lot of time ...

Regards,

Kern

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
