On Tue, Apr 6, 2010 at 3:23 AM, Matija Nalis <mnalis+bac...@carnet.hr> wrote:
> On Sun, Apr 04, 2010 at 01:20:49PM -0600, Robert LeBlanc wrote:
> > I'm having problems with our SD and tapes being locked in the
> > drive occasionally.
>
> How does it manifest exactly ? bconsole umount command returns error,
> or remains in some state (check with "status storage") ? Which state
> and/or error ?
>
> Have you tried shutting down bacula-sd and ejecting tape with "mt
> eject" and/or "mt offline" ? Do they succeed (and the drive ejects)
> or do they return error (and which one) ? Double check that bacula-sd
> is down before you try those (they won't work if bacula-sd is still
> having the drive open).
>
> And if mt(1) also fails, can you eject tape manually by using tape
> library eject function and/or pressing hardware eject button on the
> drive itself (depending on the library type...) ?
>
I've tried exactly this in the past. Bacula will usually spit out an error
that the tape could not be moved or, in rarer situations, say that the drive
is not there. I then shut down bacula-sd and try to run the mt eject command.
I usually get back about ten lines that describe the error, but they don't
really make sense. Sometimes the drive no longer appears as a device on the
system. As for the tape library, the Overland Neo 8000 most of the time shows
"soft removal error" on its screen and keeps saying that if I try to have the
library remove the tape. There is no easy way to get to the hardware eject
button, as the library is fully enclosed.
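
The manual sequence described above can be sketched roughly as follows. This
is only an illustration, not a tested procedure: the device names /dev/nst0
and /dev/sg3, the init script path, and the helper names run and release_tape
are assumptions made for the example. It defaults to a dry run that only
prints the commands.

```shell
#!/bin/sh
# Sketch only: device paths below are assumptions, adjust for your system.
TAPE_DEV="${TAPE_DEV:-/dev/nst0}"      # non-rewinding tape device (assumed)
CHANGER_DEV="${CHANGER_DEV:-/dev/sg3}" # library changer device (assumed)
DRY_RUN="${DRY_RUN:-1}"                # 1 = print commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: release a tape from <drive> back to <slot>.
release_tape() {
    slot="$1"
    drive="$2"
    # 1. Stop the storage daemon so it no longer holds the drive open.
    run /etc/init.d/bacula-sd stop || return 1
    # 2. List any process that still has the tape device open (ideally none).
    #    fuser exits non-zero when nothing uses the device, so ignore its status.
    run fuser -v "$TAPE_DEV" || true
    # 3. Ask the drive for its status, then take the tape offline (rewind/eject).
    run mt -f "$TAPE_DEV" status || return 1
    run mt -f "$TAPE_DEV" offline || return 1
    # 4. Only then ask the library to move the tape back to its slot.
    run mtx -f "$CHANGER_DEV" unload "$slot" "$drive"
}
```

With the functions loaded, "release_tape 5 0" prints the five commands for
slot 5 and drive 0, and "DRY_RUN=0 release_tape 5 0" would actually run them.
Noting which step fails first helps separate a Bacula problem from a
drive, library or kernel one.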

We have had the LTO-3 drives replaced on multiple occasions for this very
reason, which is why I don't think it is the drive. Each drive has a fibre
connection to the fibre switch, and it has happened on both our LTO-3
drives and our LTO-4 drive. The only things I can think of are that Bacula
is trying to take some shortcuts (issuing a command to move the tape and
expecting the tape library to correctly rewind the tape, eject it and then
move it, while Bacula is maybe not quite letting go of the drive fast enough,
so there is a deadlock between the drive controlled by Bacula and the
library trying to control it), or that there is a kernel/driver problem.

I've set offline=1 in mtx-changer.conf and that seems to help a little.
I've still encountered some drive unmounting issues, but nothing that Bacula
hasn't been able to recover from on its own or with very little manual
intervention.
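
For reference, a fragment of mtx-changer.conf along these lines is sketched
below. Only offline=1 comes from the setup described above. The offline_sleep
and load_sleep variables are how the file shipped with recent Bacula versions
names its delay settings as far as I recall, and the values are arbitrary
examples, so check the comments in your own copy before relying on them.

```shell
# mtx-changer.conf fragment (sourced by the mtx-changer shell script).
# Sketch only: variable names other than "offline" are assumed from the
# stock file and may differ between Bacula versions.

# Issue "mt offline" on the drive before asking the library to unload it.
offline=1

# Seconds to wait after the offline before the unload (example value).
offline_sleep=30

# Seconds to wait after a load before handing the drive back (example value).
load_sleep=30
```

Longer sleeps make every mount and unmount slower, but they give the drive
time to finish rewinding and ejecting before the library tries to grab the
tape.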

> If mt works but bacula-sd doesn't, then you can rule out hardware and
> kernel -- it is bacula problem (and usually "status storage" will
> show it -- it can happen sometimes if you have more than one drive
> that it deadlocks by waiting for a tape that is in the other drive).
>
> > At first I thought this might be a problem with our tape
> > library.
>
> That still looks like the most probable cause to me - like a drive in
> the library is having problems. We've had a similar issue with one of
> several LTO2 drives in our library; it would (sometimes) take the
> tape and refuse to give it back (on "mt eject" and even physical
> button touch). Needed power cycling and long (half a minute?) button
> press to make it give the tape back.
>
> After it happened third time (always the same drive) we kicked it out
> of the library. Other drives worked OK all the time.
>
> If the hardware button always works but software commands don't, it
> could be fiber cables and/or GBIC/SFP (which we refused to believe at
> one time because drives were always detected OK and worked, albeit
> sometimes much slower than normal, without any errors in kernel logs,
> and would also lock up). You can try cleaning tape also.
>
> > Then I saw these errors in the syslog. I switched out the Qlogic FC
> > adapter thinking that maybe it was just losing all the paths to the
> > drive.
>
> AFAIR you would get different errors if it loses path completely (but
> it is possible for drive to behave erratically even if it doesn't
> lose path)
>
I have seen times where there are path errors, and that is when the drive
seems to disappear from the system completely, but this is not the usual
case, nor the one that causes the most problems.

> > I'm still getting the errors, so I'm not sure where the hangup is. I
> > can't tell if it's a bug in the kernel module, mt or bacula. Can someone
> > give me some pointers to narrowing this down? This has been happening
> > for over a year and through several kernel and bacula versions.
> >
> > This is Debian Squeeze
> >
> > Linux lsddomainsd 2.6.32-trunk-686 #1 SMP Sun Jan 10 06:32:16 UTC 2010 i686 GNU/Linux
>
> The "INFO:" messages themselves are just a "normal" feature of newer
> 2.6.x kernels. They are informational only (see "INFO:") and tell
> you that some system call (like open(2), write(2) or read(2)) is
> taking longer than 120 seconds to complete. They didn't exist in
> older kernels.
>
> They are there to catch problems with I/O schedulers and problematic
> hardware -- but sometimes the limit needs to be increased for tape
> drives (it is quite possible for open(2) or lseek(2) on a tape to have
> to rewind it, and that can sometimes take more than two minutes).
>
> You can raise the current kernel limit with:
> echo 300 > /proc/sys/kernel/hung_task_timeout_secs
>
> or (to survive reboot) by putting:
> kernel.hung_task_timeout_secs=300
> in /etc/sysctl.conf (or a file in /etc/sysctl.d directory)
>
> But as I say, those will not help your lockup problems, just make the
> spurious messages go away when they are to be expected.
>
>
> Try the other things in the mail to narrow the problem down to
> bacula, kernel or hardware.
>
I was pretty sure the messages were informational; I'm glad that someone can
confirm that. I'll keep working on the problem to see what I can come up
with. If there is a better way to tell Bacula to be stupid slow with unmount
and mount requests, that may help me find where in the process things are
getting hung up.

Thanks,
Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University