Hello,
Volker Dierks wrote:
...
perhaps I'll give it a try. But a little tale first.
We've got an HP 2/20 library with 2 DLT-8000 drives. Our backup box is running
Debian GNU/Linux 3.0 and Bacula 1.38.2, with 11 nodes. The system went into
production on Wednesday (with one drive), with tremendous success. Bacula is
really great.
To speed things up, I tried to activate the second drive on Thursday. I
created a second pool and relabeled some tapes into that pool. Everything I've
found about using multiple drives says that several pools are needed.
These were the configuration changes:
...
After 20 minutes I tried to cancel the (still stuck) jobs, without success.
So I stopped bacula-dir and bacula-sd, which left two bacula-sd processes
behind in state D. They couldn't be killed, so I rebooted the box. This also
failed: the kernel booted, but then reported that init couldn't find the root
partition. After a power off/on the box came up as usual.
Well, I haven't tried jobs going to different drives in one autochanger,
so I won't discuss that part of your report.
My conclusion is that the second drive is faulty and blew up the SCSI bus
(see the kernel log at the end). Job 2 was stuck at 160 MB. In the meantime,
job 1 finished writing 450 MB and job 3 was started. If I remember correctly,
job 3 was able to write 2.6 GB to drive one until it also got stuck. I don't
know whether a faulty tape can cause such an incident.
Hardly, but that doesn't mean it's impossible. Similar kernel driver
reports and SCSI subsystem hangs have occurred here, and I'm quite sure -
again, not absolutely - that they resulted from a combination of a drive
hardware error and an imperfect driver.
In fact, there are reports that the aic7xxx driver doesn't work
correctly in all cases, caused by different hardware on different SCSI
HBAs. As far as I know, there have been some issues with the controller
chips handled by this driver, which Adaptec tried to resolve by a number
of "silent" hardware updates. The Adaptec-supplied Windows drivers
obviously know how to handle the different hardware capabilities (and
errors, as some might say), but the Linux drivers don't implement the
necessary functions for all cases. All of this is third-hand knowledge and
completely NOT backed up by any real understanding of the AIC chips and
the corresponding drivers, by the way. Still, I found the source code of
the Linux drivers quite interesting, as there are some references to
special handling of certain conditions on some AIC chips.
By the way: Here, when I saw such errors, they were, as far as I can
tell, always caused by actual SCSI errors from some devices - I had a
spool disk dying during despooling, for example, and I had some real
tape drive errors that could only be recovered by power cycling the tape
drive. Still, some of the errors I could identify *should* have been
handled by the drivers without a SCSI subsystem breakdown.
Usually, I'd see if the problem can be reproduced with the existing
system setup. If that's possible, I'd first check if the actual cause
might be purely SCSI device related.
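One way to see whether the aborts cluster on a single device is to count them per SCSI address in the kernel log. The following is only a minimal sketch: the regular expression is an assumption derived from the "Attempting to queue an ABORT" line in the log excerpt in this thread, and the names `ABORT_RE` and `count_aborts` are made up for illustration. Other kernel or driver versions may word the message differently.

```python
import re
from collections import Counter

# Assumed message format, taken from the excerpt in this thread, e.g.
#   Dec 9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message
# The four numbers are read as host, channel, target and LUN.
ABORT_RE = re.compile(
    r"kernel: scsi(\d+):(\d+):(\d+):(\d+): Attempting to queue an ABORT"
)

def count_aborts(lines):
    """Count ABORT attempts per (host, channel, target, lun) tuple."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[tuple(int(g) for g in match.groups())] += 1
    return counts

# Small demonstration with lines in the format shown in this thread.
sample = [
    "Dec 9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message",
    "Dec 9 01:18:59 backup kernel: Card was paused",
]
for device, n in sorted(count_aborts(sample).items()):
    print("scsi%d:%d:%d:%d: %d abort attempt(s)" % (device + (n,)))
```

If all aborts point at the same target over several runs, that would at least be consistent with a single misbehaving device rather than a general bus or driver problem.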
On the other hand (which is what I hope), there could be a configuration error
(Job {} and Client {} didn't have Maximum Concurrent Jobs set), or the changes
in this BETA will fix this behaviour.
Well, you can always try it, assuming you're willing to use beta software in
a production system. Having read Kern's report, personally, I'd try it,
but I don't have really vital data here. Of course, as far as I can see,
it's unlikely that Bacula can destroy existing data; in the worst case
I can imagine, you might lose some existing volumes and your catalog, I
think.
Arno
I've planned to add the second drive again tomorrow and use another tape.
Should I also upgrade to 1.38.3?
Volker
Dec 9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message
Dec 9 01:18:59 backup kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0
Dec 9 01:18:59 backup kernel: scsi1: At time of recovery, card was not paused
Dec 9 01:18:59 backup kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Dec 9 01:18:59 backup kernel: scsi1: Dumping Card State while idle, at SEQADDR 0x8
Dec 9 01:18:59 backup kernel: Card was paused
Dec 9 01:18:59 backup kernel: ACCUM = 0x0, SINDEX = 0x3, DINDEX = 0xe4, ARG_2 = 0x0
Dec 9 01:18:59 backup kernel: HCNT = 0x0 SCBPTR = 0x0
Dec 9 01:18:59 backup kernel: SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0]
Dec 9 01:18:59 backup kernel: LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0]
Dec 9 01:18:59 backup kernel: SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8]
Dec 9 01:18:59 backup kernel: SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4]
Dec 9 01:18:59 backup kernel: SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89]
Dec 9 01:18:59 backup kernel: STACK: 0x0 0x163 0x109 0x3
Dec 9 01:18:59 backup kernel: SCB count = 5
Dec 9 01:18:59 backup kernel: Kernel NEXTQSCB = 2
Dec 9 01:18:59 backup kernel: Card NEXTQSCB = 2
Dec 9 01:18:59 backup kernel: QINFIFO entries:
Dec 9 01:18:59 backup kernel: Waiting Queue entries:
Dec 9 01:18:59 backup kernel: Disconnected Queue entries: 1:4
Dec 9 01:18:59 backup kernel: QOUTFIFO entries:
Dec 9 01:18:59 backup kernel: Sequencer Free SCB List: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Dec 9 01:18:59 backup kernel: Sequencer SCB Info:
Dec 9 01:18:59 backup kernel: 0 SCB_CONTROL[0xc0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 1 SCB_CONTROL[0x44] SCB_SCSIID[0x57] SCB_LUN[0x0] SCB_TAG[0x4]
Dec 9 01:18:59 backup kernel: 2 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 3 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 4 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 5 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 6 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 7 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 16 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 17 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 18 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 19 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec 9 01:18:59 backup kernel: Pending list:
Dec 9 01:18:59 backup kernel: 4 SCB_CONTROL[0x40] SCB_SCSIID[0x57] SCB_LUN[0x0]
Dec 9 01:18:59 backup kernel: Kernel Free SCB list: 3 1 0
Dec 9 01:18:59 backup kernel: Untagged Q(5): 4
Dec 9 01:18:59 backup kernel: DevQ(0:3:0): 0 waiting
Dec 9 01:18:59 backup kernel: DevQ(0:3:63): 0 waiting
Dec 9 01:18:59 backup kernel: DevQ(0:4:0): 0 waiting
Dec 9 01:18:59 backup kernel: DevQ(0:5:0): 0 waiting
Dec 9 01:18:59 backup kernel:
Dec 9 01:18:59 backup kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Dec 9 01:18:59 backup kernel: (scsi1:A:5:0): Device is disconnected, re-queuing SCB
Dec 9 01:18:59 backup kernel: (scsi1:A:5:0): Abort Message Sent
Dec 9 01:18:59 backup kernel: Recovery code sleeping
Dec 9 01:18:59 backup kernel: (scsi1:A:5:0): SCB 4 - Abort Completed.
Dec 9 01:18:59 backup kernel: Recovery SCB completes
Dec 9 01:18:59 backup kernel: Recovery code awake
Dec 9 01:18:59 backup kernel: aic7xxx_abort returns 0x2002
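A side note on reading the dump above: two of its fields can be decoded with a few lines of Python. This is a sketch under stated assumptions, not a definitive interpretation. It assumes the CDB is a WRITE(6) to a sequential-access (tape) device, where bit 0 of byte 1 is the Fixed flag and bytes 2 to 4 hold the transfer length, and it assumes that SCB_SCSIID carries the target ID in its high nibble and the adapter's own ID in its low nibble, which is how I read the aic7xxx register definitions. The function names are invented for this example.

```python
def parse_cdb_line(line):
    """Extract the CDB bytes from a 'CDB: 0xa 0x0 ...' kernel log line."""
    hex_fields = line.split("CDB:", 1)[1].split()
    return [int(field, 16) for field in hex_fields]

def decode_write6(cdb):
    """Decode a WRITE(6) CDB, assuming the tape (sequential-access) layout."""
    if len(cdb) != 6 or cdb[0] != 0x0A:
        raise ValueError("not a 6-byte WRITE(6) CDB")
    fixed = bool(cdb[1] & 0x01)  # Fixed bit: length counts blocks, not bytes
    length = (cdb[2] << 16) | (cdb[3] << 8) | cdb[4]
    return {"fixed": fixed, "transfer_length": length}

def split_scsiid(value):
    """Split an SCB_SCSIID value (assumed layout: target high, own ID low)."""
    return {"target": (value >> 4) & 0x0F, "our_id": value & 0x0F}

cdb = parse_cdb_line("Dec 9 01:18:59 backup kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0")
print(decode_write6(cdb))   # Fixed bit clear, length 0x00fc00 = 64512
print(split_scsiid(0x57))   # SCB 1 and the pending SCB 4
print(split_scsiid(0x47))   # SCB 0
```

Under these assumptions, the aborted command was a variable-block write of 64512 bytes to target 5, which agrees with the scsi1:A:5:0 address in the abort messages. If I recall correctly, 64512 bytes is also Bacula's default block size, so the command looks like an ordinary data write that never completed.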
-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log
files
for problems? Stop! Download the new AJAX search engine that makes
searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
--
IT-Service Lehmann [EMAIL PROTECTED]
Arno Lehmann http://www.its-lehmann.de