Hello,

Volker Dierks wrote:
...
perhaps I'll give it a try. But a little tale first.

We've got an HP 2/20 library with two DLT-8000 drives. Our backup box is running
Debian GNU/Linux 3.0 and Bacula 1.38.2, with 11 nodes. The system went into
production on Wednesday (with one drive) with tremendous success. Bacula is
really great.

To speed things up, I tried to activate the second drive on Thursday. I
created a second pool and relabeled some tapes into that pool. Everything I've found about using multiple drives says that several pools are needed.
These were the configuration changes:
...
After 20 minutes I tried to cancel the (still stuck) jobs, without success. So I stopped bacula-dir and bacula-sd, which left two bacula-sd processes behind in state D. They couldn't be killed, so I rebooted the box. This also failed: the freshly booted kernel said that init couldn't find the root partition.
After a power-off/on the box came up as usual.

Well, I haven't tried jobs going to different drives in one autochanger, so I won't discuss that part of your report.

My conclusion is that the second drive is faulty and blew up the SCSI bus
(see the kernel log at the end). Job 2 was stuck at 160 MB. In the meantime
job 1 finished writing 450 MB and job 3 was started. If I remember correctly, job 3 was able to write 2.6 GB to drive one until it also got stuck. I don't
know if a faulty tape can cause such an incident.

Unlikely, but that doesn't mean it's impossible. Similar kernel driver reports and SCSI subsystem hangs have occurred here, and I'm quite sure - again, not absolutely - that they resulted from a combination of a drive hardware error and an imperfect driver.

In fact, there are reports that the aic7xxx driver doesn't work correctly in all cases, because the hardware differs between SCSI HBAs. As far as I know, there have been some issues with the controller chips handled by this driver, which Adaptec tried to resolve with a number of "silent" hardware updates. The Adaptec-supplied Windows drivers apparently know how to handle the different hardware capabilities (and errors, as some might say), but the Linux drivers don't implement the necessary functions for all cases. This is all third-hand knowledge and completely NOT backed up by any real understanding of the AIC chips and the corresponding drivers, by the way. Still, I found the source code of the Linux drivers quite interesting, as there are some references to special handling of certain conditions on some AIC chips.

By the way: when I saw such errors here, they were, as far as I can tell, always caused by actual SCSI errors from some device - I had a spool disk dying during despooling, for example, and I had some real tape drive errors that could only be recovered from by power cycling the tape drive. Still, some of the errors I could identify *should* have been handled by the drivers without a SCSI subsystem breakdown.

Usually, I'd see if the problem can be reproduced with the existing system setup. If that's possible, I'd first check if the actual cause might be purely SCSI device related.
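
As a starting point for that check, here is a rough sketch (not a tested tool) that scans kernel log lines in the format of the log quoted at the end of this mail. It counts how often the aic7xxx recovery code tried to queue an ABORT per device and decodes the CDB line that follows. The function names and the small opcode table are my own for illustration; the byte layout assumed for READ(6)/WRITE(6) is the usual one for tape devices (bit 0 of byte 1 is the fixed-block flag, bytes 2 to 4 are the transfer length).

```python
import re
from collections import Counter

# Matches e.g. "... kernel: scsi1:0:5:0: Attempting to queue an ABORT message"
ABORT_RE = re.compile(
    r"kernel: (scsi\d+):(\w+):(\d+):(\d+): Attempting to queue an ABORT message"
)
# Matches the CDB dump that follows, e.g. "... kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0"
CDB_RE = re.compile(r"kernel: CDB:((?: 0x[0-9a-fA-F]+)+)")

# A few 6-byte commands for sequential-access (tape) devices.
TAPE_OPCODES = {
    0x00: "TEST UNIT READY",
    0x01: "REWIND",
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x10: "WRITE FILEMARKS(6)",
    0x11: "SPACE(6)",
}


def decode_cdb(cdb):
    """Return a short description of a CDB given as a list of byte values."""
    name = TAPE_OPCODES.get(cdb[0], "opcode 0x%02x" % cdb[0])
    if cdb[0] in (0x08, 0x0A) and len(cdb) == 6:
        fixed = bool(cdb[1] & 0x01)  # fixed-block mode flag
        length = (cdb[2] << 16) | (cdb[3] << 8) | cdb[4]
        return "%s, %d %s" % (name, length, "blocks" if fixed else "bytes")
    return name


def summarize(lines):
    """Count ABORT attempts per device and decode the command being aborted."""
    aborts = Counter()
    commands = []
    last_device = None
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            host, _channel, target, lun = m.groups()
            last_device = "%s target %s lun %s" % (host, target, lun)
            aborts[last_device] += 1
            continue
        m = CDB_RE.search(line)
        if m and last_device is not None:
            cdb = [int(tok, 16) for tok in m.group(1).split()]
            commands.append((last_device, decode_cdb(cdb)))
            last_device = None
    return aborts, commands
```

Fed the first two lines of the log below, this reports one abort on scsi1 target 5 and decodes the CDB as a WRITE(6) of 64512 bytes, i.e. an ordinary tape write was pending when the recovery started. To run it over a whole log, something like `summarize(open("/var/log/kern.log"))` should do; that path is only an assumption about where kernel messages end up on your box.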

On the other hand (which is what I hope) there could be a configuration error (Job {} and Client {} didn't have Maximum Concurrent Jobs set) or the changes
in this BETA will fix this behaviour.

Well, you can always try it, assuming you're willing to use beta software in a production system. Having read Kern's report, personally I'd try it, but I don't have really vital data here. Of course, as far as I can see, it's unlikely that Bacula would destroy existing data; in the worst case I can imagine, you might lose some existing volumes and your catalog.

Arno

I plan to add the second drive again tomorrow and use another tape.
Should I also upgrade to 1.38.3?

Volker

Dec  9 01:18:59 backup kernel: scsi1:0:5:0: Attempting to queue an ABORT message
Dec  9 01:18:59 backup kernel: CDB: 0xa 0x0 0x0 0xfc 0x0 0x0
Dec  9 01:18:59 backup kernel: scsi1: At time of recovery, card was not paused
Dec  9 01:18:59 backup kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
Dec  9 01:18:59 backup kernel: scsi1: Dumping Card State while idle, at SEQADDR 0x8
Dec  9 01:18:59 backup kernel: Card was paused
Dec  9 01:18:59 backup kernel: ACCUM = 0x0, SINDEX = 0x3, DINDEX = 0xe4, ARG_2 = 0x0
Dec  9 01:18:59 backup kernel: HCNT = 0x0 SCBPTR = 0x0
Dec  9 01:18:59 backup kernel: SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0]
Dec  9 01:18:59 backup kernel: LASTPHASE[0x1] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0x0]
Dec  9 01:18:59 backup kernel: SEQCTL[0x10] SEQ_FLAGS[0xc0] SSTAT0[0x0] SSTAT1[0x8]
Dec  9 01:18:59 backup kernel: SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xa4]
Dec  9 01:18:59 backup kernel: SXFRCTL0[0x80] DFCNTRL[0x0] DFSTATUS[0x89]
Dec  9 01:18:59 backup kernel: STACK: 0x0 0x163 0x109 0x3
Dec  9 01:18:59 backup kernel: SCB count = 5
Dec  9 01:18:59 backup kernel: Kernel NEXTQSCB = 2
Dec  9 01:18:59 backup kernel: Card NEXTQSCB = 2
Dec  9 01:18:59 backup kernel: QINFIFO entries:
Dec  9 01:18:59 backup kernel: Waiting Queue entries:
Dec  9 01:18:59 backup kernel: Disconnected Queue entries: 1:4
Dec  9 01:18:59 backup kernel: QOUTFIFO entries:
Dec  9 01:18:59 backup kernel: Sequencer Free SCB List: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Dec  9 01:18:59 backup kernel: Sequencer SCB Info:
Dec  9 01:18:59 backup kernel: 0 SCB_CONTROL[0xc0] SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 1 SCB_CONTROL[0x44] SCB_SCSIID[0x57] SCB_LUN[0x0] SCB_TAG[0x4]
Dec  9 01:18:59 backup kernel: 2 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 3 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 4 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 5 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 6 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 7 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 16 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 17 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 18 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 19 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Dec  9 01:18:59 backup kernel: Pending list:
Dec  9 01:18:59 backup kernel: 4 SCB_CONTROL[0x40] SCB_SCSIID[0x57] SCB_LUN[0x0]
Dec  9 01:18:59 backup kernel: Kernel Free SCB list: 3 1 0
Dec  9 01:18:59 backup kernel: Untagged Q(5): 4
Dec  9 01:18:59 backup kernel: DevQ(0:3:0): 0 waiting
Dec  9 01:18:59 backup kernel: DevQ(0:3:63): 0 waiting
Dec  9 01:18:59 backup kernel: DevQ(0:4:0): 0 waiting
Dec  9 01:18:59 backup kernel: DevQ(0:5:0): 0 waiting
Dec  9 01:18:59 backup kernel:
Dec  9 01:18:59 backup kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Dec  9 01:18:59 backup kernel: (scsi1:A:5:0): Device is disconnected, re-queuing SCB
Dec  9 01:18:59 backup kernel: (scsi1:A:5:0): Abort Message Sent
Dec  9 01:18:59 backup kernel: Recovery code sleeping
Dec  9 01:18:59 backup kernel: (scsi1:A:5:0): SCB 4 - Abort Completed.
Dec  9 01:18:59 backup kernel: Recovery SCB completes
Dec  9 01:18:59 backup kernel: Recovery code awake
Dec  9 01:18:59 backup kernel: aic7xxx_abort returns 0x2002


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


--
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

