Hi,
I'm having trouble with SCSI timeouts, but they appear to happen only
when I use ZFS.
I've tried to reproduce the problem with SVM, but I can't get the
timeouts to happen when SVM is the underlying volume manager. However,
ZFS performs much better when it does work.
The symptom is that at some point when the system is somewhat busy,
disk I/O hangs for about a minute (with iostat showing the %busy column
at 100%). Then I see a flood of messages like the ones below, after
which the bus is reset, the transaction is retried, and I/O continues
where it left off. The messages look like:
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING:
Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@4,0 (sd38):
Nov 22 18:55:23 nebula Error for Command: write(10) Error
Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block:
225914045 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor:
MAXTOR Serial Number: J80ARRWK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit
Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi
bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@2,0 (sd36):
Nov 22 18:55:23 nebula Error for Command: write(10) Error
Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block:
90882344 Error Block: 90882344
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor:
MAXTOR Serial Number: J80BNNFK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit
Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi
bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@3,0 (sd37):
Nov 22 18:55:23 nebula Error for Command: write(10) Error
Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block:
225914045 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor:
MAXTOR Serial Number: J80BDCKK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit
Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi
bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@0,0 (sd34):
Nov 22 18:55:23 nebula Error for Command: write(10) Error
Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block:
90882394 Error Block: 90882394
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor:
SEAGATE Serial Number: 3KR0VPBF
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit
Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi
bus reset occurred), ASCQ: 0x2, FRU: 0x2
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@1,0 (sd35):
Nov 22 18:55:23 nebula Error for Command: write(10) Error
Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block:
90882348 Error Block: 90882348
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor:
SEAGATE Serial Number: 3KR0WLM4
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit
Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi
bus reset occurred), ASCQ: 0x2, FRU: 0x2
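To check whether the timeouts cluster on particular targets or are
spread across the whole bus, the adpu320 warnings can be tallied per
target. The following is only a minimal sketch: the function name
count_timeouts is made up for illustration, the default path
/var/adm/messages is an assumption about where these messages are
logged, and "last message repeated N time" lines are not added to the
counts.

```python
import re
import sys
from collections import Counter

# Matches the adpu320 warning text shown in the log excerpt above.
TIMEOUT_RE = re.compile(r"Timeout on target (\d+) lun (\d+)")


def count_timeouts(lines):
    """Count explicit timeout warnings per (target, lun).

    "last message repeated N time" lines are ignored, so the real
    number of timeouts may be higher than what is reported here.
    """
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            target, lun = int(match.group(1)), int(match.group(2))
            counts[(target, lun)] += 1
    return counts


if __name__ == "__main__":
    # Assumed default log location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/adm/messages"
    with open(path, errors="replace") as log:
        totals = count_timeouts(log)
    for (target, lun), n in sorted(totals.items()):
        print(f"target {target} lun {lun}: {n} timeout warnings")
```

If the counts are roughly even across all targets, that would point
away from a single bad drive and toward something shared by the bus.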
I have a Dell 2850 with an Adaptec ASC-39320A U320 dual-channel SCSI
card. I've connected both channels to a split-bus Dell PowerVault 220S
disk array with 11 300GB 10K drives via 2 cables. I have already
swapped the HBA and both cables. I've moved disks around and tried
subsets of disks, but the problem persists regardless of the disk
configuration or whether one or both controllers are used.
I've tried raidz2, raidz1, and mirrors, but each configuration
eventually hangs and times out, and this happens several times a day.
I've tried both raid5 and mirror using SVM, and it never gets the
timeout (but the raid5 is quite a bit slower, so I'd like to stick with
ZFS).
There's no problem if you just put UFS on the raw disks.
I've run diskomizer for many hours without a problem, both on raw
disks and with UFS on the disks.
I had planned on making this system a master database server. However,
I'm still getting the timeouts with it running as a slave, so I'm not
comfortable promoting this system to master.
Any suggestions?
Thanks,
Brian
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss