Hi,

I'm having trouble with SCSI timeouts, but they appear to happen only when I use ZFS. I've tried to replicate them with SVM, but I can't get the timeouts to happen when that is the underlying volume manager. However, the performance with ZFS is much better when it does work.

The symptom is that at some point when the system is somewhat busy, disk I/O hangs for about a minute (with iostat showing the %busy column at 100%). Then I see a flood of messages like those below, after which it resets the bus, retries the transaction, and continues where it left off. The messages look like:

Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 3 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 2 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 1 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 4 lun 0. Initiating recovery.
Nov 22 18:55:23 nebula last message repeated 1 time
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@4,0 (sd38):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 225914045 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: MAXTOR Serial Number: J80ARRWK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@2,0 (sd36):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 90882344 Error Block: 90882344
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: MAXTOR Serial Number: J80BNNFK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@3,0 (sd37):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 225914045 Error Block: 225914045
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: MAXTOR Serial Number: J80BDCKK
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x0
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@0,0 (sd34):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 90882394 Error Block: 90882394
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 3KR0VPBF
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x2
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@6/pci8086,3...@0,2/pci9005,4...@3/s...@1,0 (sd35):
Nov 22 18:55:23 nebula Error for Command: write(10) Error Level: Retryable
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Requested Block: 90882348 Error Block: 90882348
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 3KR0WLM4
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 22 18:55:23 nebula scsi: [ID 107833 kern.notice] ASC: 0x29 (scsi bus reset occurred), ASCQ: 0x2, FRU: 0x2
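
In case it helps anyone compare which targets are affected, here is a rough Python sketch for tallying the adpu320 timeout warnings per target and LUN. It only assumes the message wording shown in the excerpt above; the function name count_timeouts and the regular expression are illustrative, not part of any Solaris tool, and "last message repeated" lines are deliberately not counted.

```python
import re
from collections import Counter

# Matches the adpu320 warnings as they appear in the excerpt above, e.g.
# "adpu320: [ID 138499 kern.warning] WARNING: Timeout on target 0 lun 0."
# The pattern is an assumption based only on that wording.
TIMEOUT_RE = re.compile(r"adpu320: .*?Timeout on target (\d+) lun (\d+)")


def count_timeouts(log_text):
    """Return a Counter mapping (target, lun) to the number of timeout warnings.

    Works even when several syslog entries were run together on one line.
    "last message repeated N time" lines are ignored.
    """
    counts = Counter()
    for target, lun in TIMEOUT_RE.findall(log_text):
        counts[(int(target), int(lun))] += 1
    return counts


def format_report(counts):
    """Render the counts as one line per (target, lun), sorted by target."""
    return "\n".join(
        f"target {target} lun {lun}: {n} timeout warning(s)"
        for (target, lun), n in sorted(counts.items())
    )
```

To use it, read the system log into a string (the log location is site-specific) and pass it in, for example print(format_report(count_timeouts(text))). In the excerpt above every target from 0 to 4 shows up, which is consistent with the whole bus stalling rather than a single disk misbehaving.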

I have a Dell 2850 with an Adaptec ASC-39320A U320 dual-channel SCSI card. I've connected both channels to a split-bus Dell PowerVault 220S disk array with 11 300GB 10K drives via 2 cables. I have already swapped the HBA and both cables. I've moved disks around and tried subsets of disks, but the problem still occurs regardless of the disk configuration, or whether one or both controllers are used.

I've tried raidz2, raidz1, and mirrors, but it eventually hangs and issues a timeout (and it does this several times a day). I've tried both raid5 and mirror using SVM, and it never gets the timeout (but the raid5 is quite a bit slower, so I'd like to stick with ZFS).
There's no problem if you just put UFS on the raw disks.
I've run diskomizer for many hours without a problem, both on raw disks and with UFS on the disks.

I had planned on making this system a master database server. However, I'm still getting the timeouts with it running as a slave, so I'm not comfortable promoting it to master.

Any suggestions?

Thanks,

Brian


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
