Interesting... I wonder what differs between your system and mine. With my 
dirt-simple stress test:

server1# zpool create X25E c1t15d0
server1# zfs set sharenfs=rw X25E
server1# chmod a+w /X25E

server2# cd /net/server1/X25E
server2# gtar zxf /var/tmp/emacs-22.3.tar.gz

and a fully patched X42420 running Solaris 10 U7 I still see these errors:

Jul  7 22:35:04 merope  Error for Command: write(10)               Error Level: Retryable
Jul  7 22:35:04 merope scsi:    Requested Block: 5301376           Error Block: 5301376
Jul  7 22:35:04 merope scsi:    Vendor: ATA                        Serial Number: CVEM849300BM
Jul  7 22:35:04 merope scsi:    Sense Key: Unit Attention
Jul  7 22:35:04 merope scsi:    ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Jul  7 22:35:09 merope scsi: WARNING: /p...@0,0/pci10de,3...@f/pci1000,3...@0/s...@f,0 (sd32):
Jul  7 22:35:09 merope  Error for Command: write(10)               Error Level: Retryable
Jul  7 22:35:09 merope scsi:    Requested Block: 5315248           Error Block: 5315248
Jul  7 22:35:09 merope scsi:    Vendor: ATA                        Serial Number: CVEM849300BM
Jul  7 22:35:09 merope scsi:    Sense Key: Unit Attention
Jul  7 22:35:09 merope scsi:    ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

I had an idea that this might be due to NCQ overruns, since if I'm not mistaken 
the X25E only supports 32 outstanding commands, so I've started testing various 
things. Setting sd_max_throttle in /etc/system doesn't seem to make any 
difference.
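
For reference, this is roughly how I set it (a sketch - the value 20 is just an 
example, and since /etc/system is only read at boot, it needs a reboot to take 
effect):

```
* /etc/system - cap the number of commands queued per sd target.
* Note: sd_max_throttle is a global tunable, so it affects ALL sd
* devices on the system, not just the X25E.
set sd:sd_max_throttle=20
```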

However... tuning zfs_vdev_max_pending down from 35 to 10 made a difference. 
Where I used to see long "hiccups" in "zpool iostat X25E 10", with it tuned 
down to 10 things run *much* more smoothly - no hiccups:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
X25E         284K  29.7G      0      5    332   549K
X25E         284K  29.7G      0    197      0  5.70M
X25E         284K  29.7G      0    197      0  2.28M
X25E        59.8M  29.7G      0    322      0  10.9M
X25E        59.8M  29.7G      0    418      0  7.97M
X25E        59.8M  29.7G      0    588      0  10.3M

Still a lot of the same errors on the console though 
(more often, actually...)
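
In case anyone wants to try the same tuning, this is roughly how I did it (a 
sketch - 10 is just the value that happened to work well here):

```
# Change zfs_vdev_max_pending in the live kernel with mdb
# (takes effect immediately, lost on reboot):
echo 'zfs_vdev_max_pending/W0t10' | mdb -kw

# Or make it persistent across reboots via /etc/system:
#   set zfs:zfs_vdev_max_pending=10
```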

Output from iostat -zx 10 if it is of interest:

                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd32      0.0  718.7    0.0 1437.4  0.0  0.0    0.0   0   3 
sd32      0.0  401.1    0.0 6089.1  0.0  0.8    2.1   0  43 
sd32      0.0 1187.5    0.0 12341.7  0.0  0.7    0.6   2  37 
sd32      0.0  758.2    0.0 14835.1  0.0  1.7    2.3   4  66 
sd32      0.0  403.1    0.0 4606.8  0.0  1.5    3.9   4  77 
sd32      0.0  350.8    0.0 3420.8  0.0  1.6    4.6   4  80 
sd32      0.0  315.9    0.0 8578.2  0.0  0.4    1.1   0   6 

I'm really curious what is causing these errors... It's almost 
as if something else is causing them. Perhaps some 
"flush cache" command that is executed after the writes to the 
device have completed (since I see this error more often with 
'zfs_vdev_max_pending' tuned down).

Another interesting thing is that I only see this for I/O issued 
from a remote server over NFS. If I write directly to the X25E 
volume on the server, things run really smoothly.
-- 
This message posted from opensolaris.org
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org
