more below...
On Oct 24, 2009, at 2:49 AM, Adam Cheal wrote:
The iostat I posted previously was from a system where we had already
tuned zfs:zfs_vdev_max_pending down to 10 (visible as the cap of about
10 in the actv column for each disk).
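For reference, the tuning itself is a one-line change; a minimal sketch of
the tunable plus the iostat invocation used to watch its effect (the exact
value and flags here are illustrative):

    # /etc/system -- cap the per-vdev I/O queue depth (takes effect after a reboot)
    set zfs:zfs_vdev_max_pending = 10

    # watch per-disk queue depth (actv), %b and error counters while the scrub runs
    iostat -xne 5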
I reset this value in /etc/system to 7, rebooted, and started a scrub.
iostat showed busier disks (%b was higher, which seemed odd) but a cap
of about 7 queued I/Os per disk, proving the tuning had taken effect.
iostat at a high-water mark during the test looked like this:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c8
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c8t0d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c8t1d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c8t2d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c8t3d0
8344.5 0.0 359640.4 0.0 0.1 300.5 0.0 36.0 0 4362 c9
190.0 0.0 6800.4 0.0 0.0 6.6 0.0 34.8 0 99 c9t8d0
185.0 0.0 6917.1 0.0 0.0 6.1 0.0 32.9 0 94 c9t9d0
187.0 0.0 6640.9 0.0 0.0 6.5 0.0 34.6 0 98 c9t10d0
186.5 0.0 6543.4 0.0 0.0 7.0 0.0 37.5 0 100 c9t11d0
180.5 0.0 7203.1 0.0 0.0 6.7 0.0 37.2 0 100 c9t12d0
195.5 0.0 7352.4 0.0 0.0 7.0 0.0 35.8 0 100 c9t13d0
188.0 0.0 6884.9 0.0 0.0 6.6 0.0 35.2 0 99 c9t14d0
204.0 0.0 6990.1 0.0 0.0 7.0 0.0 34.3 0 100 c9t15d0
199.0 0.0 7336.7 0.0 0.0 7.0 0.0 35.2 0 100 c9t16d0
180.5 0.0 6837.9 0.0 0.0 7.0 0.0 38.8 0 100 c9t17d0
198.0 0.0 7668.9 0.0 0.0 7.0 0.0 35.3 0 100 c9t18d0
203.0 0.0 7983.2 0.0 0.0 7.0 0.0 34.5 0 100 c9t19d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c9t20d0
195.5 0.0 7096.4 0.0 0.0 6.7 0.0 34.1 0 98 c9t21d0
189.5 0.0 7757.2 0.0 0.0 6.4 0.0 33.9 0 97 c9t22d0
195.5 0.0 7645.9 0.0 0.0 6.6 0.0 33.8 0 99 c9t23d0
194.5 0.0 7925.9 0.0 0.0 7.0 0.0 36.0 0 100 c9t24d0
188.5 0.0 6725.6 0.0 0.0 6.2 0.0 32.8 0 94 c9t25d0
188.5 0.0 7199.6 0.0 0.0 6.5 0.0 34.6 0 98 c9t26d0
196.0 0.0 6666.9 0.0 0.0 6.3 0.0 32.1 0 95 c9t27d0
193.5 0.0 7455.4 0.0 0.0 6.2 0.0 32.0 0 95 c9t28d0
189.0 0.0 7400.9 0.0 0.0 6.3 0.0 33.2 0 96 c9t29d0
182.5 0.0 9397.0 0.0 0.0 7.0 0.0 38.3 0 100 c9t30d0
192.5 0.0 9179.5 0.0 0.0 7.0 0.0 36.3 0 100 c9t31d0
189.5 0.0 9431.8 0.0 0.0 7.0 0.0 36.9 0 100 c9t32d0
187.5 0.0 9082.0 0.0 0.0 7.0 0.0 37.3 0 100 c9t33d0
188.5 0.0 9368.8 0.0 0.0 7.0 0.0 37.1 0 100 c9t34d0
180.5 0.0 9332.8 0.0 0.0 7.0 0.0 38.8 0 100 c9t35d0
183.0 0.0 9690.3 0.0 0.0 7.0 0.0 38.2 0 100 c9t36d0
186.0 0.0 9193.8 0.0 0.0 7.0 0.0 37.6 0 100 c9t37d0
180.5 0.0 8233.4 0.0 0.0 7.0 0.0 38.8 0 100 c9t38d0
175.5 0.0 9085.2 0.0 0.0 7.0 0.0 39.9 0 100 c9t39d0
177.0 0.0 9340.0 0.0 0.0 7.0 0.0 39.5 0 100 c9t40d0
175.5 0.0 8831.0 0.0 0.0 7.0 0.0 39.9 0 100 c9t41d0
190.5 0.0 9177.8 0.0 0.0 7.0 0.0 36.7 0 100 c9t42d0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c9t43d0
196.0 0.0 9180.5 0.0 0.0 7.0 0.0 35.7 0 100 c9t44d0
193.5 0.0 9496.8 0.0 0.0 7.0 0.0 36.2 0 100 c9t45d0
187.0 0.0 8699.5 0.0 0.0 7.0 0.0 37.4 0 100 c9t46d0
198.5 0.0 9277.0 0.0 0.0 7.0 0.0 35.2 0 100 c9t47d0
185.5 0.0 9778.3 0.0 0.0 7.0 0.0 37.7 0 100 c9t48d0
192.0 0.0 8384.2 0.0 0.0 7.0 0.0 36.4 0 100 c9t49d0
198.5 0.0 8864.7 0.0 0.0 7.0 0.0 35.2 0 100 c9t50d0
192.0 0.0 9369.8 0.0 0.0 7.0 0.0 36.4 0 100 c9t51d0
182.5 0.0 8825.7 0.0 0.0 7.0 0.0 38.3 0 100 c9t52d0
202.0 0.0 7387.9 0.0 0.0 7.0 0.0 34.6 0 100 c9t55d0
...and sure enough about 20 minutes into it I get this (bus reset?):
scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,6...@4/pci1000,3...@0/s...@34,0 (sd49):
        incomplete read- retrying
scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,6...@4/pci1000,3...@0/s...@21,0 (sd30):
        incomplete read- retrying
scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,6...@4/pci1000,3...@0/s...@1e,0 (sd27):
        incomplete read- retrying
scsi: [ID 365881 kern.info] /p...@0,0/pci8086,6...@4/pci1000,3...@0 (mpt0):
        Rev. 8 LSI, Inc. 1068E found.
scsi: [ID 365881 kern.info] /p...@0,0/pci8086,6...@4/pci1000,3...@0 (mpt0):
        mpt0 supports power management.
scsi: [ID 365881 kern.info] /p...@0,0/pci8086,6...@4/pci1000,3...@0 (mpt0):
        mpt0: IOC Operational.
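These warnings land in /var/adm/messages, so an easy way to watch for them
in real time while a scrub runs is something like:

    # follow the system log and flag mpt/SCSI retry and reset noise
    tail -f /var/adm/messages | egrep -i 'mpt|incomplete|reset'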
During the "bus reset", iostat output looked like this:
                    extended device statistics       ---- errors ---
  r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w   %b s/w h/w trn tot device
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c8
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c8t0d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c8t1d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c8t2d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c8t3d0
  0.0    0.0    0.0    0.0  0.0 88.0    0.0    0.0   0 2200   0   3   0   3 c9
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t8d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t9d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t10d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t11d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t12d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t13d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t14d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t15d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t16d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t17d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t18d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t19d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t20d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t21d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t22d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t23d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t24d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t25d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t26d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t27d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t28d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t29d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   1   0   1 c9t30d0
OK, here we see 4 I/Os pending outside of the host. The host has
sent them on and is waiting for them to return. This means they are
getting dropped either at the disk or somewhere between the disk
and the controller.
When this happens, the sd driver will time them out, try to clear
the fault by reset, and retry. In other words, the resets you see
are when the system tries to recover.
Since there are many disks with 4 stuck I/Os, I would lean towards
a common cause. What do these disks have in common? Firmware?
Do they share a SAS expander?
-- richard
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t31d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t32d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   1   0   1 c9t33d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t34d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t35d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t36d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t37d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t38d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t39d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t40d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t41d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t42d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t43d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t44d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t45d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t46d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t47d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t48d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t49d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t50d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   0   0   0 c9t51d0
  0.0    0.0    0.0    0.0  0.0  4.0    0.0    0.0   0  100   0   1   0   1 c9t52d0
  0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0    0   0   0   0   0 c9t55d0
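To see what the affected disks have in common, a reasonable starting point is
to compare firmware revisions and error counters across all of them and to
look at what FMA logged around the reset; a sketch, with nothing
vendor-specific assumed:

    # vendor, product, firmware revision and soft/hard/transport error
    # counters for every disk the sd driver knows about
    iostat -En

    # FMA error telemetry (timeouts, resets and transport errors show up
    # here as ereports)
    fmdump -eV | less

    # attachment points as the system currently sees them
    cfgadm -al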
During our previous testing we had even tried setting this max_pending
value down to 1, but we still hit the problem (although it took a little
longer to appear), and I couldn't find anything else I could set to
throttle I/O to the disks, hence the frustration.
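(For what it's worth, zfs_vdev_max_pending can also be changed on a live
system with mdb, which makes this kind of experiment quicker to iterate on;
a sketch:)

    # drop the per-vdev queue depth to 1 on the running kernel (0t = decimal)
    echo 'zfs_vdev_max_pending/W0t1' | mdb -kw

    # read the current value back to confirm
    echo 'zfs_vdev_max_pending/D' | mdb -k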
If you hadn't seen this output, would you say that 7 is a "reasonable"
value for the max_pending queue on our architecture, one that should give
the LSI controller enough breathing room to operate? If so, I *should* be
able to scrub the disks successfully (i.e. ZFS isn't to blame), which
points the finger at the mpt driver, LSI firmware, or disk firmware
instead.
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss