Charles,
Is it just ZFS that hangs (or appears to slow down or block), or does
the whole system hang?
A couple of questions:
What does iostat show during the time period of the slowdown?
What does mpstat show during the time of the slowdown?
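Since the stall only shows up every couple of days, it may be easier to let a script watch for it than to catch it by hand. Here is a rough sketch, assuming the write-operations count is the fifth column of the zpool iostat lines you posted. The function name and the threshold of 5 samples are made up for illustration, and on OpenSolaris you may need to set AWK to nawk or /usr/xpg4/bin/awk.

```shell
#!/bin/sh
# Hypothetical helper: read "zpool iostat" sample lines on stdin and print a
# message once the write-ops column (assumed to be field 5) has been 0 for
# "threshold" consecutive samples.
watch_write_stall() {
    ${AWK:-awk} -v pool="$1" -v threshold="${2:-5}" '
        $1 != pool { next }     # skip headers and other pools
        $5 == 0 { zero++ }      # one more sample with no write operations
        $5 != 0 { zero = 0 }    # writes resumed, reset the counter
        zero == threshold {
            print "write stall: " zero " samples with 0 write ops"
        }
    '
}

# Example (not run here): zpool iostat pool0 1 | watch_write_stall pool0 5
```

When the message appears, that is the moment to start iostat and mpstat so the output covers the stall itself.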
You can look at the metadata statistics by running the following.
echo ::arc | mdb -k
When looking at a ZFS problem, I usually like to gather:
echo ::spa | mdb -k
echo ::zio_state | mdb -k
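To keep the output from all of these in one place, something like the following might do. It is only a sketch: the capture and collect_zfs_state names, the default directory under /var/tmp, and the iostat/mpstat intervals are my own choices, not anything standard.

```shell
#!/bin/sh
# Hypothetical helper: save each diagnostic into its own file under OUTDIR.
OUTDIR=${OUTDIR:-/var/tmp/zfs-hang-$(date +%Y%m%d-%H%M%S)}

# capture NAME 'COMMAND': run COMMAND through sh and keep stdout and stderr
# in $OUTDIR/NAME.out.
capture() {
    mkdir -p "$OUTDIR" || return 1
    sh -c "$2" >"$OUTDIR/$1.out" 2>&1
}

# Run as root while the pool is stalled; mdb -k reads the live kernel.
collect_zfs_state() {
    capture arc       'echo ::arc | mdb -k'
    capture spa       'echo ::spa | mdb -k'
    capture zio_state 'echo ::zio_state | mdb -k'
    capture iostat    'iostat -xn 5 3'   # three 5-second samples
    capture mpstat    'mpstat 5 3'
}
```

Calling collect_zfs_state during a hang, and once more when things are healthy, gives two sets of files to compare.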
I suspect you could drill down more with dtrace or lockstat to see where
the slowdown is happening.
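As a starting point, and only on the guess that the stall is related to transaction group sync, the two wrappers below time spa_sync() with dtrace and take a kernel profile with lockstat. The function names are hypothetical, and I am assuming the fbt provider exposes spa_sync entry and return probes on your build.

```shell
#!/bin/sh
# Sketch only: needs root on a system with DTrace and lockstat.

# Histogram of spa_sync() durations in milliseconds; Ctrl-C prints the result.
# Assumes fbt::spa_sync:entry and fbt::spa_sync:return exist on this kernel.
time_spa_sync() {
    dtrace -n '
        fbt::spa_sync:entry  { self->ts = timestamp; }
        fbt::spa_sync:return /self->ts/ {
            @["spa_sync (ms)"] = quantize((timestamp - self->ts) / 1000000);
            self->ts = 0;
        }'
}

# Kernel profile for N seconds (default 30), showing the top 20 entries.
profile_kernel() {
    lockstat -kIW -D 20 sleep "${1:-30}"
}
```

If spa_sync turns out to take seconds or minutes during a hang, that narrows where to look next; if not, the lockstat profile should at least show where the kernel is spending its time.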
Dave
On 08/30/10 11:02, Charles J. Knipe wrote:
Howdy,
We're having a ZFS performance issue over here that I was hoping you guys could
help me troubleshoot. We have a ZFS pool made up of 24 disks, arranged into 7
raid-z devices of 4 disks each. We're using it as an iSCSI back-end for VMware
and some Oracle RAC clusters.
Under normal circumstances performance is very good both in benchmarks and
under real-world use. Every couple of days, however, I/O seems to hang for
anywhere between several seconds and several minutes. The hang seems to be a
complete stop of all write I/O. The following zpool iostat illustrates:
pool0 2.47T 5.13T 120 0 293K 0
pool0 2.47T 5.13T 127 0 308K 0
pool0 2.47T 5.13T 131 0 322K 0
pool0 2.47T 5.13T 144 0 347K 0
pool0 2.47T 5.13T 135 0 331K 0
pool0 2.47T 5.13T 122 0 295K 0
pool0 2.47T 5.13T 135 0 330K 0
While this is going on our VMs all hang, as do any "zfs create" commands or attempts to
touch/create files in the zfs pool from the local system. After several minutes the system
"un-hangs" and we see very high write rates before things return to normal across the
board.
Some more information about our configuration: We're running OpenSolaris
snv_134. ZFS is at version 22. Our disks are 15K RPM 300GB Seagate Cheetahs,
mounted in Promise J610S Dual enclosures, hanging off a Dell SAS 5/e
controller. We'd tried out most of this configuration previously on
OpenSolaris 2009.06 without running into this problem. The only thing that's
new, aside from the newer OpenSolaris/ZFS, is a set of four SSDs configured as
log disks.
At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD
log disks, but we've seen the problem with those removed, as well.
Has anyone seen anything like this before? Are there any tools we can use to
gather information during the hang which might be useful in determining what's
going wrong?
Thanks for any insights you may have.
-Charles