So maybe a next step is to run zilstat, arcstat, iostat -xe (I forget which
parameters people prefer), and zpool iostat -v in four terminal windows while
running the same test, and try to see what spikes when that high-load period
occurs.
Not sure if there is a better version than this:
http://www.solarisinternals.com/wiki/index.php/Arcstat
Richard's zilstat:
http://blog.richardelling.com/2009/02/zilstat-improved.html
Other ARC tools:
http://vt-100.blogspot.com/2010/03/top-with-zfs-arc.html
http://www.cuddletech.com/blog/pivot/entry.php?id=979
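And for isolating what eats the CPU during a stall (Ian's next step, below),
prstat plus a quick DTrace kernel profile should narrow it down. Just a
sketch; the 997Hz sample rate and the 10-second window are arbitrary choices:

  # prstat -mL 1
  # dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
      tick-10s { trunc(@, 10); printa(@); exit(0); }'

If the hot stacks are all in the kernel, lockstat -kIW -D 20 sleep 30 tells a
similar story.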
On 10/30/10 5:48 AM, Ian D wrote:
I owe you all an update...
We found a clear pattern we can now recreate at will. Whenever we read/write the
pool, it gives the expected throughput and IOPS for a while, but at some point it
slows down to a crawl: nothing responds and it pretty much "hangs" for a few
seconds, then things go back to normal for a little while. Sometimes the problem
is barely noticeable and only happens once every few minutes; at other times it
is every few seconds. We could be doing the exact same operation and sometimes it
is fast, sometimes slow. The more clients are connected, the worse the problem
typically gets. And no, it's not happening every 30 seconds when things are
committed to disk.
Now... every time that slowdown occurs, the load on the Nexenta box gets crazy
high: it can reach 35 and more, and the console doesn't even respond anymore. The
rest of the time the load barely reaches 3. The box has four 7500-series Intel
Xeon CPUs and 256G of RAM, and uses 15K SAS HDDs in mirrored stripes on LSI
9200-8e HBAs, so we're certainly not underpowered. We also have the same issue
on a box with two of the latest AMD Opteron CPUs (the Magny-Cours) and
128G of RAM.
We are able to reach 800MB/sec and more over the network when things go well,
but the average gets destroyed by the slowdowns, when there is zero throughput.
These tests were run without any L2ARC or SLOG, but past tests have shown the
same issue when using them. We've tried 12x 100G Samsung SLC SSDs and
DDRdrive X1s, among other things, and while they make the whole thing much
faster, they don't prevent those intermittent slowdowns from happening.
Our next step is to isolate the process that takes all that CPU...
Ian