Here is a total guess, but what if it has to do with ZFS processing running on one CPU having to talk to the memory "owned" by a different CPU? I don't know if many people are running fully populated boxes like you are, so maybe it is something people are not seeing because they don't have huge amounts of RAM on these new multichip transport systems.

Maybe you could test it by going down to one Magny-Cours CPU on your AMD box, with only the memory for that CPU populated, and see if you get the same periodic high loads.
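
To compare the one-CPU run against the fully populated box, it might help to log when the load spikes actually happen. Below is a rough Python sketch, purely my own illustration and not anything that ships with Nexenta. The function names and the threshold of 10 are assumptions. Note that the 1-minute load average is a smoothed figure, so a stall of a few seconds will look longer and later in this log than it really was.

```python
import os
import time


def sample_load(duration_s, interval_s=1.0):
    """Collect (elapsed_seconds, 1-minute load average) pairs for duration_s seconds."""
    start = time.time()
    samples = []
    while True:
        elapsed = time.time() - start
        if elapsed >= duration_s:
            break
        # os.getloadavg() returns the 1-, 5- and 15-minute load averages.
        samples.append((elapsed, os.getloadavg()[0]))
        time.sleep(interval_s)
    return samples


def find_spikes(samples, threshold):
    """Return (start, end) times of each contiguous run of samples with load >= threshold."""
    spikes = []
    run_start = None
    last_t = None
    for t, load in samples:
        if load >= threshold:
            if run_start is None:
                run_start = t  # a new spike begins here
            last_t = t
        elif run_start is not None:
            spikes.append((run_start, last_t))  # the spike ended on the previous sample
            run_start = None
    if run_start is not None:
        spikes.append((run_start, last_t))  # spike still in progress at the end
    return spikes
```

Running something like find_spikes(sample_load(600), 10) on both configurations while the clients are hammering the pool would give two lists of spike intervals to compare. If the one-CPU box shows no spikes under the same workload, that would at least point toward the cross-CPU memory idea.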


On 10/30/10 5:48 AM, Ian D wrote:
I owe you all an update...

We found a clear pattern that we can now recreate at will.  Whenever we read/write the 
pool, it gives the expected throughput and IOPS for a while, but at some point it slows down 
to a crawl: nothing responds and everything pretty much "hangs" for a few seconds, and 
then things go back to normal for a little while.  Sometimes the problem is barely 
noticeable and only happens once every few minutes; at other times it is every few 
seconds.  We could be doing the exact same operation and sometimes it is fast and 
sometimes it is slow.  The more clients are connected, the worse the problem typically 
gets, and no, it's not happening every 30 seconds when things are committed to disk.

Now... every time that slowdown occurs, the load on the Nexenta box gets crazy 
high: it can reach 35 or more, and the console doesn't even respond anymore.  The 
rest of the time the load barely reaches 3.  The box has four 7500 series Intel 
Xeon CPUs and 256G of RAM and uses 15K SAS HDDs in mirrored stripes on LSI 
9200-8e HBAs, so we're certainly not underpowered.  We also have the same issue 
when using a box with two of the latest AMD Opteron CPUs (the Magny-Cours) and 
128G of RAM.

We are able to reach 800MB/sec and more over the network when things go well, 
but the average gets destroyed by the slowdowns, when there is zero throughput.

These tests are run without any L2ARC or SLOG, but past tests have shown the 
same issue when using them.  We've tried with 12x 100G Samsung SLC SSDs and 
DDRDrive X1s, among other things, and while they make the whole thing much 
faster, they don't prevent those intermittent slowdowns from happening.

Our next step is to isolate the process that takes all that CPU...

Ian
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
