I have a fast Nehalem system: 8 cores x 2 hyperthreads, 48 GB of memory.
During network throughput testing (multiple FTP server instances transferring
about 200 MB/sec across 2x GigE interfaces) the system periodically becomes
extremely sluggish: I can barely type commands at the console, and network
throughput drops to almost nothing. Once the system gets into this state it
will not recover unless we kill the network load.

In vmstat I see run-queue (r) values like 21, 10, 5 and several hundred minor
faults (mf) per interval, but otherwise little activity.
Occasionally I see 100% kernel/system activity for seconds at a time.
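
For reference, those numbers come from plain interval sampling, roughly like
this (one-second intervals are just what I happened to use):

  # 'r' is the run queue, 'mf' the minor-fault column
  vmstat 1

  # per-CPU view, to see whether the 100% sys time is confined to a few CPUs
  mpstat 1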

I have been trying to figure out who is using CPU in the kernel via DTrace
profiling scripts run for 10 seconds at a time, e.g.:

  dtrace -n 'profile:::profile-3456 /arg0/ { @[stack(1)] = count(); }'

but I get the DTrace deadman abort ("Abort due to systemic unresponsiveness").
I tried to force the script to run via -w, and I just see a very low count
(1 - 2 max) of seemingly random functions sampled.
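
The next variant I was planning to try is a lower, conventional sampling rate
keyed by CPU, along these lines (997 Hz is just the usual choice to avoid
lockstep with the clock interrupt; the /arg0/ predicate keeps only samples
taken in kernel context):

  dtrace -n 'profile-997 /arg0/ { @[cpu, stack()] = count(); } tick-10s { exit(0); }'

Since cpu is the built-in CPU id, this should at least show whether the kernel
time is concentrated on one or two CPUs.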

I wonder if there is perhaps a hardware issue that prevents the DTrace
sampling interrupts from being delivered. I tried to see what is going on with
a variety of other tools (all the various *stat commands) but fail to see
anything obvious, other than the run queue and the occasional 100% kernel
time. I typically see an almost idle system: no lock contention, no I/O wait,
and low system-call, context-switch and smtx counts.
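
One thing I still want to rule out is the GigE interrupt handling pinning a
CPU. I was going to start with intrstat(1M) and, assuming the sdt interrupt
probes are available on this build, count interrupt activity per CPU with
something like:

  # per-device interrupt time, one-second intervals
  intrstat 1

  # sketch only: count interrupt-start firings per CPU over 10 seconds
  dtrace -n 'sdt:::interrupt-start { @[cpu] = count(); } tick-10s { exit(0); }'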
  
Any suggestions regarding what tool or DTrace script to use, or where to look
to get to the bottom of the sluggishness, would be much appreciated.
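
(For what it's worth, the one fallback I have not tried yet is lockstat's
interrupt-based kernel profiling, roughly:

  # sample kernel PCs via the profiling interrupt; top 20 entries over 10 seconds
  lockstat -kIW -D 20 sleep 10

though I suspect that if the profile interrupts are not being delivered, this
will run into the same wall.)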

TIA

Steve
-- 
This message posted from opensolaris.org