(Sorry for posting a Solaris 10 networking question here; let me know if I
should post it elsewhere.)

Problem: We don't see 4 x 1 GigE cards producing 4 Gbps of aggregate throughput.

Our setup: two nodes (n1, n2) are connected back to back with 4 GigE NICs.
Each individual NIC can sustain about 100 MB/s of throughput.

n1 is the client and n2 is the server. n1 reads data stored in n2's memory,
without touching disk.

If I run the same application on all 4 NICs at the same time, the maximum I
get is 200 MB/s. With 2 NICs I get 150 MB/s.
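(In other words, I would expect roughly 4 x 100 MB/s = ~400 MB/s, i.e. about
3.2 Gbps of aggregate throughput, against the ~200 MB/s we actually see.)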

I noticed that cpu#6 is 100% loaded and its "ithr" count in mpstat is very
high; see a sample mpstat output below.

n2>#mpstat 1 1000
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 0    0   0 71414 42597  241 3660  111  728  188    0  1336    0  46   0  54
 1    0   0 49839 45700    0 4228  100  635  149    0   906    0  40   0  60
 2    0   0 67422 41955    0 1484   47  267  178    0  1243    0  43   0  57
 3    0   0 60928 43176    0 1260   44  198  424    0  1061    0  43   0  57
 4    0   0 27945 47010    3  552    8   63  187    0   571    1  29   0  70
 5    0   0 29726 46722    1  626    7   73   63    0   515    0  27   0  73
 6    0   0    0 52581 1872  387  114   10  344    0     8    0  99   0   1
 7    0   0 48189 44176    0 1077   25  152  150    0   858    0  34   0  66

On n1, processor #6 is about 60% loaded and the rest of the processors are
below 50%. These results were obtained with default system parameters.
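
To double-check that the interrupt load on cpu#6 really comes from the NICs,
I believe something like intrstat(1M) can be run alongside mpstat; it reports,
per CPU, which devices are interrupting and how much time is spent servicing
them (the 1-second interval below is arbitrary):

n2># intrstat 1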

This happens with MTU 1500 on the Broadcom GigE NICs. When I use MTU 9000
(jumbo frames), I get throughput close to 3.8 Gbps. cpu#6 is still >90% busy
and its ithr count is still very high; the only difference is that the other
CPUs are now also busy (close to 90%).
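
For reference, this is roughly how an MTU of 9000 can be set per interface
(bge0 is just an example name, and the bge driver may also need an MTU
property in its driver.conf before ifconfig will accept 9000):

n2># ifconfig bge0 mtu 9000
n2># ifconfig bge0 | grep mtu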

I tried changing some /etc/system parameters, e.g.:

*       distribute squeues among all cpus
*       do this when NICs are faster than CPUs
        set ip:ip_squeue_fanout=1

(This was not the case in our setup; we have 8 x 2.33 GHz processors vs.
4 x 1 GigE NICs, but I tried it anyway.)

*       use this if the number of cpus is far greater than the number of nics
        set ip:tcp_squeue_wput=1

(Since this was our case, I tried it, without any improvement.)

*       latency-sensitive machines should set this to zero
*       default: worker threads wait for 10ms
*       val=0 means no wait, serve immediately
        set ip:ip_squeue_wait=0


None of these had any effect. With ip:ip_squeue_fanout=1 the benchmark
actually fails to run, although ordinary TCP connections still work.
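
For what it's worth, the tunables can presumably be read back with mdb -k to
confirm they took effect after reboot (this assumes all three variables live
in the ip module, as the ip: prefix above suggests):

n2># echo 'ip`ip_squeue_fanout/D' | mdb -k
n2># echo 'ip`tcp_squeue_wput/D' | mdb -k
n2># echo 'ip`ip_squeue_wait/D' | mdb -k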

1)
So, my question is: why is the CPU utilization so high on cpu#6 only?
Although jumbo frames solve the problem for these 2 machines, the same
scalability problem will reappear as we grow to 3, 4, 5, ... nodes, since
CPU utilization is already very high in the current state.

Is there any kernel tunable I should try in order to distribute the load
differently? Are all TCP connections (and their squeues) getting tied to
processor #6? Is there a way to spread the connections across the other
processors?

2) With 24 TCP connections established, I again see that only cpu#6 has mblks
queued on its squeue (none of the others do). I didn't capture cpu#6's mblks
in the example below, though.

Is something wrong here? Shouldn't each TCP connection have its own squeue?

[3]> ::squeue
            ADDR STATE CPU            FIRST             LAST           WORKER
ffffffff98e199c0 02060   7 0000000000000000 0000000000000000 fffffe8001139c80
ffffffff98e19a80 02060   6 0000000000000000 0000000000000000 fffffe8001133c80
ffffffff98e19b40 02060   5 0000000000000000 0000000000000000 fffffe80010dfc80
ffffffff98e19c00 02060   4 0000000000000000 0000000000000000 fffffe800108bc80
ffffffff98e19cc0 02060   3 0000000000000000 0000000000000000 fffffe8001037c80
ffffffff98e19d80 02060   2 0000000000000000 0000000000000000 fffffe8000fe3c80
ffffffff98e19e40 02060   1 0000000000000000 0000000000000000 fffffe80004ebc80
ffffffff98e19f00 02060   0 0000000000000000 0000000000000000 fffffe8000293c80
[3]> :c
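
If it helps narrow this down: is it reasonable to check the per-connection
binding directly? My (possibly wrong) understanding is that ::netstat prints
the conn_t addresses and that conn_sqp in the conn_t is the squeue a
connection is bound to, so something like this should show whether all 24
connections point at cpu#6's squeue (<conn_addr> is a placeholder for an
address taken from the ::netstat output):

[3]> ::netstat
[3]> <conn_addr>::print conn_t conn_sqp

Comparing the printed conn_sqp values against the ADDR column of ::squeue
above would then show whether they all land on the cpu 6 squeue.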

thanks, som
([EMAIL PROTECTED], ph: 650-527-1566)