I have an application that consists of a server and clients which communicate via TCP. The communication consists of exchanging many small packets - around 96 bytes of payload (non-uniform). I'm testing it on 8-CURRENT with two machines connected directly via a patch cable. The server is an 8-core Xeon system with the 5000X chipset and an embedded Intel PRO/1000 EB NIC (card=0x109615d9 chip=0x10968086), and the client is a 4-core Core2 desktop, also with an embedded Intel NIC, 82566DM-2. The client and server applications are multithreaded and I've verified that they scale well in environments like this. I've also verified with iperf that the NICs and the cable handle gigabit traffic fine.
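For reference, here's a minimal sketch of the kind of small-message send loop involved (hypothetical code, not the actual application - the real client is multithreaded and the message format differs; the server address, port, and the explicit TCP_NODELAY are just illustrative assumptions):

/* Minimal small-message TCP sender, ~96 bytes per write. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sin;
	char msg[96];
	int i, on = 1, s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		err(1, "socket");
	/* Disable Nagle so each ~96-byte message goes out as its own segment. */
	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		err(1, "setsockopt");

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(5000);			/* hypothetical port */
	sin.sin_addr.s_addr = inet_addr("10.0.0.1");	/* hypothetical server */
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		err(1, "connect");

	memset(msg, 'x', sizeof(msg));
	for (i = 0; i < 100000; i++) {
		if (write(s, msg, sizeof(msg)) != sizeof(msg))
			err(1, "write");
		/* The real client also reads back roughly 2x this much data. */
	}
	close(s);
	return (0);
}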
Here's an example netstat trace from a test run on the client:

            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
    161700     0   29374199     161768     0   16290562     0
    158320     0   28741763     158405     0   15962986     0
    157617     0   28614088     157696     0   15889426     0
    157569     0   28618951     157674     0   15884576     0

(i.e. the client receives about 2x the data it sends)

I've noticed something strange: the server is bottlenecked, with the "em1 taskq" kernel thread taking 100% of a CPU core while global CPU utilization is around 50%, but the client's "em0 taskq" thread under the same load is at ~10% (with > 30% idle). The client CPU is a bit faster than the server's (2.4 GHz vs 2.0 GHz), but I don't think this can account for such a big difference. Toggling TSO on the server doesn't help. This difference in taskq CPU load between the client and the server machine looks wrong to me, and I'd also expect more PPS here. Can someone comment on this? Are there any known issues with the server NIC I have there? (Both machines run amd64 kernels; WITNESS and INVARIANTS are disabled on both.)
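Just to spell out the averages behind the "about 2x" remark, the first sample line works out to roughly 182 bytes per received packet and 101 bytes per sent packet on the client, which is consistent with ~96-byte payloads plus headers. A trivial check (numbers copied from the first line above):

/* Quick arithmetic check on the first netstat sample line. */
#include <stdio.h>

int
main(void)
{
	double in_bytes = 29374199, in_pkts = 161700;
	double out_bytes = 16290562, out_pkts = 161768;

	printf("avg in:  %.1f bytes/packet\n", in_bytes / in_pkts);	/* ~181.7 */
	printf("avg out: %.1f bytes/packet\n", out_bytes / out_pkts);	/* ~100.7 */
	printf("in/out byte ratio: %.2f\n", in_bytes / out_bytes);	/* ~1.80 */
	return (0);
}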