The simple answer is that you are running a stress test, and the finite amount of processing required to separate traffic into queues is showing up as a performance impact on your setup. At that rate you could also be running into limitations of the PCIe bus, plus the added latency of crossing the QPI bus if you're using cores on the remote CPU.
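For context on that per-packet separation work: RSS steers each packet by computing a Toeplitz hash over the flow tuple and using it to index an indirection table (RETA) that selects a queue. A minimal Python sketch of that mapping follows; the key contents, the round-robin table layout, and the queue count are illustrative, not the 82599's actual configuration.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash as used by RSS: for every set bit in the
    input, XOR in the 32-bit window of the key (viewed as a bit string)
    that starts at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def rss_queue(hash_value: int, num_queues: int, reta_size: int = 128) -> int:
    """Map a hash to a queue through a simple round-robin indirection
    table (the 82599's RETA has 128 entries; contents are illustrative)."""
    reta = [i % num_queues for i in range(reta_size)]
    return reta[hash_value % reta_size]


# Example: hash a 12-byte IPv4 src/dst/ports tuple with a 40-byte key.
key = bytes(range(40))          # placeholder key, not a real RSS key
tuple_bytes = b"\x80" + b"\x00" * 11
queue = rss_queue(toeplitz_hash(key, tuple_bytes), num_queues=6)
```

The point of the sketch is that this work (in hardware) plus the per-queue descriptor handling (in the driver) is a real, per-packet cost, which is why more queues can cost rather than gain at small packet sizes.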
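The queue count itself is tunable, so you can experiment rather than rebuild. A hedged sketch; the interface name eth0 and the queue count are placeholders, and `-L`/`--set-channels` support depends on your ethtool and driver versions:

```shell
# Check the current channel (queue) configuration.
ethtool -l eth0

# Reduce the number of combined RX/TX queues at runtime
# (ethtool/driver versions permitting).
ethtool -L eth0 combined 3

# The out-of-tree ixgbe driver can also be loaded with a fixed
# queue count via its module parameter:
#   modprobe ixgbe RSS=3

# Watch where the drops are being counted: per-queue counters
# vs. NIC-level missed/no-DMA counters.
ethtool -S eth0 | grep -E 'rx_queue|missed|no_dma'
```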
Depending on the driver and kernel version, we've seen the best small-packet performance at around 3 queues. Adding queues increases the processing required to separate the traffic into queues, which at some point stops gaining you anything. Since you're able to do 10G, it sounds like most of the system is working properly, and the trick is to tune the system for whatever it is you're planning to use it for.

Thanks.

Todd Fujinaka
Software Application Engineer
Networking Division (ND)
Intel Corporation
[email protected]
(503) 712-4565

-----Original Message-----
From: Shinae Woo [mailto:[email protected]]
Sent: Wednesday, March 20, 2013 8:59 PM
To: [email protected]
Subject: [E1000-devel] Low receive performance with multiple RSS queue

Hello, all,

We're observing a weird problem with receive-side scaling (RSS) on an 82599-based Intel NIC. In a nutshell, we do not see 10 Gbps for 64-byte packet RX if we configure the NIC to use multiple (>1) RSS hardware queues, while we _do_ see 10 Gbps with a single RSS queue (on one CPU core). For packet sizes larger than 80 bytes, we achieve line rate (10 Gbps) for packet RX regardless of the number of RSS queues. We're wondering if this is a hardware problem or if we missed something in the driver.

We use a modified ixgbe driver (the PacketShader IO engine) to bypass the severe kernel-level memory-management overhead at small packet sizes, and we use batch processing of received packets (like NAPI). What we observe is summarized as follows:

* With 1 RSS queue on 1 CPU core, we do not see a single packet loss at an input rate of 10 Gbps with all-64-byte packets.
* With 6 RSS queues on 6 CPU cores, we see up to 10% packet drops at the NIC (64-byte packets).
  - The loss rate increases as we increase the number of RSS queues from 2 to 6.
  - However, even when we see packet drops, the RX descriptor ring is almost always empty (not full).

We use a machine with two Intel Xeon X5690 CPUs and an Intel NIC with the 82599 chipset (Linux 2.6.32-42-server, Ubuntu 12.04).
The PacketShader IO engine is based on ixgbe 2.0.38.2 (PSIO: http://shader.kaist.edu/packetshader/io_engine/index.html), but even after porting PSIO to ixgbe 3.12 we see a similar problem. Similar IO libraries, such as netmap (http://info.iet.unipi.it/~luigi/netmap/) and PF_RING (http://www.ntop.org/products/pf_ring/), show the same trend: the packet drop rate increases as we increase the number of RSS queues and use more CPU cores, which is counterintuitive, since more CPU cores should improve performance. This is why we suspect the problem is somehow related to the hardware (RSS) when the packet size is small (< 80 bytes).

We have attached a file that shows the packet RX performance of PSIO, netmap, and PF_RING with various packet sizes and numbers of RSS queues:

PSIO-0.2 (based on ixgbe 2.0.38.2)
PF_RING-5.5.2 (based on ixgbe-3.11.33)
NetMap-20120813 (based on ixgbe in kernel 3.2.9-k2)

Please let us know if you have experienced a similar problem or have any clue as to what's going on. We'll greatly appreciate your help.

Regards,
Shinae Woo

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
