Xiaozhou, I will take this thread offline and mail you. I promise to post the solution back to the list for future reference. I do not want to spam everyone.
Thx!

From: Xiaozhou Li [mailto:x...@cs.princeton.edu]
Sent: Friday, August 14, 2015 7:11 AM
To: Gilad Berman <giladb at mellanox.com>
Cc: Xu, Qian Q <qian.q.xu at intel.com>; dev at dpdk.org
Subject: Re: [dpdk-dev] Performance issues with Mellanox Connectx-3 EN

Hi Qian and Gilad,

Thanks for your reply. We are using dpdk-2.0.0 and mlnx-en-2.4-1.0.0.1 on a Mellanox Connectx-3 EN with a single 40G port. I ran testpmd on the server with the following commands:

sudo ./testpmd -c 0xff -n 4 -- -i --portmask=0x1 --port-topology=chained --rxq=4 --txq=4 --nb-cores=4
set fwd macswap

I have multiple clients send packets and receive replies. The server throughput is still only about 2Mpps. Testpmd shows no RX-dropped packets, but "ifconfig port" shows many dropped packets.

Please let me know if I am doing anything wrong and what else I should check. I am also copying the output from starting testpmd at the end of this email. Not sure if there is any useful information in it.

Thanks!
Xiaozhou

EAL: Detected lcore 0 as core 0 on socket 0
... (omit) ...
EAL: Detected 32 lcore(s)
EAL: VFIO modules not all loaded, skip VFIO support...
EAL: Setting up memory...
... (omit) ...
EAL: Ask a virtual area of 0xa00000 bytes
EAL: Virtual area found at 0x7f2d2fe00000 (size = 0xa00000)
EAL: Requesting 8192 pages of size 2MB from socket 0
EAL: Requesting 8192 pages of size 2MB from socket 1
EAL: TSC frequency is ~2199994 KHz
EAL: Master lcore 0 is ready (tid=39add900;cpuset=[0])
PMD: ENICPMD trace: rte_enic_pmd_init
EAL: lcore 4 is ready (tid=3676b700;cpuset=[4])
EAL: lcore 6 is ready (tid=35769700;cpuset=[6])
EAL: lcore 5 is ready (tid=35f6a700;cpuset=[5])
EAL: lcore 2 is ready (tid=3776d700;cpuset=[2])
EAL: lcore 1 is ready (tid=37f6e700;cpuset=[1])
EAL: lcore 3 is ready (tid=36f6c700;cpuset=[3])
EAL: lcore 7 is ready (tid=34f68700;cpuset=[7])
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:06:00.0 on NUMA socket 0
EAL: probe driver: 15b3:1003 librte_pmd_mlx4
PMD: librte_pmd_mlx4: PCI information matches, using device "mlx4_0" (VF: false)
PMD: librte_pmd_mlx4: 1 port(s) detected
PMD: librte_pmd_mlx4: port 1 MAC address is f4:52:14:5a:8f:70
EAL: PCI device 0000:81:00.0 on NUMA socket 1
EAL: probe driver: 8086:1528 rte_ixgbe_pmd
EAL: Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:81:00.1 on NUMA socket 1
EAL: probe driver: 8086:1528 rte_ixgbe_pmd
EAL: Not managed by a supported kernel driver, skipped
Interactive-mode selected
Configuring Port 0 (socket 0)
PMD: librte_pmd_mlx4: 0x884360: TX queues number update: 0 -> 4
PMD: librte_pmd_mlx4: 0x884360: RX queues number update: 0 -> 4
Port 0: F4:52:14:5A:8F:70
Checking link statuses...
Port 0 Link Up - speed 40000 Mbps - full-duplex
Done
testpmd> show config rxtx
  macswap packet forwarding - CRC stripping disabled - packets/burst=32
  nb forwarding cores=4 - nb forwarding ports=1
  RX queues=4 - RX desc=128 - RX free threshold=0
  RX threshold registers: pthresh=0 hthresh=0 wthresh=0
  TX queues=4 - TX desc=512 - TX free threshold=0
  TX threshold registers: pthresh=0 hthresh=0 wthresh=0
  TX RS bit threshold=0 - TXQ flags=0x0
testpmd> show config fwd
macswap packet forwarding - ports=1 - cores=4 - streams=4 - NUMA support disabled, MP over anonymous pages disabled
Logical Core 1 (socket 0) forwards packets on 1 streams:
  RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Logical Core 2 (socket 0) forwards packets on 1 streams:
  RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
Logical Core 3 (socket 0) forwards packets on 1 streams:
  RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
Logical Core 4 (socket 0) forwards packets on 1 streams:
  RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00

On Thu, Aug 13, 2015 at 6:13 AM, Gilad Berman <giladb at mellanox.com> wrote:

Xiaozhou,

Following Qian's answer - 2Mpps is VERY (VERY) low and far below what we see even with a single core.
Which version of DPDK and PMD are you using? Are you using MLNX optimized libs for PMD? Can you provide more details on the exact setup?
Can you run a simple test with testpmd and see if you are getting the same results?
Just to be clear - it does not matter which version you are using, 2Mpps is very far from what you should get :)

-----Original Message-----
From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Xu, Qian Q
Sent: Thursday, August 13, 2015 6:25 AM
To: Xiaozhou Li <xl at CS.Princeton.EDU>; dev at dpdk.org
Subject: Re: [dpdk-dev] Performance issues with Mellanox Connectx-3 EN

Xiaozhou,

So it seems the performance bottleneck is not at the core. Have you checked the Mellanox NIC's configuration? How many queues per port are you using? Could you try the l3fwd example with Mellanox to check if the performance is good enough? I'm not familiar with the Mellanox NIC, but if you have tried the Intel Fortville 40G NIC, I can give more suggestions about the NIC's configuration.

Thanks
Qian

-----Original Message-----
From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Xiaozhou Li
Sent: Thursday, August 13, 2015 7:20 AM
To: dev at dpdk.org
Subject: [dpdk-dev] Performance issues with Mellanox Connectx-3 EN

Hi folks,

I am getting performance scalability issues with DPDK on Mellanox Connectx-3. Each of our machines has 16 cores and a single-port 40G Mellanox Connectx-3 EN. We find that the server throughput *does not scale* with the number of cores. With a single thread on one core, we can get about 2 Mpps with a simple echo server implementation. However, the performance number does not increase as we use more cores. Our implementation is based on the l2fwd example.

I'd greatly appreciate it if anyone could provide some insights on what might be the problem and how we can improve the performance with Mellanox Connectx-3 EN.

Thanks!
Best,
Xiaozhou
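
As a rough sanity check on the "2Mpps is very low" assessment in the replies, the sketch below computes the theoretical Ethernet line rate in packets per second for a 40G link. The thread does not state the packet size used in the test, so the frame sizes here are illustrative assumptions, and `line_rate_pps` is a hypothetical helper name. The figure is only an upper bound set by the wire; the NIC, PCIe bus, and driver may limit throughput well below it.

```python
# Bytes each frame occupies on the wire beyond the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Upper bound on packets per second for a given link speed and frame size.

    frame_bytes is the Ethernet frame size including the 4-byte FCS
    (64 bytes is the minimum standard frame).
    """
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


if __name__ == "__main__":
    link_bps = 40e9      # 40G port, as reported in the thread
    observed_pps = 2e6   # about 2 Mpps, as reported in the thread

    # Frame sizes are assumptions; the thread does not say which size was used.
    for frame_bytes in (64, 128, 256, 512, 1024, 1518):
        max_pps = line_rate_pps(link_bps, frame_bytes)
        share = observed_pps / max_pps
        print(f"{frame_bytes:5d} B: max {max_pps / 1e6:6.2f} Mpps, "
              f"2 Mpps is {share:.1%} of line rate")
```

With minimum-size frames the bound is roughly 59.5 Mpps, so 2 Mpps would be only a few percent of what the wire can carry; with full-size 1518-byte frames the bound drops to about 3.25 Mpps, which is why the packet size matters when reading the reported number.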