On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>>
>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces at
>> dpdk.org on behalf of keith.wiles at intel.com> wrote:
>>
>>>
>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>>
>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles at intel.com>
>>>>wrote:
>>>>
>>>>>
>>>>> Right now I do not know what the issue is with the system. It could be
>>>>> too many Rx/Tx ring pairs per port limiting the memory in the NICs,
>>>>> which is why you get better performance when you have 8 cores per port.
>>>>> I am not really seeing the whole picture of how DPDK is configured, so
>>>>> it is hard to help more. Sorry.
>>>>
>>>>I doubt that there is a limitation in running 16 cores per port vs 8
>>>>cores per port: I've tried with two different machines connected back to
>>>>back, each with one X710 port and 16 cores running on that port. In that
>>>>case our performance doubled as expected.
>>>>
>>>>>
>>>>> Maybe seeing the DPDK command line would help.
>>>>
>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>
>>>>./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>
>>>>Our own qmap args allow the user to control exactly how cores are split
>>>>between ports. In this case we end up with:
>>>>
>>>>warp17> show port map
>>>>Port 0[socket: 0]:
>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>
>>>>Port 1[socket: 1]:
>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>
>>>On each socket you have 10 physical cores, i.e. 20 lcores per socket and
>>>40 lcores total.
>>>
>>>The list above shows LCORES (hyper-threads), not COREs, even though some
>>>like to treat the two as interchangeable. Hyper-threads are logically
>>>interchangeable, but not performance-wise. If you run two
>>>run-to-completion threads on a single physical core, one on each
>>>hyper-thread of that core [0,1], then the second lcore or thread (1) on
>>>that physical core will get at most about 20-30% of the CPU cycles.
>>>Normally it is much less, unless you tune the code so the two threads do
>>>not compete for the internal execution units, and some internal execution
>>>units are always shared.
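To make the hyper-thread pairing concrete, here is a minimal sketch (not
WARP17 or DPDK code, just an illustration assuming the Linux sysfs topology
files and the 40 lcores on this box) that keeps only the first hyper-thread
of every physical core when building a coremask:

/*
 * Minimal sketch, not WARP17/DPDK code: build a coremask that keeps only
 * the first hyper-thread of each physical core, using the Linux sysfs
 * topology files. Assumes at most 64 lcores (this box has 40).
 */
#include <stdio.h>

#define MAX_LCORES 40

static int first_sibling(int lcore)
{
    char path[128];
    int first = lcore;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
             lcore);
    f = fopen(path, "r");
    if (f != NULL) {
        /* The sibling list is ascending, e.g. "1,21", so the first entry
         * is the lowest lcore id sharing this physical core. */
        if (fscanf(f, "%d", &first) != 1)
            first = lcore;
        fclose(f);
    }
    return first;
}

int main(void)
{
    unsigned long long mask = 0;
    int lcore;

    for (lcore = 0; lcore < MAX_LCORES; lcore++)
        if (first_sibling(lcore) == lcore)  /* skip the second hyper-thread */
            mask |= 1ULL << lcore;

    printf("coremask without hyper-thread siblings: 0x%llx\n", mask);
    return 0;
}

On the cpu_layout quoted further down this prints 0xfffff, i.e. lcores 0-19,
one per physical core.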
>>>
>>>To get the best performance when hyper-threading is enabled, do not run
>>>both threads on a single physical core; run only hyper-thread 0.
>>>
>>>The cpu_layout table quoted below lists the physical core id and the
>>>lcore ids per socket. Use the first lcore of each pair for the best
>>>performance:
>>>
>>>Core 1   [1, 21]   [11, 31]
>>>
>>>Use lcore 1 or 11 depending on the socket you are on.
>>>
>>>The setup below most likely gives the best performance and utilization of
>>>your system, if I got the values right:
>>>
>>>./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>
>>>Port 0[socket: 0]:
>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>
>>>8 cores on the first socket, leaving 0-1 lcores for Linux.
>>
>> 9 cores, leaving the first core or two lcores for Linux.
>>
>>>
>>>Port 1[socket: 1]:
>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>
>>>All 10 cores on the second socket.
>
>The values were almost right :) But that's because we reserve the first
>two lcores that are passed to DPDK for our own management part. I was
>aware that lcores are not physical cores, so we don't expect performance
>to scale linearly with the number of lcores. However, if there's a chance
>that another hyper-thread can run while its paired one is stalling, we'd
>like to take advantage of those cycles if possible.
>
>Leaving that aside, I just ran two more tests using only one of the two
>hw threads in each core.
>
>a. 2 ports on different sockets with 8 cores/port:
>
>./build/warp17 -c 0xFF3FF -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>-- --qmap 0.0x3FC --qmap 1.0xFF000
>
>warp17> show port map
>Port 0[socket: 0]:
>   Core 2[socket:0] (Tx: 0, Rx: 0)
>   Core 3[socket:0] (Tx: 1, Rx: 1)
>   Core 4[socket:0] (Tx: 2, Rx: 2)
>   Core 5[socket:0] (Tx: 3, Rx: 3)
>   Core 6[socket:0] (Tx: 4, Rx: 4)
>   Core 7[socket:0] (Tx: 5, Rx: 5)
>   Core 8[socket:0] (Tx: 6, Rx: 6)
>   Core 9[socket:0] (Tx: 7, Rx: 7)
>
>Port 1[socket: 1]:
>   Core 12[socket:1] (Tx: 0, Rx: 0)
>   Core 13[socket:1] (Tx: 1, Rx: 1)
>   Core 14[socket:1] (Tx: 2, Rx: 2)
>   Core 15[socket:1] (Tx: 3, Rx: 3)
>   Core 16[socket:1] (Tx: 4, Rx: 4)
>   Core 17[socket:1] (Tx: 5, Rx: 5)
>   Core 18[socket:1] (Tx: 6, Rx: 6)
>   Core 19[socket:1] (Tx: 7, Rx: 7)
>
>This gives a session setup rate of only 2M sessions/s.
>
>b. 2 ports on socket 0 with 4 cores/port:
>
>./build/warp17 -c 0x3FF -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>--qmap 0.0x3C0 --qmap 1.0x03C

One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to
make sure the memory is split between the sockets. You may need to remove
the /dev/hugepages/* files first, or wherever you put them.

What is the DPDK -n option set to on your system? Mine is set to "-n 4".
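In case it helps, a minimal sketch (not WARP17's actual init code; the
coremask and whitelist values are just placeholders taken from this thread)
of the EAL arguments with per-socket memory and an explicit channel count:

#include <rte_eal.h>
#include <rte_debug.h>

int main(void)
{
    /* Placeholder EAL arguments; substitute the real coremask/devices. */
    static char *eal_argv[] = {
        "warp17",
        "-c", "0xFF3FF",               /* coremask, e.g. test a above     */
        "-n", "4",                     /* memory channels ("-n 4")        */
        "--socket-mem", "16384,16384", /* 16G on socket 0, 16G on socket 1 */
        "-w", "0000:01:00.3",          /* PCI whitelist                   */
        "-w", "0000:81:00.3",
    };
    int eal_argc = sizeof(eal_argv) / sizeof(eal_argv[0]);

    if (rte_eal_init(eal_argc, eal_argv) < 0)
        rte_panic("Cannot init EAL\n");

    /* ... the rest of the application setup follows ... */
    return 0;
}

On the existing command line the same thing is just replacing "-m 32768"
with "--socket-mem 16384,16384" (and optionally "-n 4") before the "--"
separator.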
>warp17> show port map
>Port 0[socket: 0]:
>   Core 6[socket:0] (Tx: 0, Rx: 0)
>   Core 7[socket:0] (Tx: 1, Rx: 1)
>   Core 8[socket:0] (Tx: 2, Rx: 2)
>   Core 9[socket:0] (Tx: 3, Rx: 3)
>
>Port 1[socket: 0]:
>   Core 2[socket:0] (Tx: 0, Rx: 0)
>   Core 3[socket:0] (Tx: 1, Rx: 1)
>   Core 4[socket:0] (Tx: 2, Rx: 2)
>   Core 5[socket:0] (Tx: 3, Rx: 3)
>
>Surprisingly, this gives a session setup rate of 3M sess/s!!
>
>The packet processing cores are totally independent and only access
>local-socket memory/ports.
>There is no locking or atomic variable access in the fast path in our code.
>The mbuf pools are not shared between cores handling the same port, so
>there should be no contention when allocating/freeing mbufs.
>In this specific test scenario all the cores handling port 0 are
>essentially executing the same code (TCP clients), and the cores on
>port 1 as well (TCP servers).
>
>Do you have any tips about what other things to check for?
>
>Thanks,
>Dumitru
>
>>>
>>>++Keith
>>>
>>>>
>>>>Just for reference, the cpu_layout script shows:
>>>>
>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>============================================================
>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>============================================================
>>>>
>>>>cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>sockets = [0, 1]
>>>>
>>>>         Socket 0   Socket 1
>>>>         --------   --------
>>>>Core 0   [0, 20]    [10, 30]
>>>>Core 1   [1, 21]    [11, 31]
>>>>Core 2   [2, 22]    [12, 32]
>>>>Core 3   [3, 23]    [13, 33]
>>>>Core 4   [4, 24]    [14, 34]
>>>>Core 8   [5, 25]    [15, 35]
>>>>Core 9   [6, 26]    [16, 36]
>>>>Core 10  [7, 27]    [17, 37]
>>>>Core 11  [8, 28]    [18, 38]
>>>>Core 12  [9, 29]    [19, 39]
>>>>
>>>>I know it might be complicated to figure out exactly what's happening
>>>>in our setup with our own code, so please let me know if you need
>>>>additional information.
>>>>
>>>>I appreciate the help!
>>>>
>>>>Thanks,
>>>>Dumitru
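Regarding the per-core mbuf pools: I don't know your code, but for reference
this is roughly the shape I would expect, a minimal sketch (names and sizes
are made up) that gives every port/queue pair its own pool allocated on the
port's NUMA socket, so there is no pool contention and no cross-socket mbuf
traffic:

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define NB_MBUF     8192   /* made-up pool size, tune to your traffic */
#define CACHE_SIZE  256    /* per-lcore mempool cache */

static struct rte_mempool *
per_queue_pool(uint8_t port, uint16_t queue)
{
    char name[RTE_MEMPOOL_NAMESIZE];
    int socket = rte_eth_dev_socket_id(port);  /* NUMA node of the NIC */

    snprintf(name, sizeof(name), "mbuf_p%u_q%u", port, queue);
    return rte_pktmbuf_pool_create(name, NB_MBUF, CACHE_SIZE,
                                   0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);
}

Each such pool would then be passed to rte_eth_rx_queue_setup() for that one
queue only, so the core owning the queue never shares it with anyone else.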