On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces at dpdk.org on behalf of keith.wiles at intel.com> wrote:
>
>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>
>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>>
>>>
>>> Right now I do not know what the issue is with the system. Could be too
>>> many Rx/Tx ring pairs per port and limiting the memory in the NICs, which
>>> is why you get better performance when you have 8 cores per port. I am not
>>> really seeing the whole picture and how DPDK is configured to help more.
>>> Sorry.
>>
>>I doubt that there is a limitation with running 16 cores per port vs 8
>>cores per port, as I've tried with two different machines connected
>>back to back, each with one X710 port and 16 cores running on that
>>port. In that case our performance doubled as expected.
>>
>>>
>>> Maybe seeing the DPDK command line would help.
>>
>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 -- --qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>
>>Our own qmap args allow the user to control exactly how cores are
>>split between ports. In this case we end up with:
>>
>>warp17> show port map
>>Port 0[socket: 0]:
>>  Core 4[socket:0] (Tx: 0, Rx: 0)
>>  Core 5[socket:0] (Tx: 1, Rx: 1)
>>  Core 6[socket:0] (Tx: 2, Rx: 2)
>>  Core 7[socket:0] (Tx: 3, Rx: 3)
>>  Core 8[socket:0] (Tx: 4, Rx: 4)
>>  Core 9[socket:0] (Tx: 5, Rx: 5)
>>  Core 20[socket:0] (Tx: 6, Rx: 6)
>>  Core 21[socket:0] (Tx: 7, Rx: 7)
>>  Core 22[socket:0] (Tx: 8, Rx: 8)
>>  Core 23[socket:0] (Tx: 9, Rx: 9)
>>  Core 24[socket:0] (Tx: 10, Rx: 10)
>>  Core 25[socket:0] (Tx: 11, Rx: 11)
>>  Core 26[socket:0] (Tx: 12, Rx: 12)
>>  Core 27[socket:0] (Tx: 13, Rx: 13)
>>  Core 28[socket:0] (Tx: 14, Rx: 14)
>>  Core 29[socket:0] (Tx: 15, Rx: 15)
>>
>>Port 1[socket: 1]:
>>  Core 10[socket:1] (Tx: 0, Rx: 0)
>>  Core 11[socket:1] (Tx: 1, Rx: 1)
>>  Core 12[socket:1] (Tx: 2, Rx: 2)
>>  Core 13[socket:1] (Tx: 3, Rx: 3)
>>  Core 14[socket:1] (Tx: 4, Rx: 4)
>>  Core 15[socket:1] (Tx: 5, Rx: 5)
>>  Core 16[socket:1] (Tx: 6, Rx: 6)
>>  Core 17[socket:1] (Tx: 7, Rx: 7)
>>  Core 18[socket:1] (Tx: 8, Rx: 8)
>>  Core 19[socket:1] (Tx: 9, Rx: 9)
>>  Core 30[socket:1] (Tx: 10, Rx: 10)
>>  Core 31[socket:1] (Tx: 11, Rx: 11)
>>  Core 32[socket:1] (Tx: 12, Rx: 12)
>>  Core 33[socket:1] (Tx: 13, Rx: 13)
>>  Core 34[socket:1] (Tx: 14, Rx: 14)
>>  Core 35[socket:1] (Tx: 15, Rx: 15)
>
>On each socket you have 10 physical cores, or 20 lcores per socket, for
>40 lcores total.
>
>The above is listing the LCORES (or hyper-threads) and not COREs, which I
>understand some like to think are interchangeable. The problem is that the
>hyper-threads are logically interchangeable, but not performance-wise. If you
>have two run-to-completion threads on a single physical core, each on a
>different hyper-thread of that core [0,1], then the second lcore or thread (1)
>on that physical core will only get at most about 20-30% of the CPU cycles.
>Normally it is much less, unless you tune the code to make sure each thread is
>not trying to share the internal execution units, but some internal execution
>units are always shared.
>
>To get the best performance when hyper-threading is enabled, do not run both
>threads on a single physical core; run only hyper-thread 0 of each core.
>
>The table below lists a physical core id and its lcore ids on each socket.
>Use the first lcore of each physical core for the best performance:
>Core 1   [1, 21]   [11, 31]
>Use lcore 1 or 11 depending on the socket you are on.
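
The sibling grouping that the cpu_layout.py table (quoted further down) shows can also be checked programmatically. Below is a minimal C sketch, not part of DPDK or warp17, that reads the Linux sysfs topology files and prints which socket and physical core each logical CPU (lcore) belongs to; lcores that report the same socket/core pair are hyper-thread siblings, and only one of each pair should go into the coremask. The 64-CPU scan limit is an arbitrary choice for the example.

/* Sketch: group logical CPUs by (socket, physical core) using Linux sysfs,
 * similar to what tools/cpu_layout.py reports from /proc/cpuinfo. */
#include <stdio.h>

static int read_topology(unsigned cpu, const char *file)
{
    char path[128];
    int val = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/topology/%s", cpu, file);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;          /* this logical CPU id does not exist */
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

int main(void)
{
    unsigned cpu;

    printf("lcore -> socket, physical core\n");
    for (cpu = 0; cpu < 64; cpu++) {
        int socket = read_topology(cpu, "physical_package_id");
        int core   = read_topology(cpu, "core_id");

        if (socket < 0 || core < 0)
            continue;       /* skip absent CPUs */
        printf("lcore %2u -> socket %d, core %2d\n", cpu, socket, core);
    }
    return 0;
}

On the box described above this would show, for example, lcores 4 and 24 both mapping to socket 0, core 4, which is why running both of them hot halves the useful cycles of that core.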
>
>The info below is most likely the best performance and utilization of your
>system, if I got the values right?
>
>./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 -- --qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>
>Port 0[socket: 0]:
>  Core 2[socket:0] (Tx: 0, Rx: 0)
>  Core 3[socket:0] (Tx: 1, Rx: 1)
>  Core 4[socket:0] (Tx: 2, Rx: 2)
>  Core 5[socket:0] (Tx: 3, Rx: 3)
>  Core 6[socket:0] (Tx: 4, Rx: 4)
>  Core 7[socket:0] (Tx: 5, Rx: 5)
>  Core 8[socket:0] (Tx: 6, Rx: 6)
>  Core 9[socket:0] (Tx: 7, Rx: 7)
>
>Eight cores on the first socket, leaving the first core or two lcores (0-1)
>for Linux.
>
>Port 1[socket: 1]:
>  Core 10[socket:1] (Tx: 0, Rx: 0)
>  Core 11[socket:1] (Tx: 1, Rx: 1)
>  Core 12[socket:1] (Tx: 2, Rx: 2)
>  Core 13[socket:1] (Tx: 3, Rx: 3)
>  Core 14[socket:1] (Tx: 4, Rx: 4)
>  Core 15[socket:1] (Tx: 5, Rx: 5)
>  Core 16[socket:1] (Tx: 6, Rx: 6)
>  Core 17[socket:1] (Tx: 7, Rx: 7)
>  Core 18[socket:1] (Tx: 8, Rx: 8)
>  Core 19[socket:1] (Tx: 9, Rx: 9)
>
>All 10 cores on the second socket.
>
>++Keith
>
>>
>>Just for reference, the cpu_layout script shows:
>>$ $RTE_SDK/tools/cpu_layout.py
>>============================================================
>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>============================================================
>>
>>cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>sockets = [0, 1]
>>
>>         Socket 0   Socket 1
>>         --------   --------
>>Core 0   [0, 20]    [10, 30]
>>Core 1   [1, 21]    [11, 31]
>>Core 2   [2, 22]    [12, 32]
>>Core 3   [3, 23]    [13, 33]
>>Core 4   [4, 24]    [14, 34]
>>Core 8   [5, 25]    [15, 35]
>>Core 9   [6, 26]    [16, 36]
>>Core 10  [7, 27]    [17, 37]
>>Core 11  [8, 28]    [18, 38]
>>Core 12  [9, 29]    [19, 39]
>>
>>I know it might be complicated to figure out exactly what's happening
>>in our setup with our own code, so please let me know if you need
>>additional information.
>>
>>I appreciate the help!
>>
>>Thanks,
>>Dumitru
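
A closing note on the hex values: the -c argument is a standard DPDK coremask in which bit N selects lcore N. Judging from the port maps shown above, warp17's --qmap values appear to use the same bit-per-lcore convention, although that is an assumption here rather than something stated in the thread. A small C sketch for turning a list of lcore ids into such a mask:

/* Sketch: build a bit-per-lcore hex mask (bit N set => lcore N enabled),
 * the format used by the DPDK -c option. */
#include <stdio.h>

static unsigned long long lcore_mask(const unsigned *lcores, unsigned n)
{
    unsigned long long mask = 0;
    unsigned i;

    for (i = 0; i < n; i++)
        mask |= 1ULL << lcores[i];
    return mask;
}

int main(void)
{
    /* Example: lcores 2-9 on socket 0 and lcores 10-19 on socket 1. */
    unsigned port0[] = { 2, 3, 4, 5, 6, 7, 8, 9 };
    unsigned port1[] = { 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 };

    printf("port 0 mask: 0x%llX\n",
           lcore_mask(port0, sizeof(port0) / sizeof(port0[0])));   /* 0x3FC   */
    printf("port 1 mask: 0x%llX\n",
           lcore_mask(port1, sizeof(port1) / sizeof(port1[0])));   /* 0xFFC00 */
    return 0;
}

With the sample lists above it prints 0x3FC for lcores 2-9 and 0xFFC00 for lcores 10-19, which makes it easy to sanity-check any coremask or queue map before launching.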