On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces at dpdk.org 
on behalf of keith.wiles at intel.com> wrote:

>
>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>
>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles at intel.com> 
>>wrote:
>>
>>>
>>> Right now I do not know what the issue is with the system. It could be too
>>> many Rx/Tx ring pairs per port limiting the memory in the NICs, which is
>>> why you get better performance when you have 8 cores per port. I am not
>>> really seeing the whole picture of how DPDK is configured, so it is hard
>>> to help more. Sorry.
>>
>>I doubt there is a limitation with running 16 cores per port vs 8 cores
>>per port, as I've tried two different machines connected back to back,
>>each with one X710 port and 16 cores running on that port. In that case
>>our performance doubled as expected.
>>
>>>
>>> Maybe seeing the DPDK command line would help.
>>
>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>
>>Our own qmap args allow the user to control exactly how cores are
>>split between ports. In this case we end up with:
>>
>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>
>>Port 1[socket: 1]:
>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>   Core 35[socket:1] (Tx: 15, Rx: 15)
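
Side note: the --qmap value appears to be a plain lcore bitmask, with bit N
set meaning lcore N polls that port. Below is a small, purely illustrative
Python helper (not part of warp17 or DPDK) to double-check which lcores a
given mask selects:

# Decode an lcore bitmask such as a warp17 --qmap value or the EAL -c mask.
# Assumption: bit N of the mask corresponds to lcore N.
def lcores_from_mask(mask_str):
    mask = int(mask_str, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

print(lcores_from_mask("0x003FF003F0"))  # port 0 -> lcores 4-9 and 20-29
print(lcores_from_mask("0x0FC00FFC00"))  # port 1 -> lcores 10-19 and 30-35

Those are the same lcores as in the port map above.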
>
>On each socket you have 10 physical cores, i.e. 20 lcores per socket and 40
>lcores total.
>
>The above lists LCORES (i.e. hyper-threads), not physical COREs, although I
>understand some like to treat them as interchangeable. The problem is that
>hyper-threads are logically interchangeable, but not performance-wise. If you
>run two run-to-completion threads on a single physical core, each on a
>different hyper-thread of that core [0,1], then the second lcore or thread (1)
>on that physical core will only get at most about 20-30% of the CPU cycles.
>Normally it is much less, unless you tune the code so the two threads are not
>competing for the internal execution units, but some execution units are
>always shared.
>
>To get the best performance when hyper-threading is enabled, do not run both
>threads on a single physical core; run only hyper-thread 0.
>
>The table below lists the physical core id and the lcore ids per socket. For
>the best performance use only the first lcore of each physical core, e.g.:
>Core 1 [1, 21]    [11, 31]
>Use lcore 1 or lcore 11, depending on which socket you are on.
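
As a rough illustration (assuming a Linux host), the first hyper-thread of
each physical core can be read straight out of sysfs; the sketch below simply
keeps the first entry of each thread_siblings_list:

# List one lcore per physical core by keeping the first sibling listed in
# sysfs. Sketch only; assumes the usual "a,b" or "a-b" formatting.
import glob

def primary_lcores():
    primaries = set()
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        with open(path) as f:
            first = f.read().strip().replace("-", ",").split(",")[0]
        primaries.add(int(first))
    return sorted(primaries)

print(primary_lcores())  # on this box: lcores 0-19 (20-39 are the siblings)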
>
>The configuration below should give the best performance and utilization of
>your system, if I got the values right:
>
>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>
>Port 0[socket: 0]:
>   Core 2[socket:0] (Tx: 0, Rx: 0)
>   Core 3[socket:0] (Tx: 1, Rx: 1)
>   Core 4[socket:0] (Tx: 2, Rx: 2)
>   Core 5[socket:0] (Tx: 3, Rx: 3)
>   Core 6[socket:0] (Tx: 4, Rx: 4)
>   Core 7[socket:0] (Tx: 5, Rx: 5)
>   Core 8[socket:0] (Tx: 6, Rx: 6)
>   Core 9[socket:0] (Tx: 7, Rx: 7)
>
>8 cores on first socket leaving 0-1 lcores for Linux.

Correction: 9 cores, leaving the first core (two lcores) for Linux.
>
>Port 1[socket: 1]:
>   Core 10[socket:1] (Tx: 0, Rx: 0)
>   Core 11[socket:1] (Tx: 1, Rx: 1)
>   Core 12[socket:1] (Tx: 2, Rx: 2)
>   Core 13[socket:1] (Tx: 3, Rx: 3)
>   Core 14[socket:1] (Tx: 4, Rx: 4)
>   Core 15[socket:1] (Tx: 5, Rx: 5)
>   Core 16[socket:1] (Tx: 6, Rx: 6)
>   Core 17[socket:1] (Tx: 7, Rx: 7)
>   Core 18[socket:1] (Tx: 8, Rx: 8)
>   Core 19[socket:1] (Tx: 9, Rx: 9)
>
>All 10 cores on the second socket.
>
>++Keith
>
>>
>>Just for reference, the cpu_layout script shows:
>>$ $RTE_SDK/tools/cpu_layout.py
>>============================================================
>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>============================================================
>>
>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>sockets =  [0, 1]
>>
>>        Socket 0        Socket 1
>>        --------        --------
>>Core 0  [0, 20]         [10, 30]
>>Core 1  [1, 21]         [11, 31]
>>Core 2  [2, 22]         [12, 32]
>>Core 3  [3, 23]         [13, 33]
>>Core 4  [4, 24]         [14, 34]
>>Core 8  [5, 25]         [15, 35]
>>Core 9  [6, 26]         [16, 36]
>>Core 10 [7, 27]         [17, 37]
>>Core 11 [8, 28]         [18, 38]
>>Core 12 [9, 29]         [19, 39]
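
Turning that "first sibling per physical core" rule into hex masks can be done
mechanically; here is an illustrative sketch with the layout above hard-coded
(the resulting masks are only candidates, not necessarily the exact values
suggested earlier in the thread):

# Build hex lcore masks from the first hyper-thread of each physical core,
# using the socket/core layout printed by cpu_layout.py above.
layout = {
    0: [[0, 20], [1, 21], [2, 22], [3, 23], [4, 24],
        [5, 25], [6, 26], [7, 27], [8, 28], [9, 29]],
    1: [[10, 30], [11, 31], [12, 32], [13, 33], [14, 34],
        [15, 35], [16, 36], [17, 37], [18, 38], [19, 39]],
}

def mask_of(lcores):
    mask = 0
    for lcore in lcores:
        mask |= 1 << lcore
    return hex(mask)

socket0 = [siblings[0] for siblings in layout[0]]  # lcores 0-9
socket1 = [siblings[0] for siblings in layout[1]]  # lcores 10-19
print(mask_of(socket0))            # 0x3ff   -> candidate --qmap for the socket 0 port
print(mask_of(socket1))            # 0xffc00 -> candidate --qmap for the socket 1 port
print(mask_of(socket0 + socket1))  # 0xfffff -> candidate -c mask, before reserving
                                   #            any lcores for Linux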
>>
>>I know it might be complicated to figure out exactly what's happening
>>in our setup with our own code, so please let me know if you need
>>additional information.
>>
>>I appreciate the help!
>>
>>Thanks,
>>Dumitru
>>


