[dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD

Zoltan Kiss Wed, 26 Aug 2015 19:49:45 +0100

Hi,

I've checked it further, based on Stephen's suggestion I've tried perf 
top as well. The results were the same, it spends a lot of time in that 
part of the code, and there are high number of branch load misses 
(BR_MISS_PRED_RETIRED) around there too.
I've also started to strip down miniflow_extract() to remove parts which 
are not relevant to this very simple testcase. I've removed the metadata 
checking branches and the "size < sizeof(struct eth_header)". I've 
removed the size check from emc_processing, and placed log messages in 
flow_extract and netdev_flow_key_from_flow, to make sure the excessive 
time spent in miniflow_extract is not because these two are somehow 
calling it.
That way I've closed out all of the branches preceding this instruction. 
Oddly the high sample number now moved down a few instructions:
...
dp_packet_reset_offsets
   5113eb:       b8 ff ff ff ff          mov    $0xffffffff,%eax
   5113f0:       66 89 8f 86 00 00 00    mov    %cx,0x86(%rdi)
   5113f7:       c6 87 81 00 00 00 00    movb   $0x0,0x81(%rdi)
   5113fe:       66 89 87 82 00 00 00    mov    %ax,0x82(%rdi)
data_pull
   511405:       48 8d 4d 0c             lea    0xc(%rbp),%rcx
dp_packet_reset_offsets
   511409:       66 89 97 84 00 00 00    mov    %dx,0x84(%rdi)
memcpy
   511410:       48 8b 45 00             mov    0x0(%rbp),%rax
   511414:       48 89 46 18             mov    %rax,0x18(%rsi)


This last instruction moves the first 8 bytes of the MAC address (coming 
from 0x0(%rbp)) to 0x18(%rsi), which is basically memory pointed by 
parameter "struct miniflow *dst". It is allocated on the stack by 
emc_processing.
I couldn't find any branch which can cause this miss, but then I've 
checked the PMD stats:

pmd thread numa_id 0 core_id 1:
        emc hits:4395834176
        megaflow hits:1
        miss:1
        lost:0
        polling cycles:166083129380 (16.65%)
        processing cycles:831536059972 (83.35%)
        avg cycles per packet: 226.95 (997619189352/4395834178)
        avg processing cycles per packet: 189.16 (831536059972/4395834178)

So everything hits EMC, when I measured the change of that counter for 
10 seconds, the result was around ~13.3 Mpps too. The cycle statistics 
shows that it should be able to handle more than 15M packets per second, 
yet it doesn't receive that much, while with the non-vector PMD it can 
max out the link.
Any more suggestions?

Regards,

Zoltan


On 21/08/15 19:05, Zoltan Kiss wrote:
> Hi,
>
> I've set up a simple packet forwarding perf test on a dual-port 10G
> 82599ES: one port receives 64 byte UDP packets, the other sends it out,
> one core used. I've used latest OVS with DPDK 2.1, and the first result
> was only 13.2 Mpps, which was a bit far from the 13.9 I've seen last
> year with the same test. The first thing I've changed was to revert back
> to the old behaviour about this issue:
>
> http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731
>
> So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
> That increased the performance to 13.5, but to figure out what's wrong
> started to play with the receive functions. First I've disabled vector
> PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5 Mpps. So
> then I've enabled scattered RX, and with
> ixgbe_recv_pkts_lro_bulk_alloc() I could manage to get 13.98 Mpps, which
> is I guess as close as possible to the 14.2 line rate (on my HW at
> least, with one core)
> Does anyone has a good explanation about why the vector PMD performs so
> significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
> should be able to reach ~14 Mpps, SG and vector PMD shouldn't make a
> difference.
> I've tried to look into it with oprofile, but the results were quite
> strange: 35% of the samples were from miniflow_extract, the part where
> parse_vlan calls data_pull to jump after the MAC addresses. The oprofile
> snippet (1M samples):
>
>    511454 19        0.0037  flow.c:511
>    511458 149       0.0292  dp-packet.h:266
>    51145f 4264      0.8357  dp-packet.h:267
>    511466 18        0.0035  dp-packet.h:268
>    51146d 43        0.0084  dp-packet.h:269
>    511474 172       0.0337  flow.c:511
>    51147a 4320      0.8467  string3.h:51
>    51147e 358763   70.3176  flow.c:99
>    511482 2        3.9e-04  string3.h:51
>    511485 3060      0.5998  string3.h:51
>    511488 1693      0.3318  string3.h:51
>    51148c 2933      0.5749  flow.c:326
>    511491 47        0.0092  flow.c:326
>
> And the corresponding disassembled code:
>
>    511454:       49 83 f9 0d             cmp    r9,0xd
>    511458:       c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
>    51145f:       66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
>    511466:       66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
>    51146d:       66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
>    511474:       0f 86 af 01 00 00       jbe    511629
> <miniflow_extract+0x279>
>    51147a:       48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
>    51147e:       4c 8d 5d 0c             lea    r11,[rbp+0xc]
>    511482:       49 89 00                mov    QWORD PTR [r8],rax
>    511485:       8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
>    511488:       41 89 40 08             mov    DWORD PTR [r8+0x8],eax
>    51148c:       44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
>    511491:       66 41 81 fa 81 00       cmp    r10w,0x81
>
> My only explanation to this so far is that I misunderstand something
> about the oprofile results.
>
> Regards,
>
> Zoltan

[dpdk-dev] OVS-DPDK performance problem on ixgbe vector PMD

Reply via email to