On 15/05/2015 20:50, "Pravin Shelar" <pshe...@nicira.com> wrote:

>On Thu, Apr 23, 2015 at 11:40 AM, Daniele Di Proietto
><diproiet...@vmware.com> wrote:
>> Initializing the dp_packet's metadata can be a hot spot, especially
>> for very simple pipelines.  Therefore improving the code here can
>> sometimes make a difference.
>>
>> Using memcpy instead of a plain assignment helps GCC and clang generate
>> faster code. Here's a comparison of the compiler-generated code (GCC 4.8)
>> with and without this commit.
>>
>> BEFORE (assignment)                 |     AFTER (memcpy)
>>
>> c8:  add    $0x8,%r8                |   d8:  mov    (%rsi),%r8
>>      mov    (%rcx),%r9              |        mov    (%rdx),%rdi
>>      mov    (%rbx),%r11d            |        add    $0x1,%ecx
>>      mov    %r10,%rcx               |        add    $0x8,%rsi
>>      cmp    %rsi,%r8                |        cmp    -0x870(%rbp),%ecx
>>      lea    0x88(%r9),%rdi          |        mov    %rdi,0x88(%r8)
>>      rep    stos %rax,%es:(%rdi)    |        mov    0x8(%rdx),%rdi
>>      mov    %r11d,0xb8(%r9)         |        lea    0x88(%r8),%rax
>>      mov    %r8,%rcx                |        mov    %rdi,0x90(%r8)
>>      jne    c8                      |        mov    0x10(%rdx),%rdi
>>                                     |        mov    %rdi,0x98(%r8)
>>                                     |        mov    0x18(%rdx),%rdi
>>                                     |        mov    %rdi,0xa0(%r8)
>>                                     |        mov    0x20(%rdx),%r8
>>                                     |        mov    %r8,0x20(%rax)
>>                                     |        mov    0x28(%rdx),%r8
>>                                     |        mov    %r8,0x28(%rax)
>>                                     |        mov    0x30(%rdx),%r8
>>                                     |        mov    %r8,0x30(%rax)
>>                                     |        jl     d8
>>
>> The old code uses a 'rep stos' and fetches the 'port_no' value from
>> the 'port' member at every iteration ('mov (%rbx),%r11d'), while the
>> new code uses a series of mov operations to accomplish everything.
>>
>> I can measure a throughput improvement of ~7% on a single flow phy-phy
>> test with 64-byte UDP packets.
>>
>> The improvement has been observed on an Intel Xeon Sandy Bridge (2012)
>> and on an Intel Xeon Westmere (2010).
>>
>> Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
>> ---
>>  lib/dpif-netdev.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index f1d65f5..7d55997 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -2507,13 +2507,16 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>      error = netdev_rxq_recv(rxq, packets, &cnt);
>>      cycles_count_end(pmd, PMD_CYCLES_POLLING);
>>      if (!error) {
>> +        const struct pkt_metadata md = PKT_METADATA_INITIALIZER(port->port_no);
>This change looks good. But I think we can improve it even more by
>replacing port->port_no with a pkt_metadata, so that we do not need to
>initialize this structure on every packet receive.

You're right, it is indeed slightly faster (and an assignment is fine, we
do not need an explicit memcpy).  I'll replace this commit with another one.
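
For reference, here is a rough sketch of the idea suggested above: the port
keeps a pre-initialized metadata template, and the receive path only copies
it into each packet.  This is not the follow-up patch.  The struct layouts
are simplified stand-ins for the real ones, and 'struct port', port_init()
and stamp_batch() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the real struct pkt_metadata (assumption:
 * the real structure has more fields, e.g. tunnel information). */
struct pkt_metadata {
    uint32_t recirc_id;
    uint32_t dp_hash;
    uint32_t skb_priority;
    uint32_t pkt_mark;
    uint32_t in_port;
};

/* Simplified stand-in for struct dp_packet: only the metadata. */
struct dp_packet {
    struct pkt_metadata md;
};

/* Hypothetical port structure that caches a ready-to-copy metadata
 * template instead of only the port number. */
struct port {
    struct pkt_metadata md;
};

/* Builds the template once, when the port is set up. */
static void
port_init(struct port *port, uint32_t port_no)
{
    memset(&port->md, 0, sizeof port->md);
    port->md.in_port = port_no;
}

/* Per-batch work on the receive path: a plain struct assignment per
 * packet, with nothing to initialize inside the loop. */
static void
stamp_batch(const struct port *port, struct dp_packet **packets, size_t cnt)
{
    assert(port != NULL && packets != NULL);
    for (size_t i = 0; i < cnt; i++) {
        packets[i]->md = port->md;
    }
}
```

With this layout the only per-packet cost is the copy itself; whether the
compiler emits the same mov sequence as in the listing above would need to
be checked in the generated code.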

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
