Re: [dpdk-dev] [PATCH v3 2/2] examples/performance-thread: add arm64 support

2017-07-04 Thread Sekhar, Ashwin
This license is already there in many files in examples/performance-
thread directory.

There are two cases in my patch.

1.
I moved some code from examples/performance-thread/common/lthread.c to 
examples/performance-thread/common/arch/x86/stack.h. 
lthread.c already has the below kind of license. So I think there is no
issue retaining the same in stack.h also.

2. 
I added the following files.
examples/performance-thread/common/arch/arm64/ctx.c  
examples/performance-thread/common/arch/arm64/ctx.h  
examples/performance-thread/common/arch/arm64/stack.h

These are actually written by me and not taken from the github link.
By mistake I copied the license entirely :)

For these files, I shall remove it and re-post a v3.

Thanks
Ashwin


On Mon, 2017-07-03 at 21:21 +, O'Driscoll, Tim wrote:
> > 
> > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Thomas
> > Monjalon
> > 
> > There can be a licensing issue here.
> > We may need advice from the Governing Board and the Technical
> > Board.
> That's correct. This uses a 2-clause BSD license, but the
> Intellectual Property Policy section in the Project Charter (http://d
> pdk.org/about/charter#ip) specifies 3-clause BSD. If you really need
> to use a new license, then you'll need to make a request to the
> Governing Board as specified in clause 6.c in the charter.
> 
> > 
> > 
> > 18/05/2017 12:21, Ashwin Sekhar T K:
> > > 
> > > +/*
> > > + * https://github.com/halayli/lthread which carries the
> > > following
> > license.
> > > 
> > > + *
> > > + * Copyright (C) 2012, Hasan Alayli 
> > > + *
> > > + * Redistribution and use in source and binary forms, with or
> > > without
> > > + * modification, are permitted provided that the following
> > > conditions
> > > + * are met:
> > > + * 1. Redistributions of source code must retain the above
> > > copyright
> > > + *notice, this list of conditions and the following
> > > disclaimer.
> > > + * 2. Redistributions in binary form must reproduce the above
> > copyright
> > > 
> > > + *notice, this list of conditions and the following
> > > disclaimer in
> > the
> > > 
> > > + *documentation and/or other materials provided with the
> > distribution.
> > > 
> > > + *
> > > + * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS
> > > IS'' AND
> > > + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
> > > TO,
> > THE
> > > 
> > > + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
> > > PARTICULAR
> > PURPOSE
> > > 
> > > + * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE
> > LIABLE
> > > 
> > > + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> > CONSEQUENTIAL
> > > 
> > > + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
> > > SUBSTITUTE
> > GOODS
> > > 
> > > + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> > INTERRUPTION)
> > > 
> > > + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
> > CONTRACT, STRICT
> > > 
> > > + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
> > > ARISING IN
> > ANY WAY
> > > 
> > > + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
> > POSSIBILITY OF
> > > 
> > > + * SUCH DAMAGE.
> > > + */

Re: [dpdk-dev] [PATCH v5 2/4] eal: move gcc version definition to common header

2017-07-04 Thread Sekhar, Ashwin
On Mon, 2017-07-03 at 22:51 +0200, Thomas Monjalon wrote:
> 12/05/2017 12:15, Ashwin Sekhar T K:
> > 
> > Moved the definition of GCC_VERSION from lib/librte_table/rte_lru.h
> > to lib/librte_eal/common/include/rte_common.h.
> > 
> > Tested compilation on:
> >  * arm64 with gcc
> >  * x86 with gcc and clang
> > 
> > Signed-off-by: Ashwin Sekhar T K 
> > Reviewed-by: Jan Viktorin 
> > ---
> > --- a/lib/librte_eal/common/include/rte_common.h
> > +++ b/lib/librte_eal/common/include/rte_common.h
> > +/** Define GCC_VERSION **/
> > +#ifdef RTE_TOOLCHAIN_GCC
> > +#define GCC_VERSION (__GNUC__ * 1 + __GNUC_MINOR__ * 100 + 
> > \
> > +   __GNUC_PATCHLEVEL__)
> > +#endif
> [...]
> > 
> > --- a/lib/librte_table/rte_lru.h
> > +++ b/lib/librte_table/rte_lru.h
> > -#ifdef __INTEL_COMPILER
> > -#define GCC_VERSION (0)
> > -#else
> > -#define GCC_VERSION (__GNUC__ * 1+__GNUC_MINOR__*100 +
> > __GNUC_PATCHLEVEL__)
> > -#endif
> The ICC check is lost when moving in rte_common.h.

All usage of GCC_VERSION is kept under #ifdef RTE_TOOLCHAIN_GCC. So the
ICC check is not required.

Ashwin

Re: [dpdk-dev] [PATCH] sched: enable neon optimizations

2017-04-27 Thread Sekhar, Ashwin
On Friday 28 April 2017 09:20 AM, Jianbo Liu wrote:
> On 27 April 2017 at 21:00, Ashwin Sekhar T K
>  wrote:
>> * Enabled CONFIG_RTE_SCHED_VECTOR for arm64
>> * Verified the changes with sched_autotest unit test case
>>
>> Signed-off-by: Ashwin Sekhar T K 
>> ---
>>  config/defconfig_arm64-armv8a-linuxapp-gcc |  2 +-
>>  lib/librte_sched/rte_sched.c   | 22 ++
>>  2 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc 
>> b/config/defconfig_arm64-armv8a-linuxapp-gcc
>> index 65888ce..021044a 100644
>> --- a/config/defconfig_arm64-armv8a-linuxapp-gcc
>> +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
>> @@ -48,4 +48,4 @@ CONFIG_RTE_LIBRTE_FM10K_PMD=n
>>  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
>>  CONFIG_RTE_LIBRTE_AVP_PMD=n
>>
>> -CONFIG_RTE_SCHED_VECTOR=n
>> +CONFIG_RTE_SCHED_VECTOR=y
>
> It's enough to remove this line only, I don't think you must enable it
> explicitly in the armv8a common config.
>
Tried removing this line from armv8a config. But in that case 
RTE_SCHED_VECTOR doesn't get defined.
./config/common_base has "CONFIG_RTE_SCHED_VECTOR=n" as the default 
setting. So enabling explicitly is required.

- Ashwin

>> diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
>> index 614705d..4ba476a 100644
>> --- a/lib/librte_sched/rte_sched.c
>> +++ b/lib/librte_sched/rte_sched.c
>> @@ -58,6 +58,8 @@
>>
>>  #if defined(__SSE4__)
>>  #define SCHED_VECTOR_SSE4
>> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
>> +#define SCHED_VECTOR_NEON
>>  #endif
>>
>>  #endif
>> @@ -1732,6 +1734,26 @@ grinder_pipe_exists(struct rte_sched_port *port, 
>> uint32_t base_pipe)
>> return 1;
>>  }
>>
>> +#elif defined(SCHED_VECTOR_NEON)
>> +
>> +static inline int
>> +grinder_pipe_exists(struct rte_sched_port *port, uint32_t base_pipe)
>> +{
>> +   uint32x4_t index, pipes;
>> +   uint32_t *pos = (uint32_t *)port->grinder_base_bmp_pos;
>> +
>> +   index = vmovq_n_u32(base_pipe);
>> +   pipes = vld1q_u32(pos);
>> +   if (!vminvq_u32(veorq_u32(pipes, index)))
>> +   return 1;
>> +
>> +   pipes = vld1q_u32(pos + 4);
>> +   if (!vminvq_u32(veorq_u32(pipes, index)))
>> +   return 1;
>> +
>> +   return 0;
>> +}
>> +
>>  #else
>>
>>  static inline int
>> --
>> 2.7.4
>>
>



Re: [dpdk-dev] [PATCH] sched: enable neon optimizations

2017-04-27 Thread Sekhar, Ashwin
On Friday 28 April 2017 11:07 AM, Jianbo Liu wrote:
> On 28 April 2017 at 13:27, Sekhar, Ashwin  wrote:
>> On Friday 28 April 2017 09:20 AM, Jianbo Liu wrote:
>>> On 27 April 2017 at 21:00, Ashwin Sekhar T K
>>>  wrote:
>>>> * Enabled CONFIG_RTE_SCHED_VECTOR for arm64
>>>> * Verified the changes with sched_autotest unit test case
>>>>
>>>> Signed-off-by: Ashwin Sekhar T K 
>>>> ---
>>>>  config/defconfig_arm64-armv8a-linuxapp-gcc |  2 +-
>>>>  lib/librte_sched/rte_sched.c   | 22 ++
>>>>  2 files changed, 23 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc 
>>>> b/config/defconfig_arm64-armv8a-linuxapp-gcc
>>>> index 65888ce..021044a 100644
>>>> --- a/config/defconfig_arm64-armv8a-linuxapp-gcc
>>>> +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
>>>> @@ -48,4 +48,4 @@ CONFIG_RTE_LIBRTE_FM10K_PMD=n
>>>>  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
>>>>  CONFIG_RTE_LIBRTE_AVP_PMD=n
>>>>
>>>> -CONFIG_RTE_SCHED_VECTOR=n
>>>> +CONFIG_RTE_SCHED_VECTOR=y
>>>
>>> It's enough to remove this line only, I don't think you must enable it
>>> explicitly in the armv8a common config.
>>>
>> Tried removing this line from armv8a config. But in that case
>> RTE_SCHED_VECTOR doesn't get defined.
>> ./config/common_base has "CONFIG_RTE_SCHED_VECTOR=n" as the default
>> setting. So enabling explicitly is required.
>>
>
> I know it must be enabled to use your enhancement. But I meant to keep
> the same as common_base (or other default configs) if there is no
> other strange reason to enable it.
>
> Thanks!
> Jianbo
>
Got it. Will update the patch removing CONFIG_RTE_SCHED_VECTOR=n from 
defconfig_arm64-armv8a-linuxapp-gcc and resend.
Thanks
Ashwin



Re: [dpdk-dev] [PATCH 1/2] net: add arm64 neon version of CRC compute APIs

2017-04-28 Thread Sekhar, Ashwin
Hi Jan,
Thanks for the comments. Please see my responses inline.

On Friday 28 April 2017 03:25 PM, Jan Viktorin wrote:
> Hello Ashwin Sekhar,
>
> some comments below...
>
> On Thu, 27 Apr 2017 07:10:20 -0700
> Ashwin Sekhar T K  wrote:
>
>> * Added CRC compute APIs for arm64 utilizing the pmull capability
>> * Added new file net_crc_neon.h to hold the arm64 pmull CRC
>>   implementation
>> * Added crypto capability in compilation of generic armv8 and
>>   thunderx targets
>> * pmull CRC version is used only after checking the pmull capability
>>   at runtime
>> * Verified the changes with crc_autotest unit test case
>>
>> Signed-off-by: Ashwin Sekhar T K 
>> ---
>>  MAINTAINERS   |   1 +
>>  lib/librte_eal/common/include/arch/arm/rte_vect.h |  45 +++
>>  lib/librte_net/net_crc_neon.h | 357 
>> ++
>>  lib/librte_net/rte_net_crc.c  |  32 +-
>>  lib/librte_net/rte_net_crc.h  |   2 +
>>  mk/machine/armv8a/rte.vars.mk |   2 +-
>>  mk/machine/thunderx/rte.vars.mk   |   2 +-
>>  mk/rte.cpuflags.mk|   3 +
>>  mk/toolchain/gcc/rte.toolchain-compat.mk  |   1 +
>>  9 files changed, 438 insertions(+), 7 deletions(-)
>>  create mode 100644 lib/librte_net/net_crc_neon.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 576d60a..283743e 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -149,6 +149,7 @@ F: lib/librte_lpm/rte_lpm_neon.h
>>  F: lib/librte_hash/rte*_arm64.h
>>  F: lib/librte_efd/rte*_arm64.h
>>  F: lib/librte_table/rte*_arm64.h
>> +F: lib/librte_net/net_crc_neon.h
>>  F: drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
>>  F: drivers/net/i40e/i40e_rxtx_vec_neon.c
>>  F: drivers/net/virtio/virtio_rxtx_simple_neon.c
>> diff --git a/lib/librte_eal/common/include/arch/arm/rte_vect.h 
>> b/lib/librte_eal/common/include/arch/arm/rte_vect.h
>> index 4107c99..9a3dfdf 100644
>> --- a/lib/librte_eal/common/include/arch/arm/rte_vect.h
>> +++ b/lib/librte_eal/common/include/arch/arm/rte_vect.h
>> @@ -34,9 +34,18 @@
>>  #define _RTE_VECT_ARM_H_
>>
>>  #include 
>> +#include 
>> +
>>  #include "generic/rte_vect.h"
>>  #include "arm_neon.h"
>>
>> +#ifdef GCC_VERSION
>> +#undef GCC_VERSION
>> +#endif
>
> Why are you doing this? What is wrong with GCC_VERSION?
>
This is just to avoid multiple definitions of GCC_VERSION. Not required 
really. Can be removed.

>> +
>> +#define GCC_VERSION (__GNUC__ * 1 + __GNUC_MINOR__ * 100 \
>> ++ __GNUC_PATCHLEVEL__)
>> +
>
> If you have any specific requirements for testing GCC version then it
> should be done in a more elegant way. However, I do not understand your
> intention.
>
GCC version is checked so as to define wrappers for some neon intrinsics 
which are not available in GCC versions < 7.

Similar checks of GCC_VERSION done in ./lib/librte_table/rte_lru.h. 
Followed the same template here.
Also, this is the suggested approach by GCC. Please see below link.
https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html

Please advise on more elegant ways of gcc version detection.
>>  #ifdef __cplusplus
>>  extern "C" {
>>  #endif
>> @@ -78,6 +87,42 @@ vqtbl1q_u8(uint8x16_t a, uint8x16_t b)
>>  }
>>  #endif
>>
>> +#if (GCC_VERSION < 7)
>
> Is this code is gcc-specific? In such case there should be check for
> GCC compiler. We can also build e.g. by clang.
>
Yes, the code is GCC specific. Currently there are only GCC targets  for 
arm and arm64. So no checks are done for other types of compilers.
>> +/*
>> + * NEON intrinsic vreinterpretq_u64_p128() is not supported
>> + * in GCC versions < 7
>> + */
>
> I'd be positive about those comments, like:
>
> NEON intrinsic vreinterpretq_u64_p128() is supported since GCC 7.
>
Thanks. Will make the comments positive.

>> +static inline uint64x2_t
>> +vreinterpretq_u64_p128(poly128_t x)
>> +{
>> +return (uint64x2_t)x;
>> +}
>> +
>> +/*
>> + * NEON intrinsic vreinterpretq_p64_u64() is not supported
>> + * in GCC versions < 7
>> + */
>> +static inline poly64x2_t
>> +vreinterpretq_p64_u64(uint64x2_t x)
>> +{
>> +return (poly64x2_t)x;
>> +}
>> +
>> +/*
>> + * NEON intrinsic vgetq_lane_p64() is not supported
>> + * in GCC versions < 7
>> + */
>> +static inline poly64_t
>> +vgetq_lane_p64(poly64x2_t x, const int lane)
>> +{
>> +assert(lane >= 0 && lane <= 1);
>> +
>> +poly64_t *p = (poly64_t *)&x;
>> +
>> +return p[lane];
>> +}
>> +#endif
>> +
>>  #ifdef __cplusplus
>>  }
>>  #endif
>> diff --git a/lib/librte_net/net_crc_neon.h b/lib/librte_net/net_crc_neon.h
>
> [...]
>
>>  # CPU_LDFLAGS =
>>  # CPU_ASFLAGS =
>>
>> -MACHINE_CFLAGS += -march=armv8-a+crc
>> +MACHINE_CFLAGS += -march=armv8-a+crc+crypto
>> diff --git a/mk/machine/thunderx/rte.vars.mk 
>> b/mk/machine/thunderx/rte.vars.mk
>> index ad5a379..6784105 100644
>> --- a/mk/machine/thunderx/rte.vars.mk
>> +++ b/mk/machine/thunderx/rte.vars.mk
>> @@ -55,4 +

Re: [dpdk-dev] [PATCH v2] efd: support lookup using neon intrinsics

2017-04-28 Thread Sekhar, Ashwin
On Friday 28 April 2017 03:36 PM, Jianbo Liu wrote:
> On 27 April 2017 at 20:44, Ashwin Sekhar T K
>  wrote:
>> * Added file lib/librte_efd/rte_efd_arm64.h to hold arm64
>>   specific definitions
>> * Verified the changes with efd_autotest unit test case
>>
>> Signed-off-by: Ashwin Sekhar T K 
>> ---
>> v2:
>> * Slightly modified the content of the commit message body
>> * Added prefix [dpdk-dev] to the email subject line
>>
>>  MAINTAINERS|  1 +
>>  lib/librte_efd/rte_efd.c   | 22 
>>  lib/librte_efd/rte_efd_arm64.h | 76 
>> ++
>>  3 files changed, 99 insertions(+)
>>  create mode 100644 lib/librte_efd/rte_efd_arm64.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index b6495d2..7d708ae 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -147,6 +147,7 @@ F: lib/librte_eal/common/include/arch/arm/*_64.h
>>  F: lib/librte_acl/acl_run_neon.*
>>  F: lib/librte_lpm/rte_lpm_neon.h
>>  F: lib/librte_hash/rte*_arm64.h
>> +F: lib/librte_efd/rte*_arm64.h
>>  F: drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
>>  F: drivers/net/i40e/i40e_rxtx_vec_neon.c
>>  F: drivers/net/virtio/virtio_rxtx_simple_neon.c
>> diff --git a/lib/librte_efd/rte_efd.c b/lib/librte_efd/rte_efd.c
>> index f601d62..4d9a088 100644
>> --- a/lib/librte_efd/rte_efd.c
>> +++ b/lib/librte_efd/rte_efd.c
>> @@ -53,6 +53,8 @@
>>  #include "rte_efd.h"
>>  #if defined(RTE_ARCH_X86)
>>  #include "rte_efd_x86.h"
>> +#elif defined(RTE_ARCH_ARM64)
>> +#include "rte_efd_arm64.h"
>>  #endif
>>
>>  #define EFD_KEY(key_idx, table) (table->keys + ((key_idx) * table->key_len))
>> @@ -103,6 +105,7 @@ allocated memory
>>  enum efd_lookup_internal_function {
>> EFD_LOOKUP_SCALAR = 0,
>> EFD_LOOKUP_AVX2,
>> +   EFD_LOOKUP_NEON,
>
> Should it be included in "if defined(RTE_ARCH_ARM64)"?
>
The enum can be wrapped under "if defined(RTE_ARCH_ARM64)" with no 
issues, as all its usages are also under "if defined(RTE_ARCH_ARM64)".
I followed EFD_LOOKUP_AVX2 and defined EFD_LOOKUP_NEON on the same lines.
Please advise on whether this change is to be made. Will follow your advice.
>> EFD_LOOKUP_NUM
>>  };
>>
>> @@ -674,6 +677,16 @@ rte_efd_create(const char *name, uint32_t 
>> max_num_rules, uint32_t key_len,
>> table->lookup_fn = EFD_LOOKUP_AVX2;
>> else
>>  #endif
>> +#if defined(RTE_ARCH_ARM64)
>> +   /*
>> +* For less than or equal to 16 bits, scalar function performs better
>> +* than vectorised version
>> +*/
>> +   if (RTE_EFD_VALUE_NUM_BITS > 16 &&
>> +   rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
>> +   table->lookup_fn = EFD_LOOKUP_NEON;
>> +   else
>> +#endif
>> table->lookup_fn = EFD_LOOKUP_SCALAR;
>>
>> /*
>> @@ -1271,6 +1284,15 @@ efd_lookup_internal(const struct 
>> efd_online_group_entry * const group,
>> group->lookup_table,
>> hash_val_a,
>> hash_val_b);
>> +   break;
>> +#endif
>> +#if defined(RTE_ARCH_ARM64)
>> +   case EFD_LOOKUP_NEON:
>> +   return efd_lookup_internal_neon(group->hash_idx,
>> +   group->lookup_table,
>> +   hash_val_a,
>> +   hash_val_b);
>> +   break;
>>  #endif
>> case EFD_LOOKUP_SCALAR:
>> /* Fall-through */
>> diff --git a/lib/librte_efd/rte_efd_arm64.h b/lib/librte_efd/rte_efd_arm64.h
>> new file mode 100644
>> index 000..cc93411
>> --- /dev/null
>> +++ b/lib/librte_efd/rte_efd_arm64.h
>> @@ -0,0 +1,76 @@
>> +/*
>> + *   BSD LICENSE
>> + *
>> + *   Copyright (C) Cavium networks Ltd. 2017.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + * * Redistributions of source code must retain the above copyright
>> + *   notice, this list of conditions and the following disclaimer.
>> + * * Redistributions in binary form must reproduce the above copyright
>> + *   notice, this list of conditions and the following disclaimer in
>> + *   the documentation and/or other materials provided with the
>> + *   distribution.
>> + * * Neither the name of Cavium networks nor the names of its
>> + *   contributors may be used to endorse or promote products derived
>> + *   from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT

Re: [dpdk-dev] [PATCH v3] efd: support lookup using neon intrinsics

2017-05-02 Thread Sekhar, Ashwin
On Tue, 2017-05-02 at 15:59 +0800, Jianbo Liu wrote:
> On 2 May 2017 at 14:41, Jerin Jacob 
> wrote:
> > 
> > -Original Message-
> > > 
> > > Date: Mon,  1 May 2017 22:59:53 -0700
> > > From: Ashwin Sekhar T K 
> > > To: byron.mar...@intel.com, pablo.de.lara.gua...@intel.com,
> > >  jerin.ja...@caviumnetworks.com, jianbo@linaro.org
> > > Cc: dev@dpdk.org, Ashwin Sekhar T K  > > .com>
> > > Subject: [dpdk-dev] [PATCH v3] efd: support lookup using neon
> > > intrinsics
> > > X-Mailer: git-send-email 2.13.0.rc1
> > > 
> > > * Added file lib/librte_efd/rte_efd_arm64.h to hold arm64
> > >   specific definitions
> > > * Verified the changes with efd_autotest unit test case
> > > 
> > > Signed-off-by: Ashwin Sekhar T K  > > m>
> > > ---
> > > v2:
> > > * Slightly modified the content of the commit message body
> > > * Added prefix [dpdk-dev] to the email subject line
> > > 
> > > v3:
> > > * Moved enum 'EFD_LOOKUP_NEON' under '#if
> > > defined(RTE_ARCH_ARM64)'
> > > 
> > >  MAINTAINERS|  1 +
> > >  lib/librte_efd/rte_efd.c   | 24 +
> > >  lib/librte_efd/rte_efd_arm64.h | 76
> > > ++
> > >  3 files changed, 101 insertions(+)
> > >  create mode 100644 lib/librte_efd/rte_efd_arm64.h
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index b6495d2..7d708ae 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -147,6 +147,7 @@ F:
> > > lib/librte_eal/common/include/arch/arm/*_64.h
> > >  F: lib/librte_acl/acl_run_neon.*
> > >  F: lib/librte_lpm/rte_lpm_neon.h
> > >  F: lib/librte_hash/rte*_arm64.h
> > > +F: lib/librte_efd/rte*_arm64.h
> > >  F: drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
> > >  F: drivers/net/i40e/i40e_rxtx_vec_neon.c
> > >  F: drivers/net/virtio/virtio_rxtx_simple_neon.c
> > > diff --git a/lib/librte_efd/rte_efd.c b/lib/librte_efd/rte_efd.c
> > > index f601d62..5cc6283 100644
> > > --- a/lib/librte_efd/rte_efd.c
> > > +++ b/lib/librte_efd/rte_efd.c
> > > @@ -53,6 +53,8 @@
> > >  #include "rte_efd.h"
> > >  #if defined(RTE_ARCH_X86)
> > >  #include "rte_efd_x86.h"
> > > +#elif defined(RTE_ARCH_ARM64)
> > > +#include "rte_efd_arm64.h"
> > >  #endif
> > > 
> > >  #define EFD_KEY(key_idx, table) (table->keys + ((key_idx) *
> > > table->key_len))
> > > @@ -103,6 +105,9 @@ allocated memory
> > >  enum efd_lookup_internal_function {
> > >   EFD_LOOKUP_SCALAR = 0,
> > >   EFD_LOOKUP_AVX2,
> > > +#if defined(RTE_ARCH_ARM64)
> > > + EFD_LOOKUP_NEON,
> > > +#endif
> > I think, we can remove this ifdef to
> > - Make code looks clean
> > - In future, in some case a new enum value gets added then the
> > value
> > will be different for each build.
> > 
> But the enum items are same for each ARCH.
> Besides, the ifdef could be considered as explanation to that enum.
> If
> someone knows nothing about arm/neon, he can ignore it totally after
> see the ifdef.
> 
Have added the #if defined on your advice, but in my opinion also its
better not to have "#if defined" for enums. Because the same enum can
take different values for different builds.

For eg: If somebody adds an EFD_LOOKUP_AVX512 after EFD_LOOKUP_NEON
here, it will take value 2 for x86 builds but value 3 for arm64 builds.
> > 
> > Any valid point to keep under RTE_ARCH_ARM64?
> > 
> > > 
> > >   EFD_LOOKUP_NUM
> > >  };

Re: [dpdk-dev] [PATCH 2/5] examples/l3fwd: rename l3fwd_em_sse.h to l3fwd_em_single.h

2017-05-02 Thread Sekhar, Ashwin
On Tue, 2017-05-02 at 15:14 +0800, Jianbo Liu wrote:
> The l3fwd_em_sse.h is enabled by NO_HASH_LOOKUP_MULTI.
> Renaming it because it's only for single hash lookup,
> and doesn't include any x86 SSE instructions.
> 
> Signed-off-by: Jianbo Liu 
> ---
>  examples/l3fwd/l3fwd_em.c| 2 +-
>  examples/l3fwd/{l3fwd_em_sse.h => l3fwd_em_single.h} | 0
>  2 files changed, 1 insertion(+), 1 deletion(-)
>  rename examples/l3fwd/{l3fwd_em_sse.h => l3fwd_em_single.h} (100%)
> 
> diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
> index 939a16d..cccf797 100644
> --- a/examples/l3fwd/l3fwd_em.c
> +++ b/examples/l3fwd/l3fwd_em.c
> @@ -330,7 +330,7 @@ struct ipv6_l3fwd_em_route {
>  
>  #if defined(__SSE4_1__)
>  #if defined(NO_HASH_MULTI_LOOKUP)
> -#include "l3fwd_em_sse.h"
> +#include "l3fwd_em_single.h"
>  #else
>  #include "l3fwd_em_hlm.h"
>  #endif
> diff --git a/examples/l3fwd/l3fwd_em_sse.h
> b/examples/l3fwd/l3fwd_em_single.h
> similarity index 100%
> rename from examples/l3fwd/l3fwd_em_sse.h
> rename to examples/l3fwd/l3fwd_em_single.h

Shouldn't the guard __L3FWD_EM_SSE_H__ be update
to __L3FWD_EM_SINGLE_H__ to maintain consistency ?

Thanks and Regards,
Ashwin

Re: [dpdk-dev] [PATCH 5/5] examples/l3fwd: add neon support for l3fwd

2017-05-02 Thread Sekhar, Ashwin
Hi,

Please find comments inline.

On Tue, 2017-05-02 at 15:14 +0800, Jianbo Liu wrote:
> Use ARM NEON intrinsics to accelerate l3 fowarding.
> 
> Signed-off-by: Jianbo Liu 
> ---
>  examples/l3fwd/l3fwd.h |   4 -
>  examples/l3fwd/l3fwd_em.c  |   4 +-
>  examples/l3fwd/l3fwd_em_hlm.h  |   5 +
>  examples/l3fwd/l3fwd_em_hlm_neon.h |  74 +++
>  examples/l3fwd/l3fwd_em_single.h   |   4 +
>  examples/l3fwd/l3fwd_lpm.c |   4 +-
>  examples/l3fwd/l3fwd_lpm_neon.h| 157 ++
>  examples/l3fwd/l3fwd_neon.h| 259
> +
>  8 files changed, 504 insertions(+), 7 deletions(-)
>  create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
>  create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
>  create mode 100644 examples/l3fwd/l3fwd_neon.h
> 
> diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
> index 011ba14..c45589a 100644
> --- a/examples/l3fwd/l3fwd.h
> +++ b/examples/l3fwd/l3fwd.h
> @@ -40,10 +40,6 @@
>  
>  #define RTE_LOGTYPE_L3FWD RTE_LOGTYPE_USER1
>  
> -#if !defined(NO_HASH_MULTI_LOOKUP) &&
> defined(RTE_MACHINE_CPUFLAG_NEON)
> -#define NO_HASH_MULTI_LOOKUP 1
> -#endif
> -
>  #define MAX_PKT_BURST 32
>  #define BURST_TX_DRAIN_US 100 /* TX drain every ~100us */
>  
> diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
> index cccf797..ac1e2e0 100644
> --- a/examples/l3fwd/l3fwd_em.c
> +++ b/examples/l3fwd/l3fwd_em.c
> @@ -328,7 +328,7 @@ struct ipv6_l3fwd_em_route {
>   return (uint8_t)((ret < 0) ? portid :
> ipv6_l3fwd_out_if[ret]);
>  }
>  
> -#if defined(__SSE4_1__)
> +#if defined(__SSE4_1__) || defined(RTE_MACHINE_CPUFLAG_NEON)
>  #if defined(NO_HASH_MULTI_LOOKUP)
>  #include "l3fwd_em_single.h"
>  #else
> @@ -709,7 +709,7 @@ struct ipv6_l3fwd_em_route {
>   if (nb_rx == 0)
>   continue;
>  
> -#if defined(__SSE4_1__)
> +#if defined(__SSE4_1__) || defined(RTE_MACHINE_CPUFLAG_NEON)
>   l3fwd_em_send_packets(nb_rx, pkts_burst,
>   portid,
> qconf);
>  #else
> diff --git a/examples/l3fwd/l3fwd_em_hlm.h
> b/examples/l3fwd/l3fwd_em_hlm.h
> index 636dea4..3329c1a 100644
> --- a/examples/l3fwd/l3fwd_em_hlm.h
> +++ b/examples/l3fwd/l3fwd_em_hlm.h
> @@ -35,8 +35,13 @@
>  #ifndef __L3FWD_EM_HLM_H__
>  #define __L3FWD_EM_HLM_H__
>  
> +#if defined(__SSE4_1__)
>  #include "l3fwd_sse.h"
>  #include "l3fwd_em_hlm_sse.h"
> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> +#include "l3fwd_neon.h"
> +#include "l3fwd_em_hlm_neon.h"
> +#endif
>  
>  static inline __attribute__((always_inline)) void
>  em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf
> *m[8],
> diff --git a/examples/l3fwd/l3fwd_em_hlm_neon.h
> b/examples/l3fwd/l3fwd_em_hlm_neon.h
> new file mode 100644
> index 000..dae1acf
> --- /dev/null
> +++ b/examples/l3fwd/l3fwd_em_hlm_neon.h
> @@ -0,0 +1,74 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2016 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2017, Linaro Limited
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or
> without
> + *   modification, are permitted provided that the following
> conditions
> + *   are met:
> + *
> + * * Redistributions of source code must retain the above
> copyright
> + *   notice, this list of conditions and the following
> disclaimer.
> + * * Redistributions in binary form must reproduce the above
> copyright
> + *   notice, this list of conditions and the following
> disclaimer in
> + *   the documentation and/or other materials provided with the
> + *   distribution.
> + * * Neither the name of Intel Corporation nor the names of its
> + *   contributors may be used to endorse or promote products
> derived
> + *   from this software without specific prior written
> permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
> NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
> NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
> OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
> ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
> THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> DAMAGE.
> + */
> +
> +#ifndef __L3FWD_EM_HLM_NEON_H__
> +#define __L3FWD_EM_HLM_NEON_H__
> +
> +#include 
> +
> +static inline void
> +get_ipv4_5tuple(struct rte_mbuf *m0, int32x4_t mask0,
>

Re: [dpdk-dev] [PATCH 5/5] examples/l3fwd: add neon support for l3fwd

2017-05-02 Thread Sekhar, Ashwin
Hi Jianbo,

I tested your neon changes on thunderx. I am seeing a performance
regression of ~10% for LPM case and ~20% for EM case with your changes.
Did you see improvement on any arm64 platform with these changes. If
yes, how much was the improvement?

FYI, I had also tried vectorizing the l3fwd app with neon. Few of the
optimizations that I can suggest that helped in my case.

* Packet data prefetch is missing in the x86 sse version compared to
the scalar version (l3fwd_lpm_send_packets vs
l3fwd_lpm_no_opt_send_packets) . I couldn't understand why this was not
done in x86. But adding the prefetch was improving performance for
thunderx.

* Offsets to some packet elements like eth_hdr, ip header, packet type
etc. are recalculated in different functions. Calculating them once,
caching them and passing them directly to different functions was
improving performance.

* There are 3 different loops in l3fwd_lpm_send_packets where we
iterate over the packets. One each for processx4_step1 and
processx4_step2 and one in send_packets_multi. Unifying these loops
were also helping.

Thanks and Regards
Ashwin



Re: [dpdk-dev] [PATCH 5/5] examples/l3fwd: add neon support for l3fwd

2017-05-04 Thread Sekhar, Ashwin
On Thu, 2017-05-04 at 16:42 +0800, Jianbo Liu wrote:
> Hi Ashwin,
> 
> On 3 May 2017 at 13:24, Jianbo Liu  wrote:
> > 
> > Hi Ashwin,
> > 
> > On 2 May 2017 at 19:47, Sekhar, Ashwin 
> > wrote:
> > > 
> > > Hi Jianbo,
> > > 
> > > I tested your neon changes on thunderx. I am seeing a performance
> > > regression of ~10% for LPM case and ~20% for EM case with your
> > > changes.
> > > Did you see improvement on any arm64 platform with these changes.
> > > If
> > > yes, how much was the improvement?
> > Thanks for your reviewing and testing.
> > For some reason, I have not done much with the performance testing.
> > I'll send a new version later after tuning the performance.
> > 
> Can you tell me how did you test?
Built with following commands.
make config T=arm64-thunderx-linuxapp-gcc
make -j32

Tested LPM with
sudo ./examples/l3fwd/build/l3fwd -l 9,10  --master-lcore 9  -- -p 0x1 
--config="(0,0,10)"

Tested EM with
sudo ./examples/l3fwd/build/l3fwd -l 9,10  --master-lcore 9  -- -p 0x1 
--config="(0,0,10)" -E

> My testing shows that EM case is much better, while LPM is almost the
> same as before.
Could you please tell on which arm64 processor/platform you tested.
Also how much was the percentage increase in performance for EM ?

> Thanks!
> Jianbo

Re: [dpdk-dev] [PATCH 5/5] examples/l3fwd: add neon support for l3fwd

2017-05-09 Thread Sekhar, Ashwin
On Fri, 2017-05-05 at 13:43 +0800, Jianbo Liu wrote:
> On 5 May 2017 at 12:24, Sekhar, Ashwin 
> wrote:
> > 
> > On Thu, 2017-05-04 at 16:42 +0800, Jianbo Liu wrote:
> > > 
> > > Hi Ashwin,
> > > 
> > > On 3 May 2017 at 13:24, Jianbo Liu  wrote:
> > > > 
> > > > 
> > > > Hi Ashwin,
> > > > 
> > > > On 2 May 2017 at 19:47, Sekhar, Ashwin  > > > m>
> > > > wrote:
> > > > > 
> > > > > 
> > > > > Hi Jianbo,
> > > > > 
> > > > > I tested your neon changes on thunderx. I am seeing a
> > > > > performance
> > > > > regression of ~10% for LPM case and ~20% for EM case with
> > > > > your
> > > > > changes.
> > > > > Did you see improvement on any arm64 platform with these
> > > > > changes.
> > > > > If
> > > > > yes, how much was the improvement?
> > > > Thanks for your reviewing and testing.
> > > > For some reason, I have not done much with the performance
> > > > testing.
> > > > I'll send a new version later after tuning the performance.
> > > > 
> > > Can you tell me how did you test?
> > Built with following commands.
> > make config T=arm64-thunderx-linuxapp-gcc
> > make -j32
> > 
> > Tested LPM with
> > sudo ./examples/l3fwd/build/l3fwd -l 9,10  --master-lcore 9  -- -p
> > 0x1 --config="(0,0,10)"
> > 
> > Tested EM with
> > sudo ./examples/l3fwd/build/l3fwd -l 9,10  --master-lcore 9  -- -p
> > 0x1 --config="(0,0,10)" -E
> > 
> Only one port? What's the network topology, and lpm/em rules? How did
> you stress traffic...?
port - 1 topology: DUT connected back to back to traffic generator.

We are using the default rules in the C code. flow generation is:
src.ip.min 192.168.18.1
src.ip.max 192.168.18.90
src.ip.inc 1

Also, Please let us know the topology that you are using.
> 
> > 
> > > 
> > > My testing shows that EM case is much better, while LPM is almost
> > > the
> > > same as before.
> > Could you please tell on which arm64 processor/platform you tested.
> > Also how much was the percentage increase in performance for EM ?
> > 
> I'm sorry I can't tell you what's arm64 platform I tested on. But I
> can get a ThunderX, and replicate your testing environment if you can
> tell me more...
Thanks.
> 
> Thanks!
> Jianbo

Re: [dpdk-dev] [PATCH v2 7/7] examples/l3fwd: change the guard micro name for header file

2017-05-10 Thread Sekhar, Ashwin
In commit message:
s/micro/macro/

On Wed, 2017-05-10 at 10:30 +0800, Jianbo Liu wrote:
> As l3fwd_em_sse.h is renamed to l3fwd_em_sequential.h, change the
> macro
> to __L3FWD_EM_SEQUENTIAL_H__ to maintain consistency.
> 
> Signed-off-by: Jianbo Liu 
> ---
>  examples/l3fwd/l3fwd_em_sequential.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/examples/l3fwd/l3fwd_em_sequential.h
> b/examples/l3fwd/l3fwd_em_sequential.h
> index c3df473..63c5c12 100644
> --- a/examples/l3fwd/l3fwd_em_sequential.h
> +++ b/examples/l3fwd/l3fwd_em_sequential.h
> @@ -31,8 +31,8 @@
>   *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> DAMAGE.
>   */
>  
> -#ifndef __L3FWD_EM_SSE_H__
> -#define __L3FWD_EM_SSE_H__
> +#ifndef __L3FWD_EM_SEQUENTIAL_H__
> +#define __L3FWD_EM_SEQUENTIAL_H__
>  
>  /**
>   * @file
> @@ -125,4 +125,4 @@ static inline __attribute__((always_inline))
> uint16_t
>  
>   send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
>  }
> -#endif /* __L3FWD_EM_SSE_H__ */
> +#endif /* __L3FWD_EM_SEQUENTIAL_H__ */

Re: [dpdk-dev] [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd

2017-05-10 Thread Sekhar, Ashwin
Hi Jianbo,

Thanks for version v2. Addition of the prefetch instructions is
definitely helping performance on ThunderX. But still performance is
slightly less than that of scalar.

I tried few small tweaks which helped improve performance on my
Thunderx setup. For details see comments inline.


On Wed, 2017-05-10 at 10:30 +0800, Jianbo Liu wrote:
> Use ARM NEON intrinsics to accelerate l3 fowarding.
> 
> Signed-off-by: Jianbo Liu 
> ---
>  examples/l3fwd/l3fwd_em.c|   4 +-
>  examples/l3fwd/l3fwd_em_hlm.h|  19 ++-
>  examples/l3fwd/l3fwd_em_hlm_neon.h   |  74 ++
>  examples/l3fwd/l3fwd_em_sequential.h |  20 ++-
>  examples/l3fwd/l3fwd_lpm.c   |   4 +-
>  examples/l3fwd/l3fwd_lpm_neon.h  | 165 ++
>  examples/l3fwd/l3fwd_neon.h  | 259
> +++
>  7 files changed, 539 insertions(+), 6 deletions(-)
>  create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
>  create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
>  create mode 100644 examples/l3fwd/l3fwd_neon.h
> 
> [...]
> diff --git a/examples/l3fwd/l3fwd_em_hlm.h
> b/examples/l3fwd/l3fwd_em_hlm.h
> index 636dea4..4ec600a 100644
> --- a/examples/l3fwd/l3fwd_em_hlm.h
> +++ b/examples/l3fwd/l3fwd_em_hlm.h
> @@ -35,8 +35,13 @@
>  #ifndef __L3FWD_EM_HLM_H__
>  #define __L3FWD_EM_HLM_H__
>  
> +#if defined(__SSE4_1__)
>  #include "l3fwd_sse.h"
>  #include "l3fwd_em_hlm_sse.h"
> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> +#include "l3fwd_neon.h"
> +#include "l3fwd_em_hlm_neon.h"
> +#endif
>  
>  static inline __attribute__((always_inline)) void
>  em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf
> *m[8],
> @@ -238,7 +243,7 @@ static inline __attribute__((always_inline))
> uint16_t
>  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>   uint8_t portid, struct lcore_conf *qconf)
>  {
> - int32_t j;
> + int32_t i, j, pos;
>   uint16_t dst_port[MAX_PKT_BURST];
>  
>   /*
> @@ -247,6 +252,12 @@ static inline __attribute__((always_inline))
> uint16_t
>    */
>   int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
>  
> + for (j = 0; j < 8 && j < nb_rx; j++) {
> + rte_prefetch0(pkts_burst[j]);
The above prefetch of rte_mbuf struct is unnecessary. With this we wont
see any performance improvement as the contents of rte_mbuf (buf_addr
and data_off) is used in right next instruction. Removing the above
prefetch and similar prefetches at multiple places was improving
performance on my ThunderX setup.

> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> +    struct ether_hdr *) + 
> 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1. In
process_packet in l3fwd_neon.h, eth_header is accessed.

> + }
> +
>   for (j = 0; j < n; j += 8) {
>  
>   uint32_t pkt_type =
> @@ -263,6 +274,12 @@ static inline __attribute__((always_inline))
> uint16_t
>   uint32_t tcp_or_udp = pkt_type &
>   (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
>  
> + for (i = 0, pos = j + 8; i < 8 && pos < nb_rx; i++, 
> pos++) {
> + rte_prefetch0(pkts_burst[pos]);
The above prefetch of rte_mbuf struct is unnecessary.

> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[po
> s],
> +    struct
> ether_hdr *) + 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1

> + }
> +
>   if (tcp_or_udp && (l3_type == RTE_PTYPE_L3_IPV4)) {
>  
>   em_get_dst_port_ipv4x8(qconf,
> &pkts_burst[j], portid,
> 
> [...]

> diff --git a/examples/l3fwd/l3fwd_em_sequential.h
> b/examples/l3fwd/l3fwd_em_sequential.h
> index c0a9725..c3df473 100644
> --- a/examples/l3fwd/l3fwd_em_sequential.h
> +++ b/examples/l3fwd/l3fwd_em_sequential.h
> @@ -43,7 +43,11 @@
>   * compilation time.
>   */
>  
> +#if defined(__SSE4_1__)
>  #include "l3fwd_sse.h"
> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> +#include "l3fwd_neon.h"
> +#endif
>  
>  static inline __attribute__((always_inline)) uint16_t
>  em_get_dst_port(const struct lcore_conf *qconf, struct rte_mbuf
> *pkt,
> @@ -101,11 +105,23 @@ static inline __attribute__((always_inline))
> uint16_t
>  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>   uint8_t portid, struct lcore_conf *qconf)
>  {
> - int32_t j;
> + int32_t i, j;
>   uint16_t dst_port[MAX_PKT_BURST];
>  
> - for (j = 0; j < nb_rx; j++)
> + if (nb_rx > 0) {
> + rte_prefetch0(pkts_burst[0]);
The above prefetch of rte_mbuf struct is unnecessary.

> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[0],
> +    struct ether_hdr *) +
> 1);
Better to prefetch at eth_hdr itself and not at eth_hdr + 1

> + }
> +
> + for (i = 1, j = 0; j < nb_rx; i++, j++) {
> + if (i < nb_rx) {
> +  

Re: [dpdk-dev] [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd

2017-05-10 Thread Sekhar, Ashwin
On Thu, 2017-05-11 at 11:16 +0800, Jianbo Liu wrote:
> Hi Ashwin,
> 
> On 10 May 2017 at 23:00, Sekhar, Ashwin 
> wrote:
> > 
> > Hi Jianbo,
> > 
> > Thanks for version v2. Addition of the prefetch instructions is
> > definitely helping performance on ThunderX. But still performance
> > is
> > slightly less than that of scalar.
> > 
> > I tried few small tweaks which helped improve performance on my
> > Thunderx setup. For details see comments inline.
> > 
> > 
> > On Wed, 2017-05-10 at 10:30 +0800, Jianbo Liu wrote:
> > > 
> > > Use ARM NEON intrinsics to accelerate l3 fowarding.
> > > 
> > > Signed-off-by: Jianbo Liu 
> > > ---
> > >  examples/l3fwd/l3fwd_em.c|   4 +-
> > >  examples/l3fwd/l3fwd_em_hlm.h|  19 ++-
> > >  examples/l3fwd/l3fwd_em_hlm_neon.h   |  74 ++
> > >  examples/l3fwd/l3fwd_em_sequential.h |  20 ++-
> > >  examples/l3fwd/l3fwd_lpm.c   |   4 +-
> > >  examples/l3fwd/l3fwd_lpm_neon.h  | 165
> > > ++
> > >  examples/l3fwd/l3fwd_neon.h  | 259
> > > +++
> > >  7 files changed, 539 insertions(+), 6 deletions(-)
> > >  create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
> > >  create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
> > >  create mode 100644 examples/l3fwd/l3fwd_neon.h
> > > 
> > > [...]
> > > diff --git a/examples/l3fwd/l3fwd_em_hlm.h
> > > b/examples/l3fwd/l3fwd_em_hlm.h
> > > index 636dea4..4ec600a 100644
> > > --- a/examples/l3fwd/l3fwd_em_hlm.h
> > > +++ b/examples/l3fwd/l3fwd_em_hlm.h
> > > @@ -35,8 +35,13 @@
> > >  #ifndef __L3FWD_EM_HLM_H__
> > >  #define __L3FWD_EM_HLM_H__
> > > 
> > > +#if defined(__SSE4_1__)
> > >  #include "l3fwd_sse.h"
> > >  #include "l3fwd_em_hlm_sse.h"
> > > +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> > > +#include "l3fwd_neon.h"
> > > +#include "l3fwd_em_hlm_neon.h"
> > > +#endif
> > > 
> > >  static inline __attribute__((always_inline)) void
> > >  em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf
> > > *m[8],
> > > @@ -238,7 +243,7 @@ static inline __attribute__((always_inline))
> > > uint16_t
> > >  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
> > >   uint8_t portid, struct lcore_conf *qconf)
> > >  {
> > > - int32_t j;
> > > + int32_t i, j, pos;
> > >   uint16_t dst_port[MAX_PKT_BURST];
> > > 
> > >   /*
> > > @@ -247,6 +252,12 @@ static inline __attribute__((always_inline))
> > > uint16_t
> > >    */
> > >   int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
> > > 
> > > + for (j = 0; j < 8 && j < nb_rx; j++) {
> > > + rte_prefetch0(pkts_burst[j]);
> > The above prefetch of rte_mbuf struct is unnecessary. With this we
> > wont
> > see any performance improvement as the contents of rte_mbuf
> > (buf_addr
> > and data_off) is used in right next instruction. Removing the above
> > prefetch and similar prefetches at multiple places was improving
> > performance on my ThunderX setup.
> Yes, will remove them.
> 
> > 
> > 
> > > 
> > > + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> > > +struct ether_hdr *)
> > > +
> > > 1);
> > Better to prefetch at eth_hdr itself and not at eth_hdr + 1. In
> > process_packet in l3fwd_neon.h, eth_header is accessed in
> > 
> But ip headers are used right in each 8/FWDSTEP loop.
> Since ip headers are accessed first, we should prefetch eth_hdr + 1
> first.
> After all nb_rx packets are handled in above small loop, their
> eth_header are then accessed in processx4_step3 over again.
> I'm not sure prefretching eth_hdr still works if we prefetch eth_hdr
> in first step,  as cache may be already filled with new data at that
> time.
> 
Okay. 
Also, I guess if the ethernet header and ip header falls in the same
cache line (which I think would be the case mostly as I hope the packet
data will be cache aligned), it doesn't make much of a  difference
whether you prefetch at ethernet header address or ip header address.
> > 
> > > 
> > > + }
> > > +
> > >   for (j = 0; j < n; j += 8) {
> > > 
> > >   uint32_t pkt_t

Re: [dpdk-dev] [PATCH v2 5/7] examples/l3fwd: add neon support for l3fwd

2017-05-10 Thread Sekhar, Ashwin

On Thu, 2017-05-11 at 04:14 +, Sekhar, Ashwin wrote:
...
> > > Combining all the above comments, I made some changes on top of
> > > your
> > > patch. These changes are giving 3-4% improvement over your
> > > version.
> > > 
> > > You may find the changes at
> > > https://gist.github.com/ashwinyes/34cbdd999784402c859c71613587faf
> > > c
> > > 
> > Is the correct in Line 103/104, you only process one packets in the
> > last FWDSTEP packets?
> Its doing processx4_* there. So its processing 4 packets.
> 
> > 
> > Actually, I don't like your change in l3fwd_lpm_send_packets,
> > making
> > the simple logic complicated. And I don't think it can help to
> > improve
> > performance. :-)
> Its not making it complicated. The number of lines of code may be
> higher by may be 10 lines, but the conditions of the loops are
> simplified which reduces the number of branch instructions and helps
> the processor to go through them faster.
> 
> If possible, please try it out on your machine.

Missed out one point.
Since 2 loops are form "for (i = 0; i < FWDSTEP; i++)" i.e. looping for
constant number of iterations, compiler will easily unroll them.

Thanks
Ashwin
> > 
> > 
> > > 
> > > 
> > > Please check it out and let me know your comments.
> > > 
> > > Thanks
> > > Ashwin

Re: [dpdk-dev] [PATCH 2/6] config: add clang support for armv8a linuxapp

2017-05-10 Thread Sekhar, Ashwin
On Thu, 2017-05-11 at 10:54 +0530, Jerin Jacob wrote:
> -Original Message-
> > 
> > Date: Wed, 10 May 2017 03:16:39 -0700
> > From: Ashwin Sekhar T K 
> > To: tho...@monjalon.net, jerin.ja...@caviumnetworks.com,
> >  maciej.cze...@caviumnetworks.com, vikto...@rehivetech.com,
> >  jianbo@linaro.org, bruce.richard...@intel.com,
> >  pablo.de.lara.gua...@intel.com, konstantin.anan...@intel.com
> > Cc: dev@dpdk.org, Ashwin Sekhar T K  > om>
> > Subject: [dpdk-dev] [PATCH 2/6] config: add clang support for
> > armv8a
> >  linuxapp
> > X-Mailer: git-send-email 2.13.0.rc1
> > 
> > Added new config arm64-armv8a-linuxapp-clang
> > 
> > Signed-off-by: Ashwin Sekhar T K 
> > ---
> >  config/defconfig_arm64-armv8a-linuxapp-clang | 56
> > 
> >  1 file changed, 56 insertions(+)
> >  create mode 100644 config/defconfig_arm64-armv8a-linuxapp-clang
> > 
> > diff --git a/config/defconfig_arm64-armv8a-linuxapp-clang
> > b/config/defconfig_arm64-armv8a-linuxapp-clang
> > +#include "common_linuxapp"
> > +
> > +CONFIG_RTE_MACHINE="armv8a"
> > +
> > +CONFIG_RTE_ARCH="arm64"
> > +CONFIG_RTE_ARCH_ARM64=y
> > +CONFIG_RTE_ARCH_64=y
> > +
> > +CONFIG_RTE_FORCE_INTRINSICS=y
> > +
> > +CONFIG_RTE_TOOLCHAIN="clang"
> > +CONFIG_RTE_TOOLCHAIN_CLANG=y
> > +
> > +# Maximum available cache line size in arm64 implementations.
> > +# Setting to maximum available cache line size in generic config
> > +# to address minimum DMA alignment across all arm64
> > implementations.
> > +CONFIG_RTE_CACHE_LINE_SIZE=128
> > +
> > +CONFIG_RTE_EAL_IGB_UIO=n
> > +
> > +CONFIG_RTE_LIBRTE_FM10K_PMD=n
> > +CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> > +CONFIG_RTE_LIBRTE_AVP_PMD=n
> > +
> > +CONFIG_RTE_SCHED_VECTOR=n
> IMO, It is better to create common_armv8 config and let gcc and clang
> use that to avoid duplicating the symbols.
> 
For x86, this is the convention that is followed. There are separate
defconfigs for icc, gcc, clang with symbols duplicated. Do we need to
deviate from this convention for armv8a?
> > 

Re: [dpdk-dev] [PATCH v3 5/7] examples/l3fwd: add neon support for l3fwd

2017-05-11 Thread Sekhar, Ashwin
Hi Jianbo,

Thanks for v3. Small compilation error. See inline comment. Otherwise
it looks fine.

On Thu, 2017-05-11 at 17:25 +0800, Jianbo Liu wrote:
> Use ARM NEON intrinsics to accelerate l3 fowarding.
> 
> Signed-off-by: Jianbo Liu 
> ---
[...]

> +/**
> + * Process one packet:
> + * Update source and destination MAC addresses in the ethernet
> header.
> + * Perform RFC1812 checks and updates for IPV4 packets.
> + */
> +static inline void
> +process_packet(struct rte_mbuf *pkt, uint16_t *dst_port)
> +{
> + struct ether_hdr *eth_hdr;
> + uint32x4_t te, ve;
> +
> + eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> +
> + te = vld1q_u32((uint32_t *)eth_hdr);
> + ve = vreinterpretq_u32_s32(val_eth[dst_port[0]]);
> +
> +
> + rfc1812_process((struct ipv4_hdr *)(eth_hdr + 1), dst_port,
> + pkt->packet_type);
> +
> + ve = vcopyq_lane_u32(ve, 3, te, 3);
Compilation error here. This should be vcopyq_laneq_u32 (Extra q after
lane)
> + vst1q_u32((uint32_t *)eth_hdr, ve);
> +}
> +
[...]

Re: [dpdk-dev] [PATCH v3 5/7] examples/l3fwd: add neon support for l3fwd

2017-05-11 Thread Sekhar, Ashwin
On Thu, 2017-05-11 at 18:01 +0800, Jianbo Liu wrote:
> On 11 May 2017 at 17:49, Sekhar, Ashwin 
> wrote:
> > 
> > Hi Jianbo,
> > 
> > Thanks for v3. Small compilation error. See inline comment.
> > Otherwise
> > it looks fine.
> > 
> > On Thu, 2017-05-11 at 17:25 +0800, Jianbo Liu wrote:
> > > 
> > > Use ARM NEON intrinsics to accelerate l3 fowarding.
> > > 
> > > Signed-off-by: Jianbo Liu 
> > > ---
> > [...]
> > 
> > > 
> > > +/**
> > > + * Process one packet:
> > > + * Update source and destination MAC addresses in the ethernet
> > > header.
> > > + * Perform RFC1812 checks and updates for IPV4 packets.
> > > + */
> > > +static inline void
> > > +process_packet(struct rte_mbuf *pkt, uint16_t *dst_port)
> > > +{
> > > + struct ether_hdr *eth_hdr;
> > > + uint32x4_t te, ve;
> > > +
> > > + eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> > > +
> > > + te = vld1q_u32((uint32_t *)eth_hdr);
> > > + ve = vreinterpretq_u32_s32(val_eth[dst_port[0]]);
> > > +
> > > +
> > > + rfc1812_process((struct ipv4_hdr *)(eth_hdr + 1), dst_port,
> > > + pkt->packet_type);
> > > +
> > > + ve = vcopyq_lane_u32(ve, 3, te, 3);
> > Compilation error here. This should be vcopyq_laneq_u32 (Extra q
> > after
> > lane)
> No vcopyq_laneq_u32 in arm_neon.h of my environment. I thought it's a
> typo so I changed.
> 
> my gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC).
> What about yours?
> 
I am using GCC 7.1. No error with this version.

Also to cross check I tried the following versions as well which all
gave compilation errors.
 * gcc (Ubuntu/Linaro 4.9.2-10ubuntu13) 4.9.2
 * gcc 5.3
 * GCC 6.3

So looks like vcopyq_laneq_u32 is not supported in GCC versions < 7.
We can add a wrapper for the same in
./lib/librte_eal/common/include/arch/arm/rte_vect.h for gcc versions <
7.

But I think we can defer this activity. Because I have some other
patches, which moves around the definition of GCC_VERSION, and adds
wrappers for some unsupported instrinsics. Please see below.
http://dpdk.org/dev/patchwork/patch/24161/
http://dpdk.org/dev/patchwork/patch/24162/

I think we can add the vcopyq_laneq_u32 change and the wrapper for the
same after the above patches are merged.

And FYI - Documentation for the vcopyq_laneq_u32 can be found in below
document.
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0
073A_arm_neon_intrinsics_ref.pdf

> > 
> > > 
> > > + vst1q_u32((uint32_t *)eth_hdr, ve);
> > > +}
> > > +
> > [...]

Re: [dpdk-dev] [PATCH 0/6] add clang compilation support for armv8a linuxapp

2017-05-11 Thread Sekhar, Ashwin
The warning comes only when CFLAGS "-g -ggdb" are given and this seems
to be an issue with clang. I am seeing some related bugs on llvm
mailing list.
https://www.mail-archive.com/llvm-bugs@lists.llvm.org/msg05498.html
http://lists.llvm.org/pipermail/llvm-bugs/2016-July/048288.html

Even a simple c program with a TLS variable creates this warning.
For eg: 
- test.c ---
__thread int a;

int main() {
return 0;
}


$ gcc -g -ggdb test.c   
$ clang -g -ggdb test.c 
/usr/bin/ld: /tmp/test-50ceac.o(.debug_info+0x37): R_AARCH64_ABS64
used with TLS symbol a
$

Thanks
Ashwin

On Thu, 2017-05-11 at 11:29 +0530, Jerin Jacob wrote:
> -Original Message-
> > 
> > Date: Wed, 10 May 2017 03:16:37 -0700
> > From: Ashwin Sekhar T K 
> > To: tho...@monjalon.net, jerin.ja...@caviumnetworks.com,
> >  maciej.cze...@caviumnetworks.com, vikto...@rehivetech.com,
> >  jianbo@linaro.org, bruce.richard...@intel.com,
> >  pablo.de.lara.gua...@intel.com, konstantin.anan...@intel.com
> > Cc: dev@dpdk.org, Ashwin Sekhar T K  > om>
> > Subject: [dpdk-dev] [PATCH 0/6] add clang compilation support for
> > armv8a
> >  linuxapp
> > X-Mailer: git-send-email 2.13.0.rc1
> > 
> > This series of patches adds the clang compilation support for
> > armv8a linuxapp.
> > 
> > Patch 1 is basically for removing the usage of assembly directive
> > ".arch armv8-a+crc"
> > as this is not understood by clang. For removing these directives,
> > compilation of
> > armv8a crc32 support is made conditional and is only done for
> > machines which has
> > the crc extensions. Doing this avoids the need for having the
> > ".arch armv8-a+crc"
> > directives in the code.
> > 
> > Patch 2 adds the arm64-armv8a-linuxapp-clang defconfig.
> > 
> > Patch 3, 4, 5 and 6 are for fixing the compilation errors/warnings.
> There is warning on LD with clang. Could you please check it?
> 
>   INSTALL-MAP dpdk-pdump.map
>   LD testpmd
> /usr/bin/ld: build/lib/librte_eal.a(eal_thread.o)(.debug_info+0x37):
> R_AARCH64_ABS64 used with TLS symbol per_lcore__lcore_id
> /usr/bin/ld: build/lib/librte_eal.a(eal_thread.o)(.debug_info+0x54):
> R_AARCH64_ABS64 used with TLS symbol per_lcore__socket_id
> /usr/bin/ld: build/lib/librte_eal.a(eal_thread.o)(.debug_info+0x6a):
> R_AARCH64_ABS64 used with TLS symbol per_lcore__cpuset
> /usr/bin/ld: build/lib/librte_eal.a(eal_thread.o)(.debug_info+0xd2):
> R_AARCH64_ABS64 used with TLS symbol rte_gettid.per_lcore__thread_id
> /usr/bin/ld:
> build/lib/librte_eal.a(eal_interrupts.o)(.debug_info+0x38e):
> R_AARCH64_ABS64 used with TLS symbol per_lcore__epfd
> /usr/bin/ld:
> build/lib/librte_eal.a(eal_common_errno.o)(.debug_info+0x50):
> R_AARCH64_ABS64 used with TLS symbol rte_strerror.per_lcore_retval
> /usr/bin/ld:build/lib/librte_eal.a(eal_common_errno.o)(.debug_info+0x
> 91): R_AARCH64_ABS64 used with TLS symbol per_lcore__rte_errno
>   INSTALL-APP testpmd
> 
> $ clang -v
> Ubuntu clang version 3.6.0-2ubuntu1 (tags/RELEASE_360/final) (based
> on
> LLVM 3.6.0)
> Target: aarch64-unknown-linux-gnu
> Thread model: posix
> Found candidate GCC installation:
> /usr/bin/../lib/gcc/aarch64-linux-gnu/4.9
> Found candidate GCC installation:
> /usr/bin/../lib/gcc/aarch64-linux-gnu/4.9.2
> Found candidate GCC installation:
> /usr/bin/../lib/gcc/aarch64-linux-gnu/5.0.1
> Found candidate GCC installation: /usr/lib/gcc/aarch64-linux-gnu/4.9
> Found candidate GCC installation: /usr/lib/gcc/aarch64-linux-
> gnu/4.9.2
> Found candidate GCC installation: /usr/lib/gcc/aarch64-linux-
> gnu/5.0.1
> Selected GCC installation: /usr/bin/../lib/gcc/aarch64-linux-gnu/4.9
> Candidate multilib: .;@m64
> Selected multilib: .;@m64
> 
> 
> > 
> > 
> > Ashwin Sekhar T K (6):
> >   hash: compile armv8a CRC32 support conditionally
> >   config: add clang support for armv8a linuxapp
> >   net/thunderx: fix compile errors for armv8a clang
> >   acl: fix warning seen with armv8a clang
> >   eal/arm: fix warnings seen with armv8a clang
> >   eal: fix warning seen with armv8a clang
> > 
> >  config/defconfig_arm64-armv8a-linuxapp-clang   | 56
> > ++
> >  drivers/net/thunderx/base/nicvf_plat.h |  2 +-
> >  lib/librte_acl/Makefile|  5 +-
> >  .../common/include/arch/arm/rte_byteorder.h|  2 +-
> >  lib/librte_eal/linuxapp/eal/Makefile   |  4 ++
> >  lib/librte_hash/Makefile   |  2 +
> >  lib/librte_hash/rte_crc_arm64.h|  4 --
> >  lib/librte_hash/rte_hash_crc.h |  2 +-
> >  8 files changed, 69 insertions(+), 8 deletions(-)
> >  create mode 100644 config/defconfig_arm64-armv8a-linuxapp-clang
> > 

Re: [dpdk-dev] [PATCH v4 3/4] net: add arm64 neon version of CRC compute APIs

2017-05-12 Thread Sekhar, Ashwin
On Fri, 2017-05-12 at 13:51 +0800, Jianbo Liu wrote:
> On 9 May 2017 at 17:53, Ashwin Sekhar T K
>  wrote:
> > 
> > Added CRC compute APIs for arm64 utilizing the pmull
> > capability
> > 
> > Added new file net_crc_neon.h to hold the arm64 pmull
> > CRC implementation
> > 
> > Verified the changes with crc_autotest unit test case
> > 
> > Signed-off-by: Ashwin Sekhar T K 
> > ---
> > v2:
> > * Fixed merge conflict in MAINTAINERS
> > 
> > v3:
> > * Moved feature detection changes and GCC_VERSION definition
> >   changes to separate commit
> > * Replaced usage of assert() with RTE_ASSERT()
> > * Made the comments in rte_vect.h more positive in sense
> > 
> > v4:
> > * Rebased on top of latest commit
> > 
> >  MAINTAINERS   |   1 +
> >  lib/librte_eal/common/include/arch/arm/rte_vect.h |  28 ++
> >  lib/librte_net/net_crc_neon.h | 357
> > ++
> >  lib/librte_net/rte_net_crc.c  |  34 ++-
> >  lib/librte_net/rte_net_crc.h  |   2 +
> >  5 files changed, 416 insertions(+), 6 deletions(-)
> >  create mode 100644 lib/librte_net/net_crc_neon.h
> > 
> > 
...
> > +
> > +struct crc_pmull_ctx crc32_eth_pmull __rte_aligned(16);
> > +struct crc_pmull_ctx crc16_ccitt_pmull __rte_aligned(16);
> > +
> > +static inline uint8x16_t
> > +extract_vector(uint8x16_t v0, uint8x16_t v1, const int n)
> > +{
> > +   switch (n) {
> > +   case 0: return vextq_u8(v0, v1, 0);
> > +   case 1: return vextq_u8(v0, v1, 1);
> > +   case 2: return vextq_u8(v0, v1, 2);
> > +   case 3: return vextq_u8(v0, v1, 3);
> > +   case 4: return vextq_u8(v0, v1, 4);
> > +   case 5: return vextq_u8(v0, v1, 5);
> > +   case 6: return vextq_u8(v0, v1, 6);
> > +   case 7: return vextq_u8(v0, v1, 7);
> > +   case 8: return vextq_u8(v0, v1, 8);
> > +   case 9: return vextq_u8(v0, v1, 9);
> > +   case 10: return vextq_u8(v0, v1, 10);
> > +   case 11: return vextq_u8(v0, v1, 11);
> > +   case 12: return vextq_u8(v0, v1, 12);
> > +   case 13: return vextq_u8(v0, v1, 13);
> > +   case 14: return vextq_u8(v0, v1, 14);
> > +   case 15: return vextq_u8(v0, v1, 15);
> > +   }
> > +   return v1;
> > +}
> > +
> > +/**
> > + * Shifts right 128 bit register by specified number of bytes
> > + *
> > + * @param reg 128 bit value
> > + * @param num number of bytes to shift reg by (0-16)
> > + *
> > + * @return reg << (num * 8)
> > + */
> > +static inline uint64x2_t
> > +shift_bytes_right(uint64x2_t reg, const unsigned int num)
> > +{
> > +   /* Right Shift */
> > +   return vreinterpretq_u64_u8(extract_vector(
> > +   vreinterpretq_u8_u64(reg),
> > +   vdupq_n_u8(0),
> > +   num));
> > +}
> > +
> > +/**
> > + * Shifts left 128 bit register by specified number of bytes
> > + *
> > + * @param reg 128 bit value
> > + * @param num number of bytes to shift reg by (0-16)
> > + *
> > + * @return reg << (num * 8)
> > + */
> > +static inline uint64x2_t
> > +shift_bytes_left(uint64x2_t reg, const unsigned int num)
> > +{
> > +   /* Left Shift */
> > +   return vreinterpretq_u64_u8(extract_vector(
> > +   vdupq_n_u8(0),
> > +   vreinterpretq_u8_u64(reg),
> > +   16 - num));
> > +}
> > +
> Can you move shift_bytes_right/shift_bytes_left to rte_vect.h because
> they are common functions?
These are not really common functions. I dont think it will have a
wider usage as its shifting by bytes and not by bits.

In x86 case also, xmm_shift_left is not made a common function.

Moreover, I have not tested the behaviour of these functions when the
shift amt is (< 0) or (> 16) as these cases will never arise in the CRC
code.

Thanks
Ashwin

Re: [dpdk-dev] [PATCH v4 3/4] net: add arm64 neon version of CRC compute APIs

2017-05-12 Thread Sekhar, Ashwin
On Fri, 2017-05-12 at 16:49 +0800, Jianbo Liu wrote:
> On 12 May 2017 at 15:25, Sekhar, Ashwin 
> wrote:
> > 
> > On Fri, 2017-05-12 at 13:51 +0800, Jianbo Liu wrote:
> > > 
> > > On 9 May 2017 at 17:53, Ashwin Sekhar T K
> > >  wrote:
> > > > 
> > > > 
> > > > Added CRC compute APIs for arm64 utilizing the pmull
> > > > capability
> > > > 
> > > > Added new file net_crc_neon.h to hold the arm64 pmull
> > > > CRC implementation
> > > > 
> > > > Verified the changes with crc_autotest unit test case
> > > > 
> > > > Signed-off-by: Ashwin Sekhar T K  > > > com>
> > > > ---
> > > > v2:
> > > > * Fixed merge conflict in MAINTAINERS
> > > > 
> > > > v3:
> > > > * Moved feature detection changes and GCC_VERSION definition
> > > >   changes to separate commit
> > > > * Replaced usage of assert() with RTE_ASSERT()
> > > > * Made the comments in rte_vect.h more positive in sense
> > > > 
> > > > v4:
> > > > * Rebased on top of latest commit
> > > > 
> > > >  MAINTAINERS   |   1 +
> > > >  lib/librte_eal/common/include/arch/arm/rte_vect.h |  28 ++
> > > >  lib/librte_net/net_crc_neon.h | 357
> > > > ++
> > > >  lib/librte_net/rte_net_crc.c  |  34 ++-
> > > >  lib/librte_net/rte_net_crc.h  |   2 +
> > > >  5 files changed, 416 insertions(+), 6 deletions(-)
> > > >  create mode 100644 lib/librte_net/net_crc_neon.h
> > > > 
> > > > 
> > ...
> > > 
> > > > 
> > > > +
> > > > +struct crc_pmull_ctx crc32_eth_pmull __rte_aligned(16);
> > > > +struct crc_pmull_ctx crc16_ccitt_pmull __rte_aligned(16);
> > > > +
> > > > +static inline uint8x16_t
> > > > +extract_vector(uint8x16_t v0, uint8x16_t v1, const int n)
> > > > +{
> > > > +   switch (n) {
> > > > +   case 0: return vextq_u8(v0, v1, 0);
> > > > +   case 1: return vextq_u8(v0, v1, 1);
> > > > +   case 2: return vextq_u8(v0, v1, 2);
> > > > +   case 3: return vextq_u8(v0, v1, 3);
> > > > +   case 4: return vextq_u8(v0, v1, 4);
> > > > +   case 5: return vextq_u8(v0, v1, 5);
> > > > +   case 6: return vextq_u8(v0, v1, 6);
> > > > +   case 7: return vextq_u8(v0, v1, 7);
> > > > +   case 8: return vextq_u8(v0, v1, 8);
> > > > +   case 9: return vextq_u8(v0, v1, 9);
> > > > +   case 10: return vextq_u8(v0, v1, 10);
> > > > +   case 11: return vextq_u8(v0, v1, 11);
> > > > +   case 12: return vextq_u8(v0, v1, 12);
> > > > +   case 13: return vextq_u8(v0, v1, 13);
> > > > +   case 14: return vextq_u8(v0, v1, 14);
> > > > +   case 15: return vextq_u8(v0, v1, 15);
> > > > +   }
> > > > +   return v1;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Shifts right 128 bit register by specified number of bytes
> > > > + *
> > > > + * @param reg 128 bit value
> > > > + * @param num number of bytes to shift reg by (0-16)
> > > > + *
> > > > + * @return reg << (num * 8)
> > > > + */
> > > > +static inline uint64x2_t
> > > > +shift_bytes_right(uint64x2_t reg, const unsigned int num)
> > > > +{
> > > > +   /* Right Shift */
> > > > +   return vreinterpretq_u64_u8(extract_vector(
> > > > +   vreinterpretq_u8_u64(reg),
> > > > +   vdupq_n_u8(0),
> > > > +   num));
> > > > +}
> > > > +
> > > > +/**
> > > > + * Shifts left 128 bit register by specified number of bytes
> > > > + *
> > > > + * @param reg 128 bit value
> > > > + * @param num number of bytes to shift reg by (0-16)
> > > > + *
> > > > + * @return reg << (num * 8)
> > > > + */
> > > > +static inline uint64x2_t
> > > > +shift_bytes_left(uint64x2_t reg, const unsigned int num)
> > > > +{
> > > > +   /* Left Shift */
> > > > +   return vreinterpretq_u64_u8(extract_vector(
> > > > +   vdupq_n_u8(0),
> > > > +   vreinterpretq_u8_u64(reg),
> > > > +   16 - num));
> > > > +}
> > > > +
> > > Can you move shift_bytes_right/shift_bytes_left to rte_vect.h
> > > because
> > > they are common functions?
> > These are not really common functions. I dont think it will have a
> > wider usage as its shifting by bytes and not by bits.
> > 
> I think these shifting may be used by other functions.
> For example, to replace  _mm_srli_si128.
> 
> > 
> > In x86 case also, xmm_shift_left is not made a common function.
> > 
> But its counterpart right shifting (_mm_srli_si128) is...
> 
> > 
> > Moreover, I have not tested the behaviour of these functions when
> > the
> > shift amt is (< 0) or (> 16) as these cases will never arise in the
> > CRC
> > code.
> > 
> You can define thee functions according to current requirement.
> And I don't think this parameter can be <0 or > 16.

Okay. In that case, I will move it to rte_vect.h.

Ashwin

Re: [dpdk-dev] [PATCH v4 6/8] examples/l3fwd: add neon support for l3fwd

2017-05-14 Thread Sekhar, Ashwin
On Mon, 2017-05-15 at 11:34 +0800, Jianbo Liu wrote:
> Use ARM NEON intrinsics to accelerate l3 fowarding.
> 
> Signed-off-by: Jianbo Liu 
Acked-by: Ashwin Sekhar T K 
> ---
>  examples/l3fwd/l3fwd_em.c|   4 +-
>  examples/l3fwd/l3fwd_em_hlm.h|  17 ++-
>  examples/l3fwd/l3fwd_em_hlm_neon.h   |  74 ++
>  examples/l3fwd/l3fwd_em_sequential.h |  18 ++-
>  examples/l3fwd/l3fwd_lpm.c   |   4 +-
>  examples/l3fwd/l3fwd_lpm_neon.h  | 193
> ++
>  examples/l3fwd/l3fwd_neon.h  | 259
> +++
>  7 files changed, 563 insertions(+), 6 deletions(-)
>  create mode 100644 examples/l3fwd/l3fwd_em_hlm_neon.h
>  create mode 100644 examples/l3fwd/l3fwd_lpm_neon.h
>  create mode 100644 examples/l3fwd/l3fwd_neon.h
> 
> diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
> index ba844b2..da96cfd 100644
> --- a/examples/l3fwd/l3fwd_em.c
> +++ b/examples/l3fwd/l3fwd_em.c
> @@ -328,7 +328,7 @@ struct ipv6_l3fwd_em_route {
>   return (uint8_t)((ret < 0) ? portid :
> ipv6_l3fwd_out_if[ret]);
>  }
>  
> -#if defined(__SSE4_1__)
> +#if defined(__SSE4_1__) || defined(RTE_MACHINE_CPUFLAG_NEON)
>  #if defined(NO_HASH_MULTI_LOOKUP)
>  #include "l3fwd_em_sequential.h"
>  #else
> @@ -709,7 +709,7 @@ struct ipv6_l3fwd_em_route {
>   if (nb_rx == 0)
>   continue;
>  
> -#if defined(__SSE4_1__)
> +#if defined(__SSE4_1__) || defined(RTE_MACHINE_CPUFLAG_NEON)
>   l3fwd_em_send_packets(nb_rx, pkts_burst,
>   portid,
> qconf);
>  #else
> diff --git a/examples/l3fwd/l3fwd_em_hlm.h
> b/examples/l3fwd/l3fwd_em_hlm.h
> index 636dea4..b9163e3 100644
> --- a/examples/l3fwd/l3fwd_em_hlm.h
> +++ b/examples/l3fwd/l3fwd_em_hlm.h
> @@ -35,8 +35,13 @@
>  #ifndef __L3FWD_EM_HLM_H__
>  #define __L3FWD_EM_HLM_H__
>  
> +#if defined(__SSE4_1__)
>  #include "l3fwd_sse.h"
>  #include "l3fwd_em_hlm_sse.h"
> +#elif defined(RTE_MACHINE_CPUFLAG_NEON)
> +#include "l3fwd_neon.h"
> +#include "l3fwd_em_hlm_neon.h"
> +#endif
>  
>  static inline __attribute__((always_inline)) void
>  em_get_dst_port_ipv4x8(struct lcore_conf *qconf, struct rte_mbuf
> *m[8],
> @@ -238,7 +243,7 @@ static inline __attribute__((always_inline))
> uint16_t
>  l3fwd_em_send_packets(int nb_rx, struct rte_mbuf **pkts_burst,
>   uint8_t portid, struct lcore_conf *qconf)
>  {
> - int32_t j;
> + int32_t i, j, pos;
>   uint16_t dst_port[MAX_PKT_BURST];
>  
>   /*
> @@ -247,6 +252,11 @@ static inline __attribute__((always_inline))
> uint16_t
>    */
>   int32_t n = RTE_ALIGN_FLOOR(nb_rx, 8);
>  
> + for (j = 0; j < 8 && j < nb_rx; j++) {
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
> +    struct ether_hdr *) +
> 1);
> + }
> +
>   for (j = 0; j < n; j += 8) {
>  
>   uint32_t pkt_type =
> @@ -263,6 +273,11 @@ static inline __attribute__((always_inline))
> uint16_t
>   uint32_t tcp_or_udp = pkt_type &
>   (RTE_PTYPE_L4_TCP | RTE_PTYPE_L4_UDP);
>  
> + for (i = 0, pos = j + 8; i < 8 && pos < nb_rx; i++,
> pos++) {
> + rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[po
> s],
> +    struct
> ether_hdr *) + 1);
> + }
> +
>   if (tcp_or_udp && (l3_type == RTE_PTYPE_L3_IPV4)) {
>  
>   em_get_dst_port_ipv4x8(qconf,
> &pkts_burst[j], portid,
> diff --git a/examples/l3fwd/l3fwd_em_hlm_neon.h
> b/examples/l3fwd/l3fwd_em_hlm_neon.h
> new file mode 100644
> index 000..dae1acf
> --- /dev/null
> +++ b/examples/l3fwd/l3fwd_em_hlm_neon.h
> @@ -0,0 +1,74 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2016 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2017, Linaro Limited
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or
> without
> + *   modification, are permitted provided that the following
> conditions
> + *   are met:
> + *
> + * * Redistributions of source code must retain the above
> copyright
> + *   notice, this list of conditions and the following
> disclaimer.
> + * * Redistributions in binary form must reproduce the above
> copyright
> + *   notice, this list of conditions and the following
> disclaimer in
> + *   the documentation and/or other materials provided with the
> + *   distribution.
> + * * Neither the name of Intel Corporation nor the names of its
> + *   contributors may be used to endorse or promote products
> derived
> + *   from this software without specific prior written
> permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
> NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES

Re: [dpdk-dev] [PATCH] examples/performance-thread: add arm64 support

2017-05-18 Thread Sekhar, Ashwin
On Thu, 2017-05-18 at 14:35 +0800, Jianbo Liu wrote:
> On 18 May 2017 at 02:44, Jerin Jacob 
> wrote:
> > 
> > -Original Message-
> > > 
> > > Date: Wed, 17 May 2017 11:19:49 -0700
> > > From: Ashwin Sekhar T K 
> > > To: jerin.ja...@caviumnetworks.com, john.mcnam...@intel.com,
> > >  jianbo@linaro.org
> > > Cc: dev@dpdk.org, Ashwin Sekhar T K  > > .com>
> > > Subject: [dpdk-dev] [PATCH] examples/performance-thread: add
> > > arm64 support
> > > X-Mailer: git-send-email 2.12.2
> > > 
> > > Updated Makefile to allow compilation for arm64 architecture.
> > > 
> > > Moved the code for setting the initial stack to architecture
> > > specific
> > > directory.
> > Please split the patch to two
> > - "arch_set_stack" abstraction and associated x86 change
> > - arm64 support
> There are so many redundant code in l3fwd and l3fwd-thread, I think
> it's possible to merge them.
> 
Yes. But I think its better to do them as a completely separate patch
set.
> > 
> > 
> > Thanks Ashwin.
> > 
> > I think, This may be the last feature to make arm64 at par with x86
> > features
> > supported in DPDK.
> > 
> > /Jerin

Re: [dpdk-dev] [PATCH v2 2/2] examples/performance-thread: add arm64 support

2017-05-18 Thread Sekhar, Ashwin
On Thu, 2017-05-18 at 14:25 +0530, Jerin Jacob wrote:
> -Original Message-
> > 
> > Date: Thu, 18 May 2017 00:34:26 -0700
> > From: Ashwin Sekhar T K 
> > To: jerin.ja...@caviumnetworks.com, john.mcnam...@intel.com,
> >  jianbo@linaro.org
> > Cc: dev@dpdk.org, Ashwin Sekhar T K  > om>
> > Subject: [dpdk-dev] [PATCH v2 2/2] examples/performance-thread: add
> > arm64
> >  support
> > X-Mailer: git-send-email 2.12.2
> > 
> > Updated Makefile to allow compilation for arm64 architecture.
> > 
> > Added necessary arm64 support for lthread.
> > 
> > Fixed minor compilation errors for arm64 compilation.
> > 
> > Tested the apps l3fwd-thread and lthread_pthread_shim on thunderx
> > and x86_64.
> > 
> > +void
> > +ctx_switch(struct ctx *new_ctx __rte_unused, struct ctx *curr_ctx
> > __rte_unused)
> > +{
> > +   /* SAVE CURRENT CONTEXT */
> > +   asm volatile (
> > +   /* Save SP */
> > +   "mov x3, sp\n"
> > +   "str x3, [x1, #0]\n"
> > +
> > +   /* Save FP and LR */
> > +   "stp x29, x30, [x1, #8]\n"
> > +
> > +   /* Save Callee Saved Regs x19 - x28 */
> > +   "stp x19, x20, [x1, #24]\n"
> > +   "stp x21, x22, [x1, #40]\n"
> > +   "stp x23, x24, [x1, #56]\n"
> > +   "stp x25, x26, [x1, #72]\n"
> > +   "stp x27, x28, [x1, #88]\n"
> > +    );
> IMO, We need to save SIMD registers in the context as well.
> x86 code also not doing that, looks like it is an obvious bug in x86
> code as
> well.
> 
Yes. You are correct. Need to save the bottom 64-bits of called saved
ASIMD regs v8-v15. Will update the patch.