octeontx2: add optimized dequeue operation for arm64

Pavan Nikhilesh Bhagavatula Tue, 18 Jun 2019 00:40:11 -0700

Hi Aaron,

>-----Original Message-----
>From: Aaron Conole <acon...@redhat.com>
>Sent: Tuesday, June 18, 2019 2:55 AM
>To: Jerin Jacob Kollanukkaran <jer...@marvell.com>
>Cc: dev@dpdk.org; Nithin Kumar Dabilpuram
><ndabilpu...@marvell.com>; Vamsi Krishna Attunuru
><vattun...@marvell.com>; Pavan Nikhilesh Bhagavatula
><pbhagavat...@marvell.com>; Olivier Matz <olivier.m...@6wind.com>
>Subject: [EXT] Re: [dpdk-dev] [PATCH v3 25/27] mempool/octeontx2:
>add optimized dequeue operation for arm64
>
>> From: Pavan Nikhilesh <pbhagavat...@marvell.com>
>>
>> This patch adds an optimized arm64 instruction based routine to
>leverage
>> CPU pipeline characteristics of octeontx2. The theme is to fill the
>> pipeline with CASP operations as much HW can do so that HW can do
>alloc()
>> HW ops in full throttle.
>>
>> Cc: Olivier Matz <olivier.m...@6wind.com>
>> Cc: Aaron Conole <acon...@redhat.com>
>>
>> Signed-off-by: Pavan Nikhilesh <pbhagavat...@marvell.com>
>> Signed-off-by: Jerin Jacob <jer...@marvell.com>
>> Signed-off-by: Vamsi Attunuru <vattun...@marvell.com>
>> ---
>>  drivers/mempool/octeontx2/otx2_mempool_ops.c | 291
>+++++++++++++++++++
>>  1 file changed, 291 insertions(+)
>>
>> diff --git a/drivers/mempool/octeontx2/otx2_mempool_ops.c
>b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>> index c59bd73c0..e6737abda 100644
>> --- a/drivers/mempool/octeontx2/otx2_mempool_ops.c
>> +++ b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>> @@ -37,6 +37,293 @@ npa_lf_aura_op_alloc_one(const int64_t
>wdata, int64_t * const addr,
>>      return -ENOENT;
>>  }
>>
>> +#if defined(RTE_ARCH_ARM64)
>> +static __rte_noinline int
>> +npa_lf_aura_op_search_alloc(const int64_t wdata, int64_t * const
>addr,
>> +            void **obj_table, unsigned int n)
>> +{
>> +    uint8_t i;
>> +
>> +    for (i = 0; i < n; i++) {
>> +            if (obj_table[i] != NULL)
>> +                    continue;
>> +            if (npa_lf_aura_op_alloc_one(wdata, addr, obj_table,
>i))
>> +                    return -ENOENT;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static  __attribute__((optimize("-O3"))) __rte_noinline int __hot
>
>Sorry if I missed this before.
>
>Is there a good reason to hard-code this optimization, rather than let
>the build system provide it?


Some compiler versions don't handle __int128_t operands for CASP inline asm
correctly, i.e. if the optimization level is reduced to -O0, the CASP register
restrictions aren't followed and the compiler might end up violating the CASP
rules, for example:

/tmp/ccSPMGzq.s:1648: Error: reg pair must start from even reg at operand 1 - 
`casp x21,x22,x0,x1,[x19]'
/tmp/ccSPMGzq.s:1706: Error: reg pair must start from even reg at operand 1 - 
`casp x13,x14,x0,x1,[x11]'
/tmp/ccSPMGzq.s:1745: Error: reg pair must start from even reg at operand 1 - 
`casp x9,x10,x0,x1,[x7]'
/tmp/ccSPMGzq.s:1775: Error: reg pair must start from even reg at operand 1 - 
`casp x7,x8,x0,x1,[x5]'

Forcing -O3 with __rte_noinline in place fixes it, as the register pair
alignment then fits the CASP rules.
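
To make the per-function override concrete, here is a minimal, portable sketch
of the pattern only. It is not the driver code: there is no CASP in it, so it
also builds on non-arm64 hosts. The names SKETCH_NOINLINE, SKETCH_FORCE_O3,
sketch_dup64, sketch_lo, sketch_hi and sketch_selfcheck are hypothetical and
exist only for this illustration. The cast through uint64_t is my own choice
for the sketch, so that a negative wdata does not sign-extend into the upper
half. It assumes GCC on a 64-bit target where __uint128_t is available.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for DPDK's __rte_noinline. */
#define SKETCH_NOINLINE __attribute__((noinline))

#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: force -O3 for one function regardless of the build flags. */
#define SKETCH_FORCE_O3 __attribute__((optimize("O3")))
#else
/* Assumption: other compilers simply use the global optimization level. */
#define SKETCH_FORCE_O3
#endif

/*
 * Duplicate a 64-bit word into both halves of a 128-bit value, in the
 * spirit of wdata128 in the patch. Going through uint64_t keeps the
 * upper half equal to the lower half even when wdata is negative.
 */
static SKETCH_FORCE_O3 SKETCH_NOINLINE __uint128_t
sketch_dup64(int64_t wdata)
{
	const __uint128_t w = (uint64_t)wdata;

	return (w << 64) | w;
}

/* Lower 64 bits of a 128-bit value. */
static inline uint64_t
sketch_lo(__uint128_t v)
{
	return (uint64_t)v;
}

/* Upper 64 bits of a 128-bit value. */
static inline uint64_t
sketch_hi(__uint128_t v)
{
	return (uint64_t)(v >> 64);
}

/* Tiny self-check: both halves must carry the same 64-bit pattern. */
static inline void
sketch_selfcheck(void)
{
	const __uint128_t v = sketch_dup64(-1);

	assert(sketch_lo(v) == UINT64_MAX);
	assert(sketch_hi(v) == UINT64_MAX);
}
```

The attribute only changes how this one function is compiled. The rest of the
file still follows whatever optimization level the build system passes.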

Regards,
Pavan.

>
>> +npa_lf_aura_op_alloc_bulk(const int64_t wdata, int64_t * const
>addr,
>> +                      unsigned int n, void **obj_table)
>> +{
>> +    const __uint128_t wdata128 = ((__uint128_t)wdata << 64) |
>wdata;
>> +    uint64x2_t failed = vdupq_n_u64(~0);
>> +
>> +    switch (n) {
>> +    case 32:
>> +    {
>> +            __uint128_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
>> +            __uint128_t t10, t11;
>> +
>> +            asm volatile (
>> +            ".cpu  generic+lse\n"
>> +            "casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
