Hi Aaron,

>-----Original Message-----
>From: Aaron Conole <acon...@redhat.com>
>Sent: Tuesday, June 18, 2019 2:55 AM
>To: Jerin Jacob Kollanukkaran <jer...@marvell.com>
>Cc: dev@dpdk.org; Nithin Kumar Dabilpuram
><ndabilpu...@marvell.com>; Vamsi Krishna Attunuru
><vattun...@marvell.com>; Pavan Nikhilesh Bhagavatula
><pbhagavat...@marvell.com>; Olivier Matz <olivier.m...@6wind.com>
>Subject: [EXT] Re: [dpdk-dev] [PATCH v3 25/27] mempool/octeontx2:
>add optimized dequeue operation for arm64
>
>> From: Pavan Nikhilesh <pbhagavat...@marvell.com>
>>
>> This patch adds an optimized arm64 instruction based routine to leverage
>> CPU pipeline characteristics of octeontx2. The theme is to fill the
>> pipeline with CASP operations as much HW can do so that HW can do alloc()
>> HW ops in full throttle.
>>
>> Cc: Olivier Matz <olivier.m...@6wind.com>
>> Cc: Aaron Conole <acon...@redhat.com>
>>
>> Signed-off-by: Pavan Nikhilesh <pbhagavat...@marvell.com>
>> Signed-off-by: Jerin Jacob <jer...@marvell.com>
>> Signed-off-by: Vamsi Attunuru <vattun...@marvell.com>
>> ---
>>  drivers/mempool/octeontx2/otx2_mempool_ops.c | 291 +++++++++++++++++++
>>  1 file changed, 291 insertions(+)
>>
>> diff --git a/drivers/mempool/octeontx2/otx2_mempool_ops.c b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>> index c59bd73c0..e6737abda 100644
>> --- a/drivers/mempool/octeontx2/otx2_mempool_ops.c
>> +++ b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>> @@ -37,6 +37,293 @@ npa_lf_aura_op_alloc_one(const int64_t wdata, int64_t * const addr,
>>  	return -ENOENT;
>>  }
>>
>> +#if defined(RTE_ARCH_ARM64)
>> +static __rte_noinline int
>> +npa_lf_aura_op_search_alloc(const int64_t wdata, int64_t * const addr,
>> +		void **obj_table, unsigned int n)
>> +{
>> +	uint8_t i;
>> +
>> +	for (i = 0; i < n; i++) {
>> +		if (obj_table[i] != NULL)
>> +			continue;
>> +		if (npa_lf_aura_op_alloc_one(wdata, addr, obj_table, i))
>> +			return -ENOENT;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static __attribute__((optimize("-O3"))) __rte_noinline int __hot
>
>Sorry if I missed this before.
>
>Is there a good reason to hard-code this optimization, rather than let
>the build system provide it?

Some compiler versions don't support __int128_t operands for CASP
inline asm. That is, if the optimization level is reduced to -O0, the
CASP register restrictions aren't followed and the compiler might end up
violating the CASP rules, for example:

/tmp/ccSPMGzq.s:1648: Error: reg pair must start from even reg at operand 1 - `casp x21,x22,x0,x1,[x19]'
/tmp/ccSPMGzq.s:1706: Error: reg pair must start from even reg at operand 1 - `casp x13,x14,x0,x1,[x11]'
/tmp/ccSPMGzq.s:1745: Error: reg pair must start from even reg at operand 1 - `casp x9,x10,x0,x1,[x7]'
/tmp/ccSPMGzq.s:1775: Error: reg pair must start from even reg at operand 1 - `casp x7,x8,x0,x1,[x5]'

Forcing -O3 with __rte_noinline in place fixes it, because the register
pairs then end up correctly aligned.

Regards,
Pavan.

>
>> +npa_lf_aura_op_alloc_bulk(const int64_t wdata, int64_t * const addr,
>> +		unsigned int n, void **obj_table)
>> +{
>> +	const __uint128_t wdata128 = ((__uint128_t)wdata << 64) | wdata;
>> +	uint64x2_t failed = vdupq_n_u64(~0);
>> +
>> +	switch (n) {
>> +	case 32:
>> +	{
>> +		__uint128_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
>> +		__uint128_t t10, t11;
>> +
>> +		asm volatile (
>> +		".cpu generic+lse\n"
>> +		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"