On Wed, 2019-04-10 at 11:52 -0400, Aaron Conole wrote: > Jerin Jacob Kollanukkaran <jer...@marvell.com> writes: > > > On Mon, 2019-04-08 at 14:24 -0400, Aaron Conole wrote: > > > --------------------------------------------------------------- > > > ---- > > > --- > > > Compiler complains of argument type mismatch, like: > > > > Can you share more details on how to reproduce this issue? > > It will be generated using the meson build after enabling the neon > extension support (which isn't currently happening on ARM using meson > as > the build environment).
Can you share the patch to enable this for testing. Since the additional memcpy in fastpath, I need to check the overhead and check the possibility to avoid the memcpy to case. > > > We already have > > CFLAGS_acl_run_neon.o += -flax-vector-conversions > > in the Makefile. > > > > If you are taking out -flax-vector-conversions the correct way to > > fix will be use vreinterpret*. > > > > For me the code looks clean, If unnecessary casting is avoided. > > I agree. I merely make explicit the casts that the compiler will be > implicitly introducing. > > > > ../lib/librte_acl/acl_run_neon.h: In function ‘transition4’: > > > ../lib/librte_acl/acl_run_neon.h:115:2: note: use -flax- > > > vector- > > > conversions > > > to permit conversions between vectors with differing > > > element > > > types > > > or numbers of subparts > > > node_type = vbicq_s32(tr_hi_lo.val[0], index_msk); > > > ^ > > > ../lib/librte_acl/acl_run_neon.h:115:41: error: incompatible > > > type > > > for > > > argument 2 of ‘vbicq_s32’ > > > > > > Signed-off-by: Aaron Conole <acon...@redhat.com> > > > --- > > > lib/librte_acl/acl_run_neon.h | 46 ++++++++++++++++++++--------- > > > ---- > > > -- > > > 1 file changed, 27 insertions(+), 19 deletions(-) > > > > > > > > > > > > /* > > > @@ -179,6 +183,9 @@ search_neon_8(const struct rte_acl_ctx *ctx, > > > const uint8_t **data, > > > acl_match_check_x4(0, ctx, parms, &flows, &index_array[0]); > > > acl_match_check_x4(4, ctx, parms, &flows, &index_array[4]); > > > > > > + memset(&input0, 0, sizeof(input0)); > > > + memset(&input1, 0, sizeof(input1)); > > > > Why this memset only required for arm64? If it real issue, > > Shouldn't > > it required for x86 and ppc ? > > No. Please see the following lines (which is due to the ARM neon > intrinsic for setting individual lanes): > > while (flows.started > 0) { > /* Gather 4 bytes of input data for each stream. */ > input0 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > input0, 0); > input1 = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 4), > input1, 0); > > Note: the first time through this loop, input0 and input1 appear on > the > rhs of the assignment before appearing on the lhs. This will > generate > an uninitialized value warning, even though the assignments are to > individual lanes of the vector. > > I squelched the warning from the compiler in the most brute-force way > possible. Perhaps it would be better to use a static initialization > for > the vector but this code was intended to be RFC and to generate > feedback. > > I guess one alternate approach could be: > > static const int32x4_t ZERO_VEC; > int32x4_t input0 = ZERO_VEC, input1 = ZERO_VEC; > > ... > > int32x4_t input = ZERO_VEC; > > This would have the benefit of keeping the initializer as 'fast' as > possible (although I recall a memset under a certain size threshold > is > the same effect, but not certain). > > Either way, I prefer it to squelching the warning, since the warning > has been found to catch legitimate errors many times. I will get back to this after reproducing the issue locally.