Hi Richard,

The 06/09/2020 12:44, Richard Sandiford wrote:
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi Richard,
> >
> > The 06/08/2020 16:42, Richard Sandiford wrote:
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> >> > index 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c 100644
> >> > --- a/gcc/config/aarch64/aarch64.c
> >> > +++ b/gcc/config/aarch64/aarch64.c
> >> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
> >> >    if (VECTOR_MODE_P (mode))
> >> >      mode = GET_MODE_INNER (mode);
> >> >
> >> > +  /* The by element versions of the instruction has the same costs as the
> >> > +     normal 3 vector version.  So don't add the costs of the duplicate into
> >> > +     the costs of the multiply.  */
> >> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> >> > +    op0 = XEXP (op0, 0);
> >> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> >> > +    op1 = XEXP (op1, 0);
> >> > +
> >> >    /* Integer multiply/fma.  */
> >> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >> >      {
> >>
> >> SVE doesn't have duplicating forms, so I think we should put this code
> >> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> >> and then restrict it to VEC_ADVSIMD modes.
> >>
> >> (SVE FMUL does have an indexed form, but the index is relative to the
> >> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
> >>
> >
> > Done, I have updated the patch. (See attached)
> >
> >> I guess there's a danger that this could underestimate the cost for
> >> integer modes, if the scalar integer input needs to be moved from GPRs.
> >> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> >> accurate, even though it's still one instruction before RA.
> >>
> >> But I guess there's no perfect answer there.  The new code will be
> >> right for integer modes in some cases and not in others.  Same if
> >> we leave things as they are.  But maybe it'd be worth having a comment
> >> to say that we're assuming the best case, i.e. that the duplicated
> >> value is naturally in FPRs?
> >>
> >
> > Hmm I haven't added the comment yet since I don't fully understand when the
> > integer case would be misleading.
> >
> > In both cases the cost for the GPR is paid by the MOV no? I'm missing
> > why having the MUL account for it would be better in some cases.
>
> The point was that any MOV isn't exposed until after register allocation,
> whereas costs are usually applied before then.  So before RA:
>
> > For instance for the integer case we used to generate
> >
> >   dup    v0.4s, w2
> >   mul    v2.4s, v2.4s, v0.4s
>
> ...this was costed as:
>
>   (set (reg:V4SI R2) (vec_duplicate:V4SI (reg:SI R1)))
>   (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in a GPR.
>
> > but now do
> >
> >   fmov   s0, w2
> >   mul    v2.4s, v2.4s, v0.s[0]
>
> ...and this is costed as:
>
>   (set (reg:V4SI R3) (mult:V4SI ...))
>
> and so accurate when R1 naturally ends up in an FPR (without needing
> a reload to put it there).
>
> In other words, before RA, the patch is making the optimistic assumption
> that R1 is already in FPRs and so a separate FMOV won't be needed.
>
Aargggs... yes, that makes sense.  Sorry, when I looked at the dump before I
didn't notice that the order was switched.  The SET was for the load, of
course. :(

I have added the comment as suggested, thanks for the explanation.

OK for master?

Thanks,
Tamar

> Thanks,
> Richard
>
> > Which is better on older cores such Cortex-A55 and no different on newer
> > cores such as Cortex-A76 according to the optimization guides.
> >
> > Regards,
> > Tamar
> >
> >> Thanks,
> >> Richard

--
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..5a5a9ad44f0945b4d6a869fc2b4e857022659c55 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,22 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+        {
+          /* The by-element versions of the instruction have the same costs as
+             the normal 3-vector version.  So don't add the costs of the
+             duplicate into the costs of the multiply.  We make the assumption
+             that the value in the VEC_DUPLICATE is already on the FP & SIMD
+             side.  This means costing a MUL by element pre-RA is a bit
+             optimistic.  */
+          if (GET_CODE (op0) == VEC_DUPLICATE)
+            op0 = XEXP (op0, 0);
+          else if (GET_CODE (op1) == VEC_DUPLICATE)
+            op1 = XEXP (op1, 0);
+        }
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+  for (int x = 0; x < 16; x++)
+    res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+  for (int x = 0; x < 16; x++)
+    res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
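
For orientation only, not part of the patch: a rough sketch of the kind of
sequence the scan-assembler patterns above are matching, assuming the new
costing kicks in.  Register numbers, loads/stores and scheduling are
illustrative, not the exact compiler output.

    /* s_mult_i: b arrives in w2, so it is moved once to the FP/SIMD side
       and reused by the indexed multiplies instead of being duplicated.  */
    fmov    s0, w2
    ldr     q1, [x1]
    mul     v1.4s, v1.4s, v0.s[0]
    str     q1, [x0]
    ...

    /* s_mult_f: b already arrives in s0, so no fmov is needed at all.  */
    ldr     q1, [x1]
    fmul    v1.4s, v1.4s, v0.s[0]
    str     q1, [x0]
    ...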