Hi Richard,

The 06/08/2020 16:42, Richard Sandiford wrote:
> Tamar Christina <tamar.christ...@arm.com> writes:
> > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> > index 
> > 97da60762390db81df9cffaf316b909cd1609130..9cc8da338125afa01bc9fb645f4112d2d7ef548c
> >  100644
> > --- a/gcc/config/aarch64/aarch64.c
> > +++ b/gcc/config/aarch64/aarch64.c
> > @@ -11279,6 +11279,14 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, 
> > int outer, bool speed)
> >    if (VECTOR_MODE_P (mode))
> >      mode = GET_MODE_INNER (mode);
> > 
> > +  /* The by element versions of the instruction has the same costs as the
> > +     normal 3 vector version.  So don't add the costs of the duplicate into
> > +     the costs of the multiply.  */
> > +  if (GET_CODE (op0) == VEC_DUPLICATE)
> > +    op0 = XEXP (op0, 0);
> > +  else if (GET_CODE (op1) == VEC_DUPLICATE)
> > +    op1 = XEXP (op1, 0);
> > +
> >    /* Integer multiply/fma.  */
> >    if (GET_MODE_CLASS (mode) == MODE_INT)
> >      {
> 
> SVE doesn't have duplicating forms, so I think we should put this code
> under the “if (VECTOR_MODE_P (mode))” condition, before changing “mode”,
> and then restrict it to VEC_ADVSIMD modes.
> 
> (SVE FMUL does have an indexed form, but the index is relative to the
> start of the associated quadword, so it isn't a VEC_DUPLICATE.)
>

Done, I have updated the patch. (See attached)
 
> I guess there's a danger that this could underestimate the cost for
> integer modes, if the scalar integer input needs to be moved from GPRs.
> In that case the cost of a MULT + VEC_DUPLICATE is probably more
> accurate, even though it's still one instruction before RA.
> 
> But I guess there's no perfect answer there.  The new code will be
> right for integer modes in some cases and not in others.  Same if
> we leave things as they are.  But maybe it'd be worth having a comment
> to say that we're assuming the best case, i.e. that the duplicated
> value is naturally in FPRs?
> 

Hmm I haven't added the comment yet since I don't fully understand when the
integer case would be misleading.

In both cases the cost for the GPR is paid by the MOV no? I'm missing why 
having the MUL
account for it would be better in some cases.

For instance for the integer case we used to generate

        dup     v0.4s, w2
        mul     v2.4s, v2.4s, v0.4s

but now do

        fmov    s0, w2
        mul     v2.4s, v2.4s, v0.s[0]

Which is better on older cores such Cortex-A55 and no different on newer cores 
such as
Cortex-A76 according to the optimization guides.

Regards,
Tamar

> Thanks,
> Richard

-- 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 973c65aa4fb348450872036617362aa17310fb20..d2e959c5276d9b801294c722c92762c5674cb244 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -11279,7 +11279,20 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
   op1 = XEXP (x, 1);
 
   if (VECTOR_MODE_P (mode))
-    mode = GET_MODE_INNER (mode);
+    {
+      unsigned int vec_flags = aarch64_classify_vector_mode (mode);
+      mode = GET_MODE_INNER (mode);
+      if (vec_flags & VEC_ADVSIMD)
+	{
+	  /* The by element versions of the instruction has the same costs as the
+	     normal 3 vector version.  So don't add the costs of the duplicate into
+	     the costs of the multiply.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE)
+	    op0 = XEXP (op0, 0);
+	  else if (GET_CODE (op1) == VEC_DUPLICATE)
+	    op1 = XEXP (op1, 0);
+	}
+    }
 
   /* Integer multiply/fma.  */
   if (GET_MODE_CLASS (mode) == MODE_INT)
diff --git a/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
new file mode 100644
index 0000000000000000000000000000000000000000..513721cee0c8372781e6daf33bc06e256cab8cb8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/asimd-mull-elem.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_float } */
+/* { dg-options "-Ofast" } */
+
+#include <arm_neon.h>
+
+void s_mult_i (int32_t* restrict res, int32_t* restrict a, int32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+void s_mult_f (float32_t* restrict res, float32_t* restrict a, float32_t b)
+{
+    for (int x = 0; x < 16; x++)
+      res[x] = a[x] * b;
+}
+
+/* { dg-final { scan-assembler-times {\s+mul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */
+/* { dg-final { scan-assembler-times {\s+fmul\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.s\[0\]} 4 } } */

Reply via email to