Hi Richard

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: 03 November 2020 11:34
> To: Sudakshina Das <sudi....@arm.com>
> Cc: Wilco Dijkstra <wilco.dijks...@arm.com>; gcc-patches@gcc.gnu.org;
> Kyrylo Tkachov <kyrylo.tkac...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>
> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> __builtin_memset
> 
> Sudakshina Das <sudi....@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: 30 October 2020 19:56
> >> To: Sudakshina Das <sudi....@arm.com>
> >> Cc: Wilco Dijkstra <wilco.dijks...@arm.com>; gcc-patches@gcc.gnu.org;
> >> Kyrylo Tkachov <kyrylo.tkac...@arm.com>; Richard Earnshaw
> >> <richard.earns...@arm.com>
> >> Subject: Re: [PATCH] aarch64: Add backend support for expanding
> >> __builtin_memset
> >>
> >> > +  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
> >> > +  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
> >> > +
> >> > +  /* Prepare the val using a DUP v0.16B, val.  */
> >> > +  if (CONST_INT_P (val))
> >> > +    {
> >> > +      val = force_reg (QImode, val);
> >> > +    }
> >> > +  src = gen_reg_rtx (V16QImode);
> >> > +  emit_insn (gen_aarch64_simd_dupv16qi(src, val));
> >>
> >> I think we should use:
> >>
> >>   src = expand_vector_broadcast (V16QImode, val);
> >>
> >> here (without the CONST_INT_P check), so that for constants we just
> >> move a constant directly into a register.
> >>
> >
> > Sorry to bring this up again. When I tried expand_vector_broadcast, I
> > see the following behaviour:
> > for __builtin_memset(p, 1, 24) where the duplicated constant fits
> >         movi    v0.16b, 0x1
> >         mov     x1, 72340172838076673
> >         str     x1, [x0, 16]
> >         str     q0, [x0]
> > and an ICE for __builtin_memset(p, 1, 32) where I am guessing the
> > duplicated constant does not fit
> > x.c:7:30: error: unrecognizable insn:
> >     7 | { __builtin_memset(p, 1, 32);}
> >       |                              ^
> > (insn 8 7 0 2 (parallel [
> >             (set (mem:V16QI (reg:DI 94) [0 MEM <char[1:32]> [(void *)p_2(D)]+0 S16 A8])
> >                 (const_vector:V16QI [
> >                         (const_int 1 [0x1]) repeated x16
> >                     ]))
> >             (set (mem:V16QI (plus:DI (reg:DI 94)
> >                         (const_int 16 [0x10])) [0 MEM <char[1:32]> [(void *)p_2(D)]+16 S16 A8])
> >                 (const_vector:V16QI [
> >                         (const_int 1 [0x1]) repeated x16
> >                     ]))
> >         ]) "x.c":7:3 -1
> >      (nil))
> > during RTL pass: vregs
> 
> Ah, yeah, I guess we need to call force_reg on the result.
> 
> >> So yeah, I'm certainly not questioning the speed_p value of 256.
> >> I'm sure you and Wilco have picked the best value for that.  But -Os
> >> stuff can usually be justified on first principles and I wasn't sure
> >> where the value of 128 came from.
> >>
> >
> > I had another chat with Wilco about the 128-byte value for !speed_p.  We
> > estimate the average number of instructions up to 128 bytes would be ~3,
> > which is similar to the cost of a memset call.  But I did go back and
> > think about the tuning argument of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
> > a bit more, because you are right that based on that the average number
> > of instructions can double.
> > I would propose using 256/128 based on speed_p but halving the value
> > based on the tune parameter.  Obviously the assumption here is that we
> > are respecting the core's choice of avoiding STP of Q registers (given
> > that I do not see other uses of AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
> > being changed by -Os).
> 
> Yeah, but I think the lack of an -Os check in the existing code might be a
> mistake.  The point is that STP Q is smaller than two separate STR Qs, so 
> using
> it is a size optimisation even if it's not a speed optimisation.
> And like I say, -Os isn't supposed to be striking a balance between size and
> speed: it's supposed to be going for size quite aggressively.
> 
> So TBH I have slight preference for keeping the current value and only
> checking the tuning flag for speed_p.  But I agree that halving the value
> would be self-consistent, so if you or Wilco believe strongly that halving is
> better, that'd be OK with me too.
> 
> > There might be a debate on how useful
> > AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
> > is in the context of memset/memcpy but that needs more analysis and I
> > would say should be a separate patch.
> 
> Agreed.
> 
> >> >> > +      if (n > 0 && n < copy_limit / 2)
> >> >> > +     {
> >> >> > +       next_mode = smallest_mode_for_size (n, MODE_INT);
> >> >> > +       /* Last 1-byte causes the compiler to optimize to STRB when it
> >> >> > +          should use STR Bx, [mem] since we already used SIMD registers.
> >> >> > +          So force it to HImode.  */
> >> >> > +       if (next_mode == QImode)
> >> >> > +         next_mode = HImode;
> >> >>
> >> >> Is this always better?  E.g. for variable inputs and zero it seems
> >> >> quite natural to store the original scalar GPR.
> >> >>
> >> >> If we do do this, I think we should assert before the loop that n > 1.
> >> >>
> >> >> Also, it would be good to cover this case in the tests.
> >> >
> >> > To give a background on this:
> >> > So the case in point here is when we are copying the _last_ 1 byte.
> >> > So for the following: void foo (void *p) { __builtin_memset (p, 1, 3); }
> >> > The compiler was generating
> >> >         movi    v0.16b, 0x1
> >> >         mov     w1, 1
> >> >         strb    w1, [x0, 2]
> >> >         str     h0, [x0]
> >> >         ret
> >> > This is because after my expansion in subsequent passes it would
> >> > see (insn 13 12 14 2 (set (reg:QI 99)
> >> >         (subreg:QI (reg:V16QI 98) 0)) "x.c":3:3 -1
> >> >      (nil))
> >> > (insn 14 13 0 2 (set (mem:QI (plus:DI (reg:DI 93)
> >> >                 (const_int 2 [0x2])) [0 MEM <char[1:3]> [(void *)p_2(D)]+2 S1 A8])
> >> >         (reg:QI 99)) "x.c":3:3 -1
> >> >      (nil))
> >> > And "optimize" it away to strb with an extra mov. Ideally this is a
> >> > separate patch to fix this somewhere between cse1 and fwprop1 and emit
> >> >         movi    v0.16b, 0x1
> >> >         str     h0, [x0]
> >> >         str    b0, [x0, 2]
> >> >         ret
> >> > This forcing to HImode was my temporary workaround for now, and we
> >> > generate:
> >> >         movi    v0.16b, 0x1
> >> >         str     h0, [x0]
> >> >         str    h0, [x0, 1]
> >> >         ret
> >> >
> >> > I hope this clarifies things.
> >>
> >> Yeah, this was the case I was expecting it was aimed at, and I can
> >> see why we want to do it (at least for -Os).  My concern was more about:
> >>
> >> > After we have used a MOVI/DUP (for variable or non-zero constant
> >> > cases), which is needed for anything above 1 byte memset, it isn't
> >> > really beneficial to use GPR regs.
> >>
> >> I guess this is all very uarch-dependent, but it seemed odd for, say:
> >>
> >>    memset (x, y, 9);
> >>
> >> to generate:
> >>
> >>         dup     v0.8b, w1
> >>         str     q0, [x0]
> >>         strh    h0, [x0, #7]
> >>
> >> with an overlapping store and an extra dependency on the input to the
> >> final store, instead of:
> >>
> >>         dup     v0.8b, w1
> >>         str     q0, [x0]
> >>         strb    w1, [x0, #8]
> >>
> >> > For zero case, the compiler in later passes seems to figure out
> >> > using wzr despite this change but uses strh instead of strb. For
> >> > example, for zero-setting 65 bytes:
> >> >         movi    v0.4s, 0
> >> >         stp     q0, q0, [x0, 32]
> >> >         stp     q0, q0, [x0]
> >> >         strh    wzr, [x0, 63]
> >> >         ret
> >>
> >> Here too it just feels more natural to use an strb at offset 64.
> >> I know that's not a good justification though. :-)
> >>
> >> Also, I'm not sure how robust this workaround will be.  The RTL
> >> optimisers might get “smarter” and see through the HImode version too.
> >>
> >
> > Hmm you are right that maybe taking the overlapping store and extra
> > dependency on the input is not ideal. I am happy to keep the original
> > code (no forcing HImode) and add the following 2 cases below in a
> > separate bug report as improvements after the patch gets in.
> >
> > Extra mov with non-zero const.
> >         movi    v0.16b, 0x1
> >         mov     w1, 1
> >         strb    w1, [x0, 2]
> >         str     h0, [x0]
> >
> > Not using strb with var
> >         dup    v0.8b, w1
> >          str     h0, [x0]
> >          str     b0, [x0, 2]
> > (So where I would expect it to prefer STRB, it does the opposite!)
> 
> Huh, yeah.  I guess it's currently not able to see through the subreg.
> 
> Thanks,
> Richard

Apologies for the delay. I have attached another version of the patch.
I have disabled the test cases for ILP32.  This is only because the function-body
check fails: there is an additional unsigned-extension instruction for the src
pointer in every test (uxtw    x0, w0).  The actual inlining is no different.
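
To illustrate (a hypothetical ILP32 body, assuming the set128bits test from the
new memset-q-reg.c; only the extra zero-extension of the incoming pointer
differs from the LP64 version):
        uxtw    x0, w0
        dup     v0.16b, w1
        str     q0, [x0]
        ret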

Thanks
Sudi

###############     Attachment also inlined for ease of reply    ###############
 
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 7a34c841355bad88365381912b163c61c5a35811..2aa3f1fddaafae58f0bfb26e5b33fe6a94e85e06 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -510,6 +510,7 @@ bool aarch64_emit_approx_div (rtx, rtx, rtx);
 bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
 void aarch64_expand_call (rtx, rtx, rtx, bool);
 bool aarch64_expand_cpymem (rtx *);
+bool aarch64_expand_setmem (rtx *);
 bool aarch64_float_const_zero_rtx_p (rtx);
 bool aarch64_float_const_rtx_p (rtx);
 bool aarch64_function_arg_regno_p (unsigned);
diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 00b5f8438863bb52c348cfafd5d4db478fe248a7..2b7d8e129991dd7bb91fae6b9a561a7b0de7a855 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1024,16 +1024,19 @@ typedef struct
 #define MOVE_RATIO(speed) \
   (!STRICT_ALIGNMENT ? 2 : (((speed) ? 15 : AARCH64_CALL_RATIO) / 2))
 
-/* For CLEAR_RATIO, when optimizing for size, give a better estimate
-   of the length of a memset call, but use the default otherwise.  */
+/* Like MOVE_RATIO, without -mstrict-align, make decisions in "setmem" when
+   we would use more than 3 scalar instructions.
+   Otherwise follow a sensible default: when optimizing for size, give a better
+   estimate of the length of a memset call, but use the default otherwise.  */
 #define CLEAR_RATIO(speed) \
-  ((speed) ? 15 : AARCH64_CALL_RATIO)
+  (!STRICT_ALIGNMENT ? 4 : (speed) ? 15 : AARCH64_CALL_RATIO)
 
-/* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant, so when
-   optimizing for size adjust the ratio to account for the overhead of loading
-   the constant.  */
+/* SET_RATIO is similar to CLEAR_RATIO, but for a non-zero constant.  Without
+   -mstrict-align, make decisions in "setmem".  Otherwise follow a sensible
+   default:  when optimizing for size adjust the ratio to account for the
+   overhead of loading the constant.  */
 #define SET_RATIO(speed) \
-  ((speed) ? 15 : AARCH64_CALL_RATIO - 2)
+  (!STRICT_ALIGNMENT ? 0 : (speed) ? 15 : AARCH64_CALL_RATIO - 2)
 
 /* Disable auto-increment in move_by_pieces et al.  Use of auto-increment is
    rarely a good idea in straight-line code since it adds an extra address
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index db991e59cbe8c8847f53b86a5b9cf41c799b5ce7..6d88c125ce77e5613acea37f421865119d1b87ae 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7031,6 +7031,9 @@ aarch64_gen_store_pair (machine_mode mode, rtx mem1, rtx reg1, rtx mem2,
     case E_V4SImode:
       return gen_vec_store_pairv4siv4si (mem1, reg1, mem2, reg2);
 
+    case E_V16QImode:
+      return gen_vec_store_pairv16qiv16qi (mem1, reg1, mem2, reg2);
+
     default:
       gcc_unreachable ();
     }
@@ -21289,6 +21292,134 @@ aarch64_expand_cpymem (rtx *operands)
   return true;
 }
 
+/* Like aarch64_copy_one_block_and_progress_pointers, except for memset where
+   *src is a register we have created with the duplicated value to be set.  */
+static void
+aarch64_set_one_block_and_progress_pointer (rtx src, rtx *dst,
+                                           machine_mode mode)
+{
+  /* If we are copying 128bits or 256bits, we can do that straight from
+     the SIMD register we prepared.  */
+  if (known_eq (GET_MODE_BITSIZE (mode), 256))
+    {
+      mode = GET_MODE (src);
+      /* "Cast" the *dst to the correct mode.  */
+      *dst = adjust_address (*dst, mode, 0);
+      /* Emit the memset.  */
+      emit_insn (aarch64_gen_store_pair (mode, *dst, src,
+                                        aarch64_progress_pointer (*dst), src));
+
+      /* Move the pointers forward.  */
+      *dst = aarch64_move_pointer (*dst, 32);
+      return;
+    }
+  if (known_eq (GET_MODE_BITSIZE (mode), 128))
+    {
+      /* "Cast" the *dst to the correct mode.  */
+      *dst = adjust_address (*dst, GET_MODE (src), 0);
+      /* Emit the memset.  */
+      emit_move_insn (*dst, src);
+      /* Move the pointers forward.  */
+      *dst = aarch64_move_pointer (*dst, 16);
+      return;
+    }
+  /* For copying less, we have to extract the right amount from src.  */
+  rtx reg = lowpart_subreg (mode, src, GET_MODE (src));
+
+  /* "Cast" the *dst to the correct mode.  */
+  *dst = adjust_address (*dst, mode, 0);
+  /* Emit the memset.  */
+  emit_move_insn (*dst, reg);
+  /* Move the pointer forward.  */
+  *dst = aarch64_progress_pointer (*dst);
+}
+
+/* Expand setmem, as if from a __builtin_memset.  Return true if
+   we succeed, otherwise return false.  */
+
+bool
+aarch64_expand_setmem (rtx *operands)
+{
+  int n, mode_bits;
+  unsigned HOST_WIDE_INT len;
+  rtx dst = operands[0];
+  rtx val = operands[2], src;
+  rtx base;
+  machine_mode cur_mode = BLKmode, next_mode;
+
+  /* We can't do anything smart if the amount to copy is not constant.  */
+  if (!CONST_INT_P (operands[1]))
+    return false;
+
+  bool speed_p = !optimize_function_for_size_p (cfun);
+
+  /* Default the maximum to 256-bytes.  */
+  unsigned max_set_size = 256;
+
+  /* In case we are optimizing for size or if the core does not
+     want to use STP Q regs, lower the max_set_size.  */
+  max_set_size = (!speed_p
+                 || (aarch64_tune_params.extra_tuning_flags
+                     & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
+                 ? max_set_size/2 : max_set_size;
+
+  len = INTVAL (operands[1]);
+
+  /* Upper bound check.  */
+  if (len > max_set_size)
+    return false;
+
+  base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
+  dst = adjust_automodify_address (dst, VOIDmode, base, 0);
+
+  /* Prepare the val using a DUP/MOVI v0.16B, val.  */
+  src = expand_vector_broadcast (V16QImode, val);
+  src = force_reg (V16QImode, src);
+
+  /* Convert len to bits to make the rest of the code simpler.  */
+  n = len * BITS_PER_UNIT;
+
+  /* Maximum amount to copy in one go.  We allow 256-bit chunks based on the
+     AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS tuning parameter.  setmem expand
+     pattern is only turned on for TARGET_SIMD.  */
+  const int copy_limit = (speed_p
+                         && (aarch64_tune_params.extra_tuning_flags
+                             & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
+                         ? GET_MODE_BITSIZE (TImode) : 256;
+
+  while (n > 0)
+    {
+      /* Find the largest mode in which to do the copy in without
+        over writing.  */
+      opt_scalar_int_mode mode_iter;
+      FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
+       if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
+         cur_mode = mode_iter.require ();
+
+      gcc_assert (cur_mode != BLKmode);
+
+      mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
+      aarch64_set_one_block_and_progress_pointer (src, &dst, cur_mode);
+
+      n -= mode_bits;
+
+      /* Do certain trailing copies as overlapping if it's going to be
+        cheaper.  i.e. less instructions to do so.  For instance doing a 15
+        byte copy it's more efficient to do two overlapping 8 byte copies than
+        8 + 4 + 2 + 1.  */
+      if (n > 0 && n < copy_limit / 2)
+       {
+         next_mode = smallest_mode_for_size (n, MODE_INT);
+         int n_bits = GET_MODE_BITSIZE (next_mode).to_constant ();
+         dst = aarch64_move_pointer (dst, (n - n_bits) / BITS_PER_UNIT);
+         n = n_bits;
+       }
+    }
+
+  return true;
+}
+
+
 /* Split a DImode store of a CONST_INT SRC to MEM DST as two
    SImode stores.  Handle the case when the constant has identical
    bottom and top halves.  This is beneficial when the two stores can be
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 78fe7c43a00432861e59f19330dacec234b58875..f0125271586831ba7cbf66d5282ef732b16c47f6 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -1564,6 +1564,24 @@
 }
 )
 
+;; 0 is dst
+;; 1 is val
+;; 2 is size of copy in bytes
+;; 3 is alignment
+
+(define_expand "setmemdi"
+  [(set (match_operand:BLK 0 "memory_operand")     ;; Dest
+        (match_operand:QI  2 "nonmemory_operand")) ;; Value
+   (use (match_operand:DI  1 "immediate_operand")) ;; Length
+   (match_operand          3 "immediate_operand")] ;; Align
+  "TARGET_SIMD"
+{
+  if (aarch64_expand_setmem (operands))
+    DONE;
+
+  FAIL;
+})
+
 ;; Operands 1 and 3 are tied together by the final condition; so we allow
 ;; fairly lax checking on the second memory operation.
 (define_insn "load_pair_sw_<SX:mode><SX2:mode>"
diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr90883.C b/gcc/testsuite/g++.dg/tree-ssa/pr90883.C
index 0e622f263d2697e22999512142d5296d59af479a..37df17d0b1668d8b0410f7c28b5291147c7d2ad2 100644
--- a/gcc/testsuite/g++.dg/tree-ssa/pr90883.C
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr90883.C
@@ -15,6 +15,6 @@
 
 // We want to match enough here to capture that we deleted an empty
 // constructor store
-// aarch64 and mips will expand to loop to clear because CLEAR_RATIO.
-// { dg-final { scan-tree-dump "Deleted redundant store: .*\.a = {}" "dse1" { xfail { aarch64-*-* mips*-*-* } } } }
+// mips will expand to loop to clear because CLEAR_RATIO.
+// { dg-final { scan-tree-dump "Deleted redundant store: .*\.a = {}" "dse1" { xfail { mips*-*-* } } } }
 
diff --git a/gcc/testsuite/gcc.dg/tree-prof/stringop-2.c b/gcc/testsuite/gcc.dg/tree-prof/stringop-2.c
index b7471bffd9159e560543646e0a6e66ecd00bd6ef..e8b1644e2ba83a9da8bb9281158a3cfb5f04c2db 100644
--- a/gcc/testsuite/gcc.dg/tree-prof/stringop-2.c
+++ b/gcc/testsuite/gcc.dg/tree-prof/stringop-2.c
@@ -20,6 +20,6 @@ main()
    return 0;
 }
 /* autofdo doesn't support value profiling for now: */
-/* { dg-final-use-not-autofdo { scan-ipa-dump "Transformation done: single value 4 stringop" "profile"} } */
+/* { dg-final-use-not-autofdo { scan-ipa-dump "Transformation done: single value 4 stringop" "profile" { target { ! aarch64*-*-* } } } } */
 /* The versioned memset of size 4 should be optimized to an assignment.
-   { dg-final-use-not-autofdo { scan-tree-dump "MEM <\[a-z \]+> \\\[\\(void .\\)&a\\\] = 168430090" "optimized" } } */
+   { dg-final-use-not-autofdo { scan-tree-dump "MEM <\[a-z \]+> \\\[\\(void .\\)&a\\\] = 168430090" "optimized" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-corner-cases.c b/gcc/testsuite/gcc.target/aarch64/memset-corner-cases.c
new file mode 100644
index 0000000000000000000000000000000000000000..c43f0199adcd348370edabf045a532e9abb436e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-corner-cases.c
@@ -0,0 +1,88 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target lp64 } */
+
+#include <stdint.h>
+
+/* One byte variable set should be scalar
+**set1byte:
+**     strb    w1, \[x0\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set1byte (int64_t *src, char c)
+{
+  __builtin_memset (src, c, 1);
+}
+
+/* Special cases for setting 0.  */
+/* 1-byte should be STRB with wzr
+**set0byte:
+**     strb    wzr, \[x0\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set0byte (int64_t *src)
+{
+  __builtin_memset (src, 0, 1);
+}
+
+/* 35bytes would become 4 scalar instructions.  So favour NEON.
+**set0neon:
+**     movi    v0.4s, 0
+**     stp     q0, q0, \[x0\]
+**     str     wzr, \[x0, 31\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set0neon (int64_t *src)
+{
+  __builtin_memset (src, 0, 35);
+}
+
+/* 36bytes should be scalar however.
+**set0scalar:
+**     stp     xzr, xzr, \[x0\]
+**     stp     xzr, xzr, \[x0, 16\]
+**     str     wzr, \[x0, 32\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set0scalar (int64_t *src)
+{
+  __builtin_memset (src, 0, 36);
+}
+
+
+/* 256-bytes expanded
+**set256byte:
+**     dup     v0.16b, w1
+**     stp     q0, q0, \[x0\]
+**     stp     q0, q0, \[x0, 32\]
+**     stp     q0, q0, \[x0, 64\]
+**     stp     q0, q0, \[x0, 96\]
+**     stp     q0, q0, \[x0, 128\]
+**     stp     q0, q0, \[x0, 160\]
+**     stp     q0, q0, \[x0, 192\]
+**     stp     q0, q0, \[x0, 224\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set256byte (int64_t *src, char c)
+{
+  __builtin_memset (src, c, 256);
+}
+
+/* More than 256 bytes goes to memset
+**set257byte:
+**     mov     x2, 257
+**     mov     w1, 99
+**     b       memset
+*/
+void __attribute__((__noinline__))
+set257byte (int64_t *src)
+{
+  __builtin_memset (src, 'c', 257);
+}
+
+/* { dg-final { check-function-bodies "**" "" "" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/memset-q-reg.c b/gcc/testsuite/gcc.target/aarch64/memset-q-reg.c
new file mode 100644
index 0000000000000000000000000000000000000000..156146badbcd98e63d873a4a1c7657f19c027973
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/memset-q-reg.c
@@ -0,0 +1,81 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+/* { dg-require-effective-target lp64 } */
+
+#include <stdint.h>
+
+/*
+**set128bits:
+**     dup     v0.16b, w1
+**     str     q0, \[x0\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set128bits (int64_t *src, char c)
+{
+  __builtin_memset (src, c, 2*sizeof(int64_t));
+}
+
+/*
+**set128bitszero:
+**     stp     xzr, xzr, \[x0\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set128bitszero (int64_t *src)
+{
+  __builtin_memset (src, 0, 2*sizeof(int64_t));
+}
+
+/*
+** set128bitsplus:
+**     dup     v0.16b, w1
+**     str     q0, \[x0\]
+**     str     q0, \[x0, 12\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set128bitsplus (int64_t *src, char c)
+{
+  __builtin_memset (src, c, 7*sizeof(int32_t));
+}
+
+/*
+** set256bits:
+**     movi    v0.16b, 0x63
+**     stp     q0, q0, \[x0\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set256bits (int64_t *src)
+{
+  __builtin_memset (src, 'c', 4*sizeof(int64_t));
+}
+
+/*
+**set256bitszero:
+**     stp     xzr, xzr, \[x0\]
+**     stp     xzr, xzr, \[x0, 16\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set256bitszero (int64_t *src)
+{
+  __builtin_memset (src, 0, 4*sizeof(int64_t));
+}
+
+/*
+** set256bitsplus:
+**     movi    v0.16b, 0x63
+**     stp     q0, q0, \[x0\]
+**     str     q0, \[x0, 32\]
+**     str     d0, \[x0, 48\]
+**     ret
+*/
+void __attribute__((__noinline__))
+set256bitsplus (int64_t *src)
+{
+  __builtin_memset (src, 'c', 7*sizeof(int64_t));
+}
+
+/* { dg-final { check-function-bodies "**" "" "" } } */
