On Tue, May 6, 2014 at 5:59 AM, bin.cheng <bin.ch...@arm.com> wrote:
>
>> -----Original Message-----
>> From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org]
>> On Behalf Of bin.cheng
>> Sent: Monday, May 05, 2014 3:21 PM
>> To: Richard Earnshaw
>> Cc: gcc-patches@gcc.gnu.org
>> Subject: RE: [PATCH ARM] Improve ARM memset inlining
>>
>> Hi Richard,
>> Thanks for reviewing.  I embedded answers to your comments and updated
>> the patch.
>>
>> > -----Original Message-----
>> > From: Richard Earnshaw
>> > Sent: Friday, May 02, 2014 10:00 PM
>> > To: Bin Cheng
>> > Cc: gcc-patches@gcc.gnu.org
>> > Subject: Re: [PATCH ARM] Improve ARM memset inlining
>> >
>> > On 30/04/14 03:52, bin.cheng wrote:
>> > > Hi,
>> > > This patch expands small memset calls into direct memory set
>> > > instructions by introducing a "setmemsi" pattern.  For processors
>> > > without NEON support, it expands memset using general store
>> > > instructions, for example, strd for 4-byte aligned addresses.  For
>> > > processors with NEON support, it expands memset using NEON
>> > > instructions like vstr and miscellaneous vst1.* instructions for
>> > > both aligned and unaligned cases.
>> > >
>> > > This patch depends on
>> > > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg01923.html, otherwise
>> > > vst1.64 will be generated for 32-bit aligned memory units.
>> > >
>> > > There is also one piece of leftover work from this patch: since
>> > > vst1.* instructions only support post-increment addressing mode,
>> > > the inlined memset for unaligned NEON cases should look like:
>> > >     vmov.i32   q8, #...
>> > >     vst1.8    {q8}, [r3]!
>> > >     vst1.8    {q8}, [r3]!
>> > >     vst1.8    {q8}, [r3]!
>> > >     vst1.8    {q8}, [r3]
>> >
>> > Other than for zero, I'd expect the vmov to be vmov.i8 to move an
>> > arbitrary
>>
>> I just used vmov.i32 as an example.  The element size is actually
>> calculated by the function neon_valid_immediate, which works as
>> expected, I think.
>>
>> > byte value into all lanes in a vector.  After that, if the alignment
>> > is known to be more than 8-bit, I'd expect the vst1 instructions
>> > (with the exception of the last store if the length is not a
>> > multiple of the alignment) to use
>> >
>> >     vst1.<align> {reg}, [addr-reg :<align>]!
>> >
>> > Hence, for 16-bit aligned data, we want
>> >
>> >     vst1.16 {q8}, [r3:16]!
>>
>> Did I miss something important?  It seems to me the explicit alignment
>> notes supported are 64/128/256.  So what do you mean by 16-bit
>> alignment here?
>>
>> > > But for now, gcc can't do this, and the code below is generated:
>> > >     vmov.i32   q8, #...
>> > >     vst1.8    {q8}, [r3]
>> > >     add        r2, r3, #16
>> > >     add        r3, r2, #16
>> > >     vst1.8    {q8}, [r2]
>> > >     vst1.8    {q8}, [r3]
>> > >     add        r2, r3, #16
>> > >     vst1.8    {q8}, [r2]
>> > >
>> > > I investigated this issue.  The root cause lies in the rtx cost
>> > > returned by the ARM backend.  Anyway, I think this is another
>> > > issue and should be fixed in a separate patch.
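(For anyone wanting to reproduce the add-instruction issue above: a
minimal driver like the one below - my example, not anything from the
patch - is enough to see the inlined vmov/vst1 sequence at -O2 with a
NEON-enabled cpu/fpu selection.)

    /* Hypothetical driver, not from the patch: with the patch applied
       and NEON enabled, this fixed-length memset is expanded inline
       through the new "setmemsi" pattern rather than calling memset,
       which makes the generated vst1 sequence easy to inspect.  */
    #include <string.h>

    void
    clear_buf (char *p)
    {
      memset (p, 0, 64);  /* constant length <= 64, handled by arm_gen_setmem  */
    }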
Ok, looks like Charles B from Linaro has run into the same thing and has
some fixes to suggest in the costs.

>> > >
>> > > Bootstrapped and reg-tested on cortex-a15, with or without NEON
>> > > support.  Is it OK?
>> > >
>> >
>> > Some more comments inline.
>> >
>> > > Thanks,
>> > > bin
>> > >
>> > >
>> > > 2014-04-29  Bin Cheng  <bin.ch...@arm.com>
>> > >
>> > >	PR target/55701
>> > >	* config/arm/arm.md (setmem): New pattern.
>> > >	* config/arm/arm-protos.h (struct tune_params): New field.
>> > >	(arm_gen_setmem): New prototype.
>> > >	* config/arm/arm.c (arm_slowmul_tune): Initialize new field.
>> > >	(arm_fastmul_tune, arm_strongarm_tune, arm_xscale_tune): Ditto.
>> > >	(arm_9e_tune, arm_v6t2_tune, arm_cortex_tune): Ditto.
>> > >	(arm_cortex_a8_tune, arm_cortex_a7_tune): Ditto.
>> > >	(arm_cortex_a15_tune, arm_cortex_a53_tune): Ditto.
>> > >	(arm_cortex_a57_tune, arm_cortex_a5_tune): Ditto.
>> > >	(arm_cortex_a9_tune, arm_cortex_a12_tune): Ditto.
>> > >	(arm_v7m_tune, arm_v6m_tune, arm_fa726te_tune): Ditto.
>> > >	(arm_const_inline_cost): New function.
>> > >	(arm_block_set_max_insns): New function.
>> > >	(arm_block_set_straight_profit_p): New function.
>> > >	(arm_block_set_vect_profit_p): New function.
>> > >	(arm_block_set_unaligned_vect): New function.
>> > >	(arm_block_set_aligned_vect): New function.
>> > >	(arm_block_set_unaligned_straight): New function.
>> > >	(arm_block_set_aligned_straight): New function.
>> > >	(arm_block_set_vect, arm_gen_setmem): New functions.
>> > >
>> > > gcc/testsuite/ChangeLog
>> > > 2014-04-29  Bin Cheng  <bin.ch...@arm.com>
>> > >
>> > >	PR target/55701
>> > >	* gcc.target/arm/memset-inline-1.c: New test.
>> > >	* gcc.target/arm/memset-inline-2.c: New test.
>> > >	* gcc.target/arm/memset-inline-3.c: New test.
>> > >	* gcc.target/arm/memset-inline-4.c: New test.
>> > >	* gcc.target/arm/memset-inline-5.c: New test.
>> > >	* gcc.target/arm/memset-inline-6.c: New test.
>> > >	* gcc.target/arm/memset-inline-7.c: New test.
>> > >	* gcc.target/arm/memset-inline-8.c: New test.
>> > >	* gcc.target/arm/memset-inline-9.c: New test.
>> > >
>> > >
>> > > j1328-20140429.txt
>> > >
>> > >
>> > > Index: gcc/config/arm/arm.c
>> > > ===================================================================
>> > > --- gcc/config/arm/arm.c	(revision 209852)
>> > > +++ gcc/config/arm/arm.c	(working copy)
>> > > @@ -1585,10 +1585,11 @@ const struct tune_params arm_slowmul_tune =
>> > >    true,					/* Prefer constant pool.  */
>> > >    arm_default_branch_cost,
>> > >    false,				/* Prefer LDRD/STRD.  */
>> > > -  {true, true},			/* Prefer non short circuit.  */
>> > > -  &arm_default_vec_cost,		/* Vectorizer costs.  */
>> > > -  false,				/* Prefer Neon for 64-bits bitops.  */
>> > > -  false, false				/* Prefer 32-bit encodings.  */
>> > > +  {true, true},			/* Prefer non short circuit.  */
>> > > +  &arm_default_vec_cost,		/* Vectorizer costs.  */
>> > > +  false,				/* Prefer Neon for 64-bits bitops.  */
>> > > +  false, false,			/* Prefer 32-bit encodings.  */
>> > > +  false					/* Prefer Neon for stringops.  */
>> > >  };
>> > >
>> >
>> > Please make sure that all the white space before the comments is
>> > using TAB, not spaces.  Similarly for the other tables.
>>
>> Fixed.
>>
>> >
>> > > @@ -16788,6 +16806,14 @@ arm_const_double_inline_cost (rtx val)
>> > >  			      NULL_RTX, NULL_RTX, 0, 0));
>> > > }
>> > >
>> > > +/* Cost of loading a SImode constant.  */
>> > > +static inline int
>> > > +arm_const_inline_cost (rtx val)
>> > > +{
>> > > +  return arm_gen_constant (SET, SImode, NULL_RTX, INTVAL (val),
>> > > +			   NULL_RTX, NULL_RTX, 0, 0);
>> > > +}
>> > > +
>> >
>> > This could be used more widely if you passed the SET in as a parameter
>> > (there are cases in arm_new_rtx_cost that could use it, for example).
>> > Also, you want to enable sub-targets (only once you can't create new
>> > pseudos is that not safe), so the penultimate argument in the call to
>> > arm_gen_constant should be 1.
>>
>> Fixed.
>>
>> >
>> > >  /* Return true if it is worthwhile to split a 64-bit constant into two
>> > >     32-bit operations.  This is the case if optimizing for size, or
>> > >     if we have load delay slots, or if one 32-bit part can be done with
>> > > @@ -31350,6 +31383,504 @@ arm_validize_comparison (rtx *comparison, rtx * op
>> > >
>> > >  }
>> > >
>> > > +/* Maximum number of instructions to set block of memory.  */
>> > > +static int
>> > > +arm_block_set_max_insns (void)
>> > > +{
>> > > +  return (optimize_function_for_size_p (cfun) ? 4 : 8);
>> > > +}
>> >
>> > I think the non-size_p alternative should really be a parameter in
>> > the per-cpu costs table.
>>
>> Fixed.
>>
>> >
>> > > +
>> > > +/* Return TRUE if it's profitable to set block of memory for straight
>> > > +   case.  */
>> >
>> > "Straight" is confusing here.  Do you mean non-vectorized?  If so,
>> > then non_vect might be clearer.
>>
>> Fixed.
>>
>> >
>> > The arguments should really be documented (see comment below about
>> > align, for example).
>>
>> Fixed.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_straight_profit_p (rtx val,
>> > > +				 unsigned HOST_WIDE_INT length,
>> > > +				 unsigned HOST_WIDE_INT align,
>> > > +				 bool unaligned_p, bool use_strd_p)
>> > > +{
>> > > +  int num = 0;
>> > > +  /* For leftovers in bytes of 0-7, we can set the memory block using
>> > > +     strb/strh/str with minimum instruction number.  */
>> > > +  int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};
>> >
>> > This should be marked const.
>>
>> Fixed.
>>
>> >
>> > > +
>> > > +  if (unaligned_p)
>> > > +    {
>> > > +      num = arm_const_inline_cost (val);
>> > > +      num += length / align + length % align;
>> >
>> > Isn't align in bits here, when you really want it in bytes?
>>
>> All alignments are in bytes starting from pattern "setmem".
>>
>> >
>> > What if align > 4 bytes?
>>
>> Then it's the "!unaligned_p" case and handled by other arms of this if
>> statement.
>>
>> >
>> > > +    }
>> > > +  else if (use_strd_p)
>> > > +    {
>> > > +      num = arm_const_double_inline_cost (val);
>> > > +      num += (length >> 3) + leftover[length & 7];
>> > > +    }
>> > > +  else
>> > > +    {
>> > > +      num = arm_const_inline_cost (val);
>> > > +      num += (length >> 2) + leftover[length & 3];
>> > > +    }
>> > > +
>> > > +  /* We may be able to combine last pair STRH/STRB into a single STR
>> > > +     by shifting one byte back.  */
>> > > +  if (unaligned_access && length > 3 && (length & 3) == 3)
>> > > +    num--;
>> > > +
>> > > +  return (num <= arm_block_set_max_insns ());
>> > > +}
>> > > +
>> > > +/* Return TRUE if it's profitable to set block of memory for vector
>> > > +   case.  */
>> > > +static bool
>> > > +arm_block_set_vect_profit_p (unsigned HOST_WIDE_INT length,
>> > > +			     unsigned HOST_WIDE_INT align ATTRIBUTE_UNUSED,
>> > > +			     bool unaligned_p, enum machine_mode mode)
>> >
>> > I'm not sure what you mean by unaligned here.  Again, documenting the
>> > arguments might help.
>>
>> Fixed.
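To make the counting in arm_block_set_straight_profit_p above concrete,
here is a standalone sketch of its word-store branch (my reconstruction,
assuming a constant-load cost of 1; the real code gets that from
arm_const_inline_cost):

    /* Standalone sketch (not from the patch) of the word-store branch of
       arm_block_set_straight_profit_p, assuming the constant load costs 1.  */
    #include <stdio.h>

    static const int leftover[8] = {0, 1, 1, 2, 1, 2, 2, 3};

    static int
    count_insns (unsigned length, int unaligned_access)
    {
      int num = 1;				/* constant load (assumed)  */
      num += (length >> 2) + leftover[length & 3]; /* STRs plus strb/strh tail  */
      if (unaligned_access && length > 3 && (length & 3) == 3)
        num--;				/* STRH+STRB merged into one STR  */
      return num;
    }

    int
    main (void)
    {
      /* length 15: 3 STRs + STRH + STRB, minus the merge = 5 <= 8 at -O2.  */
      printf ("%d\n", count_insns (15, 1));
      return 0;
    }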
>>
>> >
>> > > +{
>> > > +  int num;
>> > > +  unsigned int nelt = GET_MODE_NUNITS (mode);
>> > > +
>> > > +  /* Num of instruction loading constant value.  */
>> >
>> > Use either "Number" or, in this case, simply drop that bit and write:
>> >   /* Instruction loading constant value.  */
>>
>> Fixed.
>>
>> >
>> > > +  num = 1;
>> > > +  /* Num of store instructions.  */
>> >
>> > Likewise.
>> >
>> > > +  num += (length + nelt - 1) / nelt;
>> > > +  /* Num of address adjusting instructions.  */
>> >
>> > Can't we work on the premise that the address adjusting instructions
>> > will be merged into the stores?  I know you said that they currently
>> > do not, but that's not a problem that this bit of code should have to
>> > worry about.
>>
>> Fixed.
>>
>> >
>> > > +  if (unaligned_p)
>> > > +    /* For unaligned case, it's one less than the store instructions.  */
>> > > +    num += (length + nelt - 1) / nelt - 1;
>> > > +  else if ((length & 3) != 0)
>> > > +    /* For aligned case, it's one if bytes leftover can only be stored
>> > > +       by mis-aligned store instruction.  */
>> > > +    num++;
>> > > +
>> > > +  /* Store the first 16 bytes using vst1:v16qi for the aligned case.  */
>> > > +  if (!unaligned_p && mode == V16QImode)
>> > > +    num--;
>> > > +
>> > > +  return (num <= arm_block_set_max_insns ());
>> > > +}
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for the
>> > > +   unaligned case.  We fill the first LENGTH bytes of the memory
>> > > +   area starting from DSTBASE with byte constant VALUE.  ALIGN is
>> > > +   the alignment requirement of memory.  */
>> >
>> > What's the return value mean?
>>
>> Documented.
>>
>> >
>> > > +static bool
>> > > +arm_block_set_unaligned_vect (rtx dstbase,
>> > > +			      unsigned HOST_WIDE_INT length,
>> > > +			      unsigned HOST_WIDE_INT value,
>> > > +			      unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  unsigned int i = 0, j = 0, nelt_v16, nelt_v8, nelt_mode;
>> >
>> > Don't mix initialized declarations with uninitialized ones on the
>> > same line.  You don't appear to use either I or J until their first
>> > use in the loop control below, so why initialize them here?
>>
>> Fixed.
>>
>> >
>> > > +  rtx dst, mem;
>> > > +  rtx val_elt, val_vec, reg;
>> > > +  rtx rval[MAX_VECT_LEN];
>> > > +  rtx (*gen_func) (rtx, rtx);
>> > > +  enum machine_mode mode;
>> > > +  unsigned HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert ((align & 0x3) != 0);
>> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
>> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
>> > > +  if (length >= nelt_v16)
>> > > +    {
>> > > +      mode = V16QImode;
>> > > +      gen_func = gen_movmisalignv16qi;
>> > > +    }
>> > > +  else
>> > > +    {
>> > > +      mode = V8QImode;
>> > > +      gen_func = gen_movmisalignv8qi;
>> > > +    }
>> > > +  nelt_mode = GET_MODE_NUNITS (mode);
>> > > +  gcc_assert (length >= nelt_mode);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_vect_profit_p (length, align, true, mode))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +  mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_elt = GEN_INT (v);
>> > > +  for (; j < nelt_mode; j++)
>> > > +    rval[j] = val_elt;
>> >
>> > Is this the first use of J?  If so, initialize it here.
>> >
>> > > +
>> > > +  reg = gen_reg_rtx (mode);
>> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
>> > > +  /* Emit instruction loading the constant value.  */
>> > > +  emit_move_insn (reg, val_vec);
>> > > +
>> > > +  /* Handle nelt_mode bytes in a vector.  */
>> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
>> >
>> > Similarly for I.
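As a sanity check on the vector-case counting in
arm_block_set_vect_profit_p as posted: for length 20 on the unaligned
path with V16QImode (nelt 16), num = 1 constant load + ceil(20/16) = 2
stores + (2 - 1) = 1 extra, i.e. 4 in total, which is within the -O2
limit of 8 (and exactly at the -Os limit of 4).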
>> >
>> > > +    {
>> > > +      emit_insn ((*gen_func) (mem, reg));
>> > > +      if (i + 2 * nelt_mode <= length)
>> > > +	emit_insn (gen_add2_insn (dst, GEN_INT (nelt_mode)));
>> > > +    }
>> > > +
>> > > +  if (i + nelt_v8 <= length)
>> > > +    gcc_assert (mode == V16QImode);
>> >
>> > Why not drop the if and write:
>> >
>> >   gcc_assert ((i + nelt_v8) > length || mode == V16QImode);
>>
>> Fixed.
>>
>> >
>> > > +
>> > > +  /* Handle (8, 16) bytes leftover.  */
>> > > +  if (i + nelt_v8 < length)
>> >
>> > Your assertion above checked <=, but here you use <.  Is that correct?
>>
>> Yes, it is.  For the "==" case, it means we have nelt_v8 bytes
>> leftover, which will be handled by the last branch of the if statement.
>>
>> >
>> > > +    {
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - i)));
>> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
>> > > +      if ((length & 1) != 0 && align >= 2)
>> > > +	set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +    }
>> > > +  /* Handle (0, 8] bytes leftover.  */
>> > > +  else if (i < length && i + nelt_v8 >= length)
>> > > +    {
>> > > +      if (mode == V16QImode)
>> > > +	{
>> > > +	  reg = gen_lowpart (V8QImode, reg);
>> > > +	  mem = adjust_automodify_address (dstbase, V8QImode, dst, 0);
>> > > +	}
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT ((length - i)
>> > > +					      + (nelt_mode - nelt_v8))));
>> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
>> > > +      if ((length & 1) != 0 && align >= 2)
>> > > +	set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for the
>> > > +   aligned case.  We fill the first LENGTH bytes of the memory area
>> > > +   starting from DSTBASE with byte constant VALUE.  ALIGN is the
>> > > +   alignment requirement of memory.  */
>> >
>> > See all the comments above for the unaligned case.
>>
>> Fixed accordingly.
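The "shifting bytes back" trick in the leftover handling above is worth
spelling out: the final vector store is moved back so it ends exactly at
LENGTH, rewriting a few bytes that were already set, which is harmless
for a memset-style fill.  A little host-side C (my illustration, not
from the patch) demonstrating the overlap:

    /* Illustration (not from the patch) of the overlapping-store trick:
       for length 13 with 8-byte vector stores, the second store starts
       at offset 5 so it ends exactly at 13, re-writing bytes 5..7.  */
    #include <assert.h>
    #include <string.h>

    static void
    set13 (unsigned char *p, unsigned char v)
    {
      unsigned char vec[8];
      memset (vec, v, 8);		/* stands in for the vector register  */
      memcpy (p, vec, 8);		/* bytes 0..7  */
      memcpy (p + 13 - 8, vec, 8);	/* bytes 5..12, overlapping 5..7  */
    }

    int
    main (void)
    {
      unsigned char buf[13];
      set13 (buf, 0xab);
      for (int i = 0; i < 13; i++)
        assert (buf[i] == 0xab);
      return 0;
    }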
>>
>> >
>> > > +static bool
>> > > +arm_block_set_aligned_vect (rtx dstbase,
>> > > +			    unsigned HOST_WIDE_INT length,
>> > > +			    unsigned HOST_WIDE_INT value,
>> > > +			    unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  unsigned int i = 0, j = 0, nelt_v8, nelt_v16, nelt_mode;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_elt, val_vec, reg;
>> > > +  rtx rval[MAX_VECT_LEN];
>> > > +  enum machine_mode mode;
>> > > +  unsigned HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert ((align & 0x3) == 0);
>> > > +  nelt_v8 = GET_MODE_NUNITS (V8QImode);
>> > > +  nelt_v16 = GET_MODE_NUNITS (V16QImode);
>> > > +  if (length >= nelt_v16 && unaligned_access && !BYTES_BIG_ENDIAN)
>> > > +    mode = V16QImode;
>> > > +  else
>> > > +    mode = V8QImode;
>> > > +
>> > > +  nelt_mode = GET_MODE_NUNITS (mode);
>> > > +  gcc_assert (length >= nelt_mode);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_vect_profit_p (length, align, false, mode))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_elt = GEN_INT (v);
>> > > +  for (; j < nelt_mode; j++)
>> > > +    rval[j] = val_elt;
>> > > +
>> > > +  reg = gen_reg_rtx (mode);
>> > > +  val_vec = gen_rtx_CONST_VECTOR (mode, gen_rtvec_v (nelt_mode, rval));
>> > > +  /* Emit instruction loading the constant value.  */
>> > > +  emit_move_insn (reg, val_vec);
>> > > +
>> > > +  /* Handle first 16 bytes specially using vst1:v16qi instruction.  */
>> > > +  if (mode == V16QImode)
>> > > +    {
>> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +      emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +      i += nelt_mode;
>> > > +      /* Handle (8, 16) bytes leftover using vst1:v16qi again.  */
>> > > +      if (i + nelt_v8 < length && i + nelt_v16 > length)
>> > > +	{
>> > > +	  emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
>> > > +	  mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +	  /* We are shifting bytes back, set the alignment accordingly.  */
>> > > +	  if ((length & 0x3) == 0)
>> > > +	    set_mem_align (mem, BITS_PER_UNIT * 4);
>> > > +	  else if ((length & 0x1) == 0)
>> > > +	    set_mem_align (mem, BITS_PER_UNIT * 2);
>> > > +	  else
>> > > +	    set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +	  emit_insn (gen_movmisalignv16qi (mem, reg));
>> > > +	  return true;
>> > > +	}
>> > > +      /* Fall through for bytes leftover.  */
>> > > +      mode = V8QImode;
>> > > +      nelt_mode = GET_MODE_NUNITS (mode);
>> > > +      reg = gen_lowpart (V8QImode, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle 8 bytes in a vector.  */
>> > > +  for (; (i + nelt_mode <= length); i += nelt_mode)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle single word leftover by shifting 4 bytes back.  We can
>> > > +     use aligned access for this case.  */
>> > > +  if (i + UNITS_PER_WORD == length)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i - UNITS_PER_WORD);
>> > > +      mem = adjust_automodify_address (dstbase, mode,
>> > > +				       addr, i - UNITS_PER_WORD);
>> > > +      /* We are shifting 4 bytes back, set the alignment accordingly.  */
>> > > +      if (align > UNITS_PER_WORD)
>> > > +	set_mem_align (mem, BITS_PER_UNIT * UNITS_PER_WORD);
>> > > +
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +  /* Handle (0, 4), (4, 8) bytes leftover by shifting bytes back.
>> > > +     We have to use unaligned access for this case.  */
>> > > +  else if (i < length)
>> > > +    {
>> > > +      emit_insn (gen_add2_insn (dst, GEN_INT (length - nelt_mode)));
>> > > +      mem = adjust_automodify_address (dstbase, mode, dst, 0);
>> > > +      /* We are shifting bytes back, set the alignment accordingly.  */
>> > > +      if ((length & 1) == 0)
>> > > +	set_mem_align (mem, BITS_PER_UNIT * 2);
>> > > +      else
>> > > +	set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      emit_insn (gen_movmisalignv8qi (mem, reg));
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using plain strh/strb instructions, only
>> > > +   using instructions allowed by ALIGN on processor.  We fill the
>> > > +   first LENGTH bytes of the memory area starting from DSTBASE
>> > > +   with byte constant VALUE.  ALIGN is the alignment requirement
>> > > +   of memory.  */
>> > > +static bool
>> > > +arm_block_set_unaligned_straight (rtx dstbase,
>> > > +				  unsigned HOST_WIDE_INT length,
>> > > +				  unsigned HOST_WIDE_INT value,
>> > > +				  unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  unsigned int i;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_exp, val_reg, reg;
>> > > +  enum machine_mode mode;
>> > > +  HOST_WIDE_INT v = value;
>> > > +
>> > > +  gcc_assert (align == 1 || align == 2);
>> > > +
>> > > +  if (align == 2)
>> > > +    v |= (value << BITS_PER_UNIT);
>> > > +
>> > > +  v = sext_hwi (v, BITS_PER_WORD);
>> > > +  val_exp = GEN_INT (v);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +					align, true, false))
>> > > +    return false;
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +  mode = (align == 2 ? HImode : QImode);
>> > > +  val_reg = force_reg (SImode, val_exp);
>> > > +  reg = gen_lowpart (mode, val_reg);
>> > > +
>> > > +  for (i = 0; (i + GET_MODE_SIZE (mode) <= length);
>> > > +       i += GET_MODE_SIZE (mode))
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, mode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  /* Handle single byte leftover.  */
>> > > +  if (i + 1 == length)
>> > > +    {
>> > > +      reg = gen_lowpart (QImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +      i++;
>> > > +    }
>> > > +
>> > > +  gcc_assert (i == length);
>> > > +  return true;
>> > > +}
>> > > +
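A quick note on the value replication that arm_block_set_aligned_straight
below performs before storing: the byte is splattered across a word, and
doubled again when STRD is used.  A tiny sketch of that step (mine, not
the patch's):

    /* Sketch (not from the patch) of the byte replication used below:
       0xab becomes 0xabababab for STR, 0xabababababababab for STRD.  */
    #include <stdint.h>
    #include <stdio.h>

    int
    main (void)
    {
      uint32_t value = 0xab;
      uint32_t v = value | (value << 8) | (value << 16) | (value << 24);
      uint64_t vv = v | ((uint64_t) v << 32);	/* doubled for STRD  */
      printf ("%08x %016llx\n", v, (unsigned long long) vv);
      return 0;
    }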
>> > > +/* Set a block of memory using plain strd/str/strh/strb instructions,
>> > > +   to permit unaligned copies on processors which support unaligned
>> > > +   semantics for those instructions.  We fill the first LENGTH bytes
>> > > +   of the memory area starting from DSTBASE with byte constant VALUE.
>> > > +   ALIGN is the alignment requirement of memory.  */
>> > > +static bool
>> > > +arm_block_set_aligned_straight (rtx dstbase,
>> > > +				unsigned HOST_WIDE_INT length,
>> > > +				unsigned HOST_WIDE_INT value,
>> > > +				unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  unsigned int i = 0;
>> > > +  rtx dst, addr, mem;
>> > > +  rtx val_exp, val_reg, reg;
>> > > +  unsigned HOST_WIDE_INT v;
>> > > +  bool use_strd_p;
>> > > +
>> > > +  use_strd_p = (length >= 2 * UNITS_PER_WORD && (align & 3) == 0
>> > > +		&& TARGET_LDRD && current_tune->prefer_ldrd_strd);
>> > > +
>> > > +  v = (value | (value << 8) | (value << 16) | (value << 24));
>> > > +  if (length < UNITS_PER_WORD)
>> > > +    v &= (0xFFFFFFFF >> (UNITS_PER_WORD - length) * BITS_PER_UNIT);
>> > > +
>> > > +  if (use_strd_p)
>> > > +    v |= (v << BITS_PER_WORD);
>> > > +  else
>> > > +    v = sext_hwi (v, BITS_PER_WORD);
>> > > +
>> > > +  val_exp = GEN_INT (v);
>> > > +  /* Skip if it isn't profitable.  */
>> > > +  if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +					align, false, use_strd_p))
>> > > +    {
>> > > +      /* Try without strd.  */
>> > > +      v = (v >> BITS_PER_WORD);
>> > > +      v = sext_hwi (v, BITS_PER_WORD);
>> > > +      val_exp = GEN_INT (v);
>> > > +      use_strd_p = false;
>> > > +      if (!arm_block_set_straight_profit_p (val_exp, length,
>> > > +					    align, false, use_strd_p))
>> > > +	return false;
>> > > +    }
>> > > +
>> > > +  dst = copy_addr_to_reg (XEXP (dstbase, 0));
>> > > +  /* Handle double words using strd if possible.  */
>> > > +  if (use_strd_p)
>> > > +    {
>> > > +      val_reg = force_reg (DImode, val_exp);
>> > > +      reg = val_reg;
>> > > +      for (; (i + 8 <= length); i += 8)
>> > > +	{
>> > > +	  addr = plus_constant (Pmode, dst, i);
>> > > +	  mem = adjust_automodify_address (dstbase, DImode, addr, i);
>> > > +	  emit_move_insn (mem, reg);
>> > > +	}
>> > > +    }
>> > > +  else
>> > > +    val_reg = force_reg (SImode, val_exp);
>> > > +
>> > > +  /* Handle words.  */
>> > > +  reg = (use_strd_p ? gen_lowpart (SImode, val_reg) : val_reg);
>> > > +  for (; (i + 4 <= length); i += 4)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i);
>> > > +      if ((align & 3) == 0)
>> > > +	emit_move_insn (mem, reg);
>> > > +      else
>> > > +	emit_insn (gen_unaligned_storesi (mem, reg));
>> > > +    }
>> > > +
>> > > +  /* Merge last pair of STRH and STRB into a STR if possible.  */
>> > > +  if (unaligned_access && i > 0 && (i + 3) == length)
>> > > +    {
>> > > +      addr = plus_constant (Pmode, dst, i - 1);
>> > > +      mem = adjust_automodify_address (dstbase, SImode, addr, i - 1);
>> > > +      /* We are shifting one byte back, set the alignment accordingly.  */
>> > > +      if ((align & 1) == 0)
>> > > +	set_mem_align (mem, BITS_PER_UNIT);
>> > > +
>> > > +      /* Most likely this is an unaligned access, and we can't tell at
>> > > +	 compilation time.  */
>> > > +      emit_insn (gen_unaligned_storesi (mem, reg));
>> > > +      return true;
>> > > +    }
>> > > +
>> > > +  /* Handle half word leftover.  */
>> > > +  if (i + 2 <= length)
>> > > +    {
>> > > +      reg = gen_lowpart (HImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, HImode, addr, i);
>> > > +      if ((align & 1) == 0)
>> > > +	emit_move_insn (mem, reg);
>> > > +      else
>> > > +	emit_insn (gen_unaligned_storehi (mem, reg));
>> > > +
>> > > +      i += 2;
>> > > +    }
>> > > +
>> > > +  /* Handle single byte leftover.  */
>> > > +  if (i + 1 == length)
>> > > +    {
>> > > +      reg = gen_lowpart (QImode, val_reg);
>> > > +      addr = plus_constant (Pmode, dst, i);
>> > > +      mem = adjust_automodify_address (dstbase, QImode, addr, i);
>> > > +      emit_move_insn (mem, reg);
>> > > +    }
>> > > +
>> > > +  return true;
>> > > +}
>> > > +
>> > > +/* Set a block of memory using vectorization instructions for both
>> > > +   aligned and unaligned cases.  We fill the first LENGTH bytes of
>> > > +   the memory area starting from DSTBASE with byte constant VALUE.
>> > > +   ALIGN is the alignment requirement of memory.  */
>> > > +static bool
>> > > +arm_block_set_vect (rtx dstbase,
>> > > +		    unsigned HOST_WIDE_INT length,
>> > > +		    unsigned HOST_WIDE_INT value,
>> > > +		    unsigned HOST_WIDE_INT align)
>> > > +{
>> > > +  /* Check whether we need to use unaligned store instruction.  */
>> > > +  if (((align & 3) != 0 || (length & 3) != 0)
>> > > +      /* Check whether unaligned store instruction is available.  */
>> > > +      && (!unaligned_access || BYTES_BIG_ENDIAN))
>> > > +    return false;
>> >
>> > Huh!  vst1.8 can work for unaligned accesses even when hw alignment
>> > checking is strict.
>>
>> Hmm, all movmisalign patterns are guarded by "!BYTES_BIG_ENDIAN &&
>> unaligned_access", so vst1.8 instructions can't be recognized this way
>> at the moment.  I agree that it's too strict, but that's another
>> problem, I think.

That was introduced to "fix up" the issue with another test IIRC.  That's
probably not related to this particular patch.  It's the other thread
that's been ongoing with Maciej, so let's continue it there.
>>
>> >
>> > > +
>> > > +  if ((align & 3) == 0)
>> > > +    return arm_block_set_aligned_vect (dstbase, length, value, align);
>> > > +  else
>> > > +    return arm_block_set_unaligned_vect (dstbase, length, value, align);
>> > > +}
>> > > +
>> > > +/* Expand string store operation.  Firstly we try to do that by using
>> > > +   vectorization instructions, then try with ARM unaligned access and
>> > > +   double-word store if profitable.  OPERANDS[0] is the destination,
>> > > +   OPERANDS[1] is the number of bytes, OPERANDS[2] is the value to
>> > > +   initialize the memory, OPERANDS[3] is the known alignment of the
>> > > +   destination.  */
>> > > +bool
>> > > +arm_gen_setmem (rtx *operands)
>> > > +{
>> > > +  rtx dstbase = operands[0];
>> > > +  unsigned HOST_WIDE_INT length;
>> > > +  unsigned HOST_WIDE_INT value;
>> > > +  unsigned HOST_WIDE_INT align;
>> > > +
>> > > +  if (!CONST_INT_P (operands[2]) || !CONST_INT_P (operands[1]))
>> > > +    return false;
>> > > +
>> > > +  length = UINTVAL (operands[1]);
>> > > +  if (length > 64)
>> > > +    return false;
>> > > +
>> > > +  value = (UINTVAL (operands[2]) & 0xFF);
>> > > +  align = UINTVAL (operands[3]);
>> > > +  if (TARGET_NEON && length >= 8
>> > > +      && current_tune->string_ops_prefer_neon
>> > > +      && arm_block_set_vect (dstbase, length, value, align))
>> > > +    return true;
>> > > +
>> > > +  if (!unaligned_access && (align & 3) != 0)
>> > > +    return arm_block_set_unaligned_straight (dstbase, length,
>> > > +					     value, align);
>> > > +
>> > > +  return arm_block_set_aligned_straight (dstbase, length, value, align);
>> > > +}
>> > > +
>> > >  /* Implement the TARGET_ASAN_SHADOW_OFFSET hook.  */
>> > >
>> > >  static unsigned HOST_WIDE_INT
>> > > Index: gcc/config/arm/arm-protos.h
>> > > ===================================================================
>> > > --- gcc/config/arm/arm-protos.h	(revision 209852)
>> > > +++ gcc/config/arm/arm-protos.h	(working copy)
>> > > @@ -277,6 +277,8 @@ struct tune_params
>> > >    /* Prefer 32-bit encoding instead of 16-bit encoding where subset of flags
>> > >       would be set.  */
>> > >    bool disparage_partial_flag_setting_t16_encodings;
>> > > +  /* Prefer to inline string operations like memset by using Neon.  */
>> > > +  bool string_ops_prefer_neon;
>> > >  };
>> > >
>> > >  extern const struct tune_params *current_tune;
>> > > @@ -289,6 +291,7 @@ extern void arm_emit_coreregs_64bit_shift (enum rt
>> > >  extern bool arm_validize_comparison (rtx *, rtx *, rtx *);
>> > >  #endif /* RTX_CODE */
>> > >
>> > > +extern bool arm_gen_setmem (rtx *);
>> > >  extern void arm_expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel);
>> > >  extern bool arm_expand_vec_perm_const (rtx target, rtx op0, rtx op1, rtx sel);
>> > >
>> > > Index: gcc/config/arm/arm.md
>> > > ===================================================================
>> > > --- gcc/config/arm/arm.md	(revision 209852)
>> > > +++ gcc/config/arm/arm.md	(working copy)
>> > > @@ -7555,6 +7555,20 @@
>> > >  })
>> > >
>> > >
>> > > +(define_expand "setmemsi"
>> > > +  [(match_operand:BLK 0 "general_operand" "")
>> > > +   (match_operand:SI 1 "const_int_operand" "")
>> > > +   (match_operand:SI 2 "const_int_operand" "")
>> > > +   (match_operand:SI 3 "const_int_operand" "")]
>> > > +  "TARGET_32BIT"
>> > > +{
>> > > +  if (arm_gen_setmem (operands))
>> > > +    DONE;
>> > > +
>> > > +  FAIL;
>> > > +})
>> > > +
>> > > +
>> > >  ;; Move a block of memory if it is word aligned and MORE than 2 words long.
>> > >  ;; We could let this apply for blocks of less than this, but it clobbers so
>> > >  ;; many registers that there is then probably a better way.
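On the tests: the memset-inline-*.c contents aren't quoted in this
thread, but for readers following along, each presumably looks something
like the sketch below (mine; the body, options, and scan directive are
made up for illustration).

    /* Hypothetical sketch of a memset-inline test, not the actual file.  */
    /* { dg-do compile } */
    /* { dg-options "-O2" } */
    #include <string.h>

    void
    foo (char *p)
    {
      memset (p, 1, 14);	/* small constant length: should be inlined  */
    }

    /* { dg-final { scan-assembler-not "memset" } } */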
>> > > Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
>> > > ===================================================================
>> > > --- gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
>> > > +++ gcc/testsuite/gcc.target/arm/memset-inline-6.c	(revision 0)
>> >
>> > Have you tested these when the compiler was configured with
>> > "--with-cpu=cortex-a9"?
>>
>> Here is the tricky part.
>> For a compiler configured with "--with-tune=cortex-a9", the
>> NEON-related cases (4/5/6/8/9) would fail because we have no way to
>> determine that we are compiling with the cortex-a9 tune here.  There is
>> no easy way of fixing this.
>> For a compiler configured with "--with-cpu=cortex-a9", the test cases
>> would pass, but I think this is a mistake.  It reveals an issue: GCC
>> won't pass "-mcpu=cortex-a9" to cc1, resulting in the cortex-a8 tune
>> being selected.  It just makes no sense.
>> With these issues, I didn't change the tests for now.

I think we'll just have to take the hit on noise in other configurations.

> Precisely, I configured gcc with options "--with-arch=armv7-a
> --with-cpu|--with-tune=cortex-a9".
> I read the gcc documents and realized that "-mcpu" is ignored when
> "-march" is specified.  I don't know why gcc acts in this manner, but it
> leads to inconsistent configuration/command-line behavior.
> If we configure GCC with "--with-arch=armv7-a --with-cpu=cortex-a9",
> then only "-march=armv7-a" is passed to cc1.

That kind of configuration is warned about today, but no one pays
attention to that.  See James's proposal to promote this to an error.

> If we compile with "-march=armv7-a -mcpu=cortex-a9", then gcc works fine
> and passes "-march=armv7-a -mcpu=cortex-a9" to cc1.

That should be fine.

> Even more weirdly, cc1 warns that "switch -mcpu=cortex-m4 conflicts with
> -march=armv7-m switch".

This is OK unless there are objections in the next 24 hours.  Please
watch out for any fallout.  Certainly rebase and retest before applying,
and please post the rebased version for archival purposes.

regards
Ramana

> Thanks,
> bin