On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI <guih...@linux.ibm.com> wrote:
>
> Hi,
>   Add Hongtao Liu as the patch affects x86.
>
> On 2024/8/20 6:32, Richard Sandiford wrote:
> > HAO CHEN GUI <guih...@linux.ibm.com> writes:
> >> Hi,
> >>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> >> vec_duplicate check handles duplicates of non-constant inputs, but 0 is a
> >> constant. So even if a platform doesn't support vec_duplicate, it can
> >> still do clear by pieces if it supports a const0 move in that mode.
> >>
> >>   Compared to the previous version, the main change is to set up a
> >> new function to generate const0 for certain modes and use the function
> >> as by_pieces_constfn for CLEAR_BY_PIECES.
> >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
> >>
> >>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> >> regressions.
> >>
> >>   On i386, it caused several regressions. One issue is that the predicate
> >> of the V16QI move expander doesn't include const0, so V16QI mode can't be
> >> used for clear by pieces with the patch. The second issue is that const0
> >> is now passed directly to the move expander; originally it was forced
> >> into a pseudo, and i386 could leverage the previous data to do
> >> optimization.
> >
> > The patch looks good to me, but I suppose we'll need to decide what
> > to do about x86.
> >
> > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > source, especially since ix86_expand_vector_mode does have code to
> > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > predicates anyway, so the predicate will have little effect.
> >
> > A workaround would be to check legitimate_constant_p instead of the
> > predicate, but I'm not sure that that should be necessary.
> >
> > Has this already been discussed?  If not, we should loop in the x86
> > maintainers (but I didn't do that here in case it would be a repeat).
>
> I also noticed it. I'm not sure why movv16qi requires a
> nonimmediate_operand while ix86_expand_vector_mode can deal with a
> constant operand. Looking forward to Hongtao's comments.
The code has been there since 2005, before I was involved.
It looks to me that, at the beginning, both mov<mode> and
*mov<mode>_internal only supported nonimmediate_operand for
operands[1].
Then r0-75606-g5656a184e83983 adjusted nonimmediate_operand to
nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
mov<mode>.
I think we can align the predicate between mov<mode> and *mov<mode>_internal.
I'll do some tests and get back to you.
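
For reference, the workaround mentioned above, checking legitimate_constant_p
instead of the expander predicate, could be sketched roughly like this (an
illustration only, not part of the posted patch; the helper name is made up):

/* Hypothetical helper: ask the target whether the zero constant of MODE
   is a legitimate constant, instead of consulting the mov<mode> expander's
   operand predicate.  targetm.legitimate_constant_p is the existing
   target hook.  */
static bool
clear_by_pieces_const0_ok_p (fixed_size_mode mode)
{
  return targetm.legitimate_constant_p (mode, CONST0_RTX (mode));
}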
>
> >
> > As far as the second issue goes, I suppose there are at least three
> > ways of handling shared constants:
> >
> > (1) Force the zero into a register and leave later optimisations to
> >     propagate the zero where profitable.
> The zero can be propagated into the store, but the address adjustment
> may not be combined into the insn properly. For instance, if zero is
> forced into a register, a "movv2x8qi" insn is generated. The address
> adjustment becomes a separate insn, as the "movv2x8qi" insn doesn't
> support d-form addresses. When zero is propagated, it converts the
> "movv2x8qi" to a "movti", and "movti" supports d-form as well as
> post-inc addresses. The auto_inc_dec pass then probably combines the
> address adjustment insn into the previous "movti" to generate a
> post-inc "movti". The expected optimization would be to combine the
> address adjustment insn into the second "movti" and generate a d-form
> "movti". It's a regression issue I found on aarch64.
>
> Also, we check whether const0 is supported by the mov optab, but in the
> end we force the const0 into a register and generate a store from that
> register. That doesn't seem reasonable.
>
> >
> > (2) Emit stores of zero and expect a later pass to share constants
> >     where beneficial.
> Not sure which pass can optimize it.
>
> >
> > (3) Generate stores of zero and leave the target expanders to force
> >     constants into registers on the fly if reuse seems plausibly
> >     beneficial.
> >
> The constant zeros in different modes are not related to each other.
> I'm not sure which pass can optimize it. The compiler would have to be
> taught that reg 102 can be expressed as a subreg of reg 100.
>
> (insn 6 5 7 2 (set (reg:V32QI 100)
>         (const_vector:V32QI [
>                 (const_int 0 [0]) repeated x32
>             ]))
>
> (insn 8 7 0 2 (set (reg:V16QI 102)
>         (const_vector:V16QI [
>                 (const_int 0 [0]) repeated x16
>             ]))
>
That's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080; I have an
experimental patch which tries to eliminate that redundancy in cse.
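
A hypothetical source-level reproducer for the two unrelated zero constants
quoted above (function and parameter names are made up) would be two separate
fixed-size clears, e.g.:

/* Hypothetical reproducer: two independent clears that expand into the
   V32QI and V16QI zero stores shown in the RTL above, each with its own
   zero pseudo.  */
void
clear_two (char *a, char *b)
{
  __builtin_memset (a, 0, 32);  /* 32-byte clear, V32QI zero.  */
  __builtin_memset (b, 0, 16);  /* 16-byte clear, V16QI zero.  */
}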

> I tested a case with one 32-byte and one 16-byte memory clear on x86.
> These two sets can't be optimized. They can be optimized only when they
> are in the same memory clear operation (for example, a 48-byte memory
> clear).
>
> Thanks
> Gui Haochen
>
> > where (3) is a middle ground between (1) and (2).
> >
> > Thanks,
> > Richard
> >
> >>   The patch also causes several regressions on aarch64. V2x8QImode
> >> replaces TImode for 16-byte clear by pieces, as the V2x8QImode move
> >> expander supports const0 and a vector mode is preferable. I drafted a
> >> patch to address the issue; it will be sent for review in a separate
> >> email. Another problem is that V8QImode replaces DImode for 8-byte clear
> >> by pieces. It seems to produce a different sequence of instructions, but
> >> the actual instructions are the same.
> >>
> >> Thanks
> >> Gui Haochen
> >>
> >> ChangeLog
> >> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
> >>
> >> vec_duplicate handles duplicates of non-constant inputs, but 0 is a
> >> constant.  So even if a platform doesn't support vec_duplicate, it can
> >> still do clear by pieces if it supports a const0 move.  This patch adds
> >> that check.
> >>
> >> gcc/
> >>      * expr.cc (by_pieces_mode_supported_p): Add const0 move checking
> >>      for CLEAR_BY_PIECES.
> >>      (set_zero): New.
> >>      (clear_by_pieces): Pass set_zero as by_pieces_constfn.
> >>
> >> patch.diff
> >> diff --git a/gcc/expr.cc b/gcc/expr.cc
> >> index ffbac513692..7199e0956f8 100644
> >> --- a/gcc/expr.cc
> >> +++ b/gcc/expr.cc
> >> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
> >>  static bool
> >>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
> >>  {
> >> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
> >> +  enum insn_code icode = optab_handler (mov_optab, mode);
> >> +  if (icode == CODE_FOR_nothing)
> >>      return false;
> >>
> >> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> >> +  if (op == SET_BY_PIECES
> >>        && VECTOR_MODE_P (mode)
> >>        && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
> >>      return false;
> >>
> >> +  if (op == CLEAR_BY_PIECES
> >> +      && VECTOR_MODE_P (mode)
> >> +      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
> >> +   return false;
> >> +
> >>    if (op == COMPARE_BY_PIECES
> >>        && !can_compare_p (EQ, mode, ccp_jump))
> >>      return false;
> >> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
> >>      return to;
> >>  }
> >>
> >> +static rtx
> >> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
> >> +{
> >> +  return CONST0_RTX (mode);
> >> +}
> >> +
> >>  void
> >>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
> >>  {
> >>    if (len == 0)
> >>      return;
> >>
> >> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
> >> -  char c = 0;
> >> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
> >> -                      CLEAR_BY_PIECES);
> >> +  /* Use set_zero to generate const0 of a certain mode.  */
> >> +  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
> >>    data.run ();
> >>  }
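
As an illustration of what the patch enables (not part of the patch itself,
and the function name is made up), a small fixed-size memset to zero is the
kind of call that goes through clear_by_pieces; with the new check, a target
that lacks vec_duplicate but accepts a zero immediate in its vector move
expander can still use a vector mode for it:

/* Illustrative only: a 16-byte clear that clear_by_pieces may now expand
   with a single vector const0 store on such a target.  */
void
zero16 (char *p)
{
  __builtin_memset (p, 0, 16);
}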



-- 
BR,
Hongtao
