Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

Hongtao Liu Tue, 20 Aug 2024 20:08:38 -0700

On Tue, Aug 20, 2024 at 2:50 PM Hongtao Liu <crazy...@gmail.com> wrote:
>
> On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI <guih...@linux.ibm.com> wrote:
> >
> > Hi,
> >   Add Hongtao Liu as the patch affects x86.
> >
> > 在 2024/8/20 6:32, Richard Sandiford 写道:
> > > HAO CHEN GUI <guih...@linux.ibm.com> writes:
> > >> Hi,
> > >>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> > >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> > >> constant. So even a platform doesn't support vec_duplicate, it could
> > >> still do clear by pieces if it supports const0 move by that mode.
> > >>
> > >>   Compared to the previous version, the main change is to set up a
> > >> new function to generate const0 for certain modes and use the function
> > >> as by_pieces_constfn for CLEAR_BY_PIECES.
> > >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
> > >>
> > >>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> > >> regressions.
> > >>
> > >>   On i386, it got several regressions. One issue is the predicate of
> > >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
> > >> for clear by pieces with the patch. The second issue is the const0 is
> > >> passed directly to the move expand with the patch. Originally it is
> > >> forced to a pseudo and i386 can leverage the previous data to do
> > >> optimization.
> > >
> > > The patch looks good to me, but I suppose we'll need to decide what
> > > to do about x86.
> > >
> > > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > > source, especially since ix86_expand_vector_mode does have code to
> > > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > > predicates anyway, so the predicate will have little effect.
> > >
> > > A workaround would be to check legitimate_constant_p instead of the
> > > predicate, but I'm not sure that that should be necessary.
> > >
> > > Has this already been discussed?  If not, we should loop in the x86
> > > maintainers (but I didn't do that here in case it would be a repeat).
> >
> > I also noticed it. Not sure why movv16qi requires a
> > nonimmediate_operand, while ix86_expand_vector_mode could deal with
> > constant op. Looking forward to Hongtao's comments.
> The code has been there since 2005 before I'm involved.
>  It looks to me at the beginning both mov<mode> and
> *mov<mode>_internal only support nonimmediate_operand for the
> operands[1].
> And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
> nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
> mov<mode>.
> I think we can align the predicate between mov<mode> and *mov<mode>_internal.
> I'll do some tests and reach back to you.
r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
rebase your patch and see if there's still the regressions.
> >
> > >
> > > As far as the second issue goes, I suppose there are at least three
> > > ways of handling shared constants:
> > >
> > > (1) Force the zero into a register and leave later optimisations to
> > >     propagate the zero where profitable.
> > The zero can be propagated into the store, but the address adjustment
> > may not be combined into insn properly. For instance, if zero is
> > forced to a register, "movv2x8qi" insn is generated. The address
> > adjustment becomes a separate insn as "movv2x8qi" insn doesn't support
> > d-from address. When zero is propagated, it converts "movv2x8qi" to
> > "movti". "movti" supports d-from as well as post/inc address. Probably,
> > the auto_inc_dec pass combines address adjustment insn into previous
> > "movti" to generate a post inc "movti". The expected optimization might
> > be to combine address adjustment insn into second "movit" and generate a
> > d-form "movti". It's a regression issue I found in aarch64.
> >
> > Also we checks if const0 is supported for mov optab. But finally we
> > force the const0 to a register and generate a store with the register.
> > Seems it's not reasonable.
> >
> > >
> > > (2) Emit stores of zero and expect a later pass to share constants
> > >     where beneficial.
> > Not sure which pass can optimize it.
> >
> > >
> > > (3) Generate stores of zero and leave the target expanders to force
> > >     constants into registers on the fly if reuse seems plausibly
> > >     beneficial.
> > >
> > The constant zero with different modes are not relevant. Not sure
> > which pass can optimize it. The compiler should be taught that
> > reg 102 can be expressed as a subreg of reg 100.
> >
> > (insn 6 5 7 2 (set (reg:V32QI 100)
> >         (const_vector:V32QI [
> >                 (const_int 0 [0]) repeated x32
> >             ]))
> >
> > (insn 8 7 0 2 (set (reg:V16QI 102)
> >         (const_vector:V16QI [
> >                 (const_int 0 [0]) repeated x16
> >             ]))
> >
> It's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080, I have some
> experimental patch which tries to eliminate those redundancy in cse.
>
> > I tested a case with one 32-byte and one 16-byte memory clear on x86.
> > These two sets can't be optimized. They can be optimized only when they
> > are in the same memory clear operation (for example, a 48-byte memory
> > clear).
> >
> > Thanks
> > Gui Haochen
> >
> > > where (3) is a middle ground between (1) and (2).
> > >
> > > Thanks,
> > > Richard
> > >
> > >>   The patch also raises several regressions on aarch64. The V2x8QImode
> > >> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
> > >> supports const0 and vector mode is preferable. I drafted a patch to
> > >> address the issue. It will be sent for review in a separate email.
> > >> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
> > >> It seems cause different sequences of instructions but the actually
> > >> instructions are the same.
> > >>
> > >> Thanks
> > >> Gui Haochen
> > >>
> > >> ChangeLog
> > >> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
> > >>
> > >> vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
> > >> constant.  So even a platform doesn't support vec_duplicate, it could
> > >> still do clear by pieces if it supports const0 move.  This patch adds
> > >> the checking.
> > >>
> > >> gcc/
> > >>      * expr.cc (by_pieces_mode_supported_p): Add const0 move checking
> > >>      for CLEAR_BY_PIECES.
> > >>      (set_zero): New.
> > >>      (clear_by_pieces): Pass set_zero as by_pieces_constfn.
> > >>
> > >> patch.diff
> > >> diff --git a/gcc/expr.cc b/gcc/expr.cc
> > >> index ffbac513692..7199e0956f8 100644
> > >> --- a/gcc/expr.cc
> > >> +++ b/gcc/expr.cc
> > >> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
> > >>  static bool
> > >>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation 
> > >> op)
> > >>  {
> > >> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
> > >> +  enum insn_code icode = optab_handler (mov_optab, mode);
> > >> +  if (icode == CODE_FOR_nothing)
> > >>      return false;
> > >>
> > >> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> > >> +  if (op == SET_BY_PIECES
> > >>        && VECTOR_MODE_P (mode)
> > >>        && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
> > >>      return false;
> > >>
> > >> +  if (op == CLEAR_BY_PIECES
> > >> +      && VECTOR_MODE_P (mode)
> > >> +      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
> > >> +   return false;
> > >> +
> > >>    if (op == COMPARE_BY_PIECES
> > >>        && !can_compare_p (EQ, mode, ccp_jump))
> > >>      return false;
> > >> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT 
> > >> len,
> > >>      return to;
> > >>  }
> > >>
> > >> +static rtx
> > >> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
> > >> +{
> > >> +  return CONST0_RTX (mode);
> > >> +}
> > >> +
> > >>  void
> > >>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
> > >>  {
> > >>    if (len == 0)
> > >>      return;
> > >>
> > >> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
> > >> -  char c = 0;
> > >> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
> > >> -                      CLEAR_BY_PIECES);
> > >> +  /* Use set_zero to generate const0 of centain mode.  */
> > >> +  store_by_pieces_d data (to, set_zero, NULL, len, align, 
> > >> CLEAR_BY_PIECES);
> > >>    data.run ();
> > >>  }
>
>
>
> --
> BR,
> Hongtao




-- 
BR,
Hongtao

Re: [PATCHv4, expand] Add const0 move checking for CLEAR_BY_PIECES optabs

Reply via email to