Re: [SVE] [fwprop] PR88833 - Redundant moves for WHILELO-based loops

Richard Sandiford Wed, 26 Jun 2019 11:16:36 -0700

Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> On Wed, 26 Jun 2019 at 16:05, Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> > On Tue, 25 Jun 2019 at 20:05, Richard Sandiford
>> > <richard.sandif...@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> >> > On Mon, 24 Jun 2019 at 21:41, Prathamesh Kulkarni
>> >> > <prathamesh.kulka...@linaro.org> wrote:
>> >> >>
>> >> >> On Mon, 24 Jun 2019 at 19:51, Richard Sandiford
>> >> >> <richard.sandif...@arm.com> wrote:
>> >> >> >
>> >> >> > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
>> >> >> > > @@ -1415,6 +1460,19 @@ forward_propagate_into (df_ref use)
>> >> >> > >    if (!def_set)
>> >> >> > >      return false;
>> >> >> > >
>> >> >> > > +  if (reg_prop_only
>> >> >> > > +      && !REG_P (SET_SRC (def_set))
>> >> >> > > +      && !REG_P (SET_DEST (def_set)))
>> >> >> > > +    return false;
>> >> >> >
>> >> >> > This should be:
>> >> >> >
>> >> >> >   if (reg_prop_only
>> >> >> >       && (!REG_P (SET_SRC (def_set)) || !REG_P (SET_DEST (def_set))))
>> >> >> >     return false;
>> >> >> >
>> >> >> > so that we return false if either operand isn't a register.
>> >> >> Oops, sorry about that  -:(
>> >> >> >
>> >> >> > > +
>> >> >> > > +  /* Allow propagations into a loop only for reg-to-reg copies, 
>> >> >> > > since
>> >> >> > > +     replacing one register by another shouldn't increase the 
>> >> >> > > cost.  */
>> >> >> > > +
>> >> >> > > +  if (DF_REF_BB (def)->loop_father != DF_REF_BB (use)->loop_father
>> >> >> > > +      && !REG_P (SET_SRC (def_set))
>> >> >> > > +      && !REG_P (SET_DEST (def_set)))
>> >> >> > > +    return false;
>> >> >> >
>> >> >> > Same here.
>> >> >> >
>> >> >> > OK with that change, thanks.
>> >> >> Thanks for the review, will make the changes and commit the patch
>> >> >> after re-testing.
>> >> > Hi,
>> >> > Testing the patch showed following failures on 32-bit x86:
>> >> >
>> >> >   Executed from: g++.target/i386/i386.exp
>> >> >     g++:g++.target/i386/pr88152.C   scan-assembler-not 
>> >> > vpcmpgt|vpcmpeq|vpsra
>> >> >   Executed from: gcc.target/i386/i386.exp
>> >> >     gcc:gcc.target/i386/pr66768.c scan-assembler add*.[ \t]%gs:
>> >> >     gcc:gcc.target/i386/pr90178.c scan-assembler-times xorl[\\t
>> >> > ]*\\%eax,[\\t ]*%eax 1
>> >> >
>> >> > The failure of pr88152.C is also seen on x86_64.
>> >> >
>> >> > For pr66768.c, and pr90178.c, forwprop replaces register which is
>> >> > volatile and frame related respectively.
>> >> > To avoid that, the attached patch, makes a stronger constraint that
>> >> > src and dest should be a register
>> >> > and not have frame_related or volatil flags set, which is checked in
>> >> > usable_reg_p().
>> >> > Which avoids the failures for both the cases.
>> >> > Does it look OK ?
>> >>
>> >> That's not the reason it's a bad transform.  In both cases we're
>> >> propagating r2 <- r1 even though
>> >>
>> >> (a) r1 dies in the copy and
>> >> (b) fwprop can't replace all uses of r2, because some have multiple
>> >>     definitions
>> >>
>> >> This has the effect of making both values live in cases where only one
>> >> was previously.
>> >>
>> >> In the case of pr66768.c, fwprop2 is undoing the effect of
>> >> cse.c:canon_reg, which tries to pick the best register to use
>> >> (see cse.c:make_regs_eqv).  fwprop1 makes the same change,
>> >> and made it even before the patch, but the cse.c choice should win.
>> >>
>> >> A (hopefully) conservative fix would be to propagate the copy only if
>> >> both registers have a single definition, which you can test with:
>> >>
>> >>   (DF_REG_DEF_COUNT (regno) == 1
>> >>    && !bitmap_bit_p (DF_LR_OUT (ENTRY_BLOCK_PTR_FOR_FN (m_fn)), regno))
>> >>
>> >> In that case, fwprop should see all uses of the destination, and should
>> >> be able to replace it in all cases with the source.
>> > Ah I see, thanks for the explanation!
>> > The above check works to resolve the failure.
>> > IIUC, !bitmap_bit_p (...) above checks that reg isn't used uninitialized ?
>>
>> Right.
>>
>> >> > For g++.target/i386/pr88152.C, the issue is that after the patch,
>> >> > forwprop1 does following propagation (in f10) which wasn't done
>> >> > before:
>> >> >
>> >> > In insn 10, replacing
>> >> >  (unspec:SI [
>> >> >             (reg:V2DF 91)
>> >> >         ] UNSPEC_MOVMSK)
>> >> >  with (unspec:SI [
>> >> >             (subreg:V2DF (reg:V2DI 90) 0)
>> >> >         ] UNSPEC_MOVMSK)
>> >> >
>> >> > This later defeats combine because insn 9 gets deleted.
>> >> > Without patch, the following combination takes place:
>> >> >
>> >> > Trying 7 -> 9:
>> >> >     7: r90:V2DI=r89:V2DI>r93:V2DI
>> >> >       REG_DEAD r93:V2DI
>> >> >       REG_DEAD r89:V2DI
>> >> >     9: r91:V2DF=r90:V2DI#0
>> >> >       REG_DEAD r90:V2DI
>> >> > Successfully matched this instruction:
>> >> > (set (subreg:V2DI (reg:V2DF 91) 0)
>> >> >     (gt:V2DI (reg:V2DI 89)
>> >> >         (reg:V2DI 93)))
>> >> > allowing combination of insns 7 and 9
>> >> >
>> >> > and then:
>> >> > Trying 6, 9 -> 10:
>> >> >     6: r89:V2DI=const_vector
>> >> >     9: r91:V2DF#0=r89:V2DI>r93:V2DI
>> >> >       REG_DEAD r89:V2DI
>> >> >       REG_DEAD r93:V2DI
>> >> >    10: r87:SI=unspec[r91:V2DF] 43
>> >> >       REG_DEAD r91:V2DF
>> >> > Successfully matched this instruction:
>> >> > (set (reg:SI 87)
>> >> >     (unspec:SI [
>> >> >             (lt:V2DF (reg:V2DI 93)
>> >> >                 (const_vector:V2DI [
>> >> >                         (const_int 0 [0]) repeated x2
>> >> >                     ]))
>> >> >         ] UNSPEC_MOVMSK))
>> >>
>> >> Eh?  lt:*V2DF*?  Does that mean that it's 0 for false and an all-1 NaN
>> >> for true?
>> > Well looking at .optimized dump:
>> >
>> >   vector(2) long int _2;
>> >   vector(2) double _3;
>> >   int _6;
>> >   signed long _7;
>> >   long int _8;
>> >   signed long _9;
>> >   long int _10;
>> >
>> >   <bb 2> [local count: 1073741824]:
>> >   _7 = BIT_FIELD_REF <a_4(D), 64, 0>;
>> >   _8 = _7 < 0 ? -1 : 0;
>> >   _9 = BIT_FIELD_REF <a_4(D), 64, 64>;
>> >   _10 = _9 < 0 ? -1 : 0;
>> >   _2 = {_8, _10};
>> >   _3 = VIEW_CONVERT_EXPR<__m128d>(_2);
>> >   _6 = __builtin_ia32_movmskpd (_3); [tail call]
>> >   return _6;
>> >
>> > So AFAIU, we're using -1 for representing true and 0 for false
>> > and casting -1 (literally) to double would change it's value to -NaN ?
>>
>> Exactly.  And for -ffinite-math-only, we might as well then fold the
>> condition to false. :-)
>>
>> IMO rtl condition results have to have integral modes and not folding
>> (subreg:V2DF (lt:V2DI ...) 0) is the correct behaviour.  So I think this
>> is just a latent bug and shouldn't hold up the patch.
>>
>> I'm not sure whether:
>>
>>    reinterpret_cast<__m128d> (a > ...)
>>
>> is something we expect users to write, or whether it was just
>> included for completeness.
> I will report the issue and commit after re-testing patch.
> Thanks a lot for the helpful reviews!


Since it seems FP comparison results are OK, I guess it needs to be
fixed after all.  I think the problem is that the simplification
is only done by gen_lowpart_for_combine:

  /* If X is a comparison operator, rewrite it in a new mode.  This
     probably won't match, but may allow further simplifications.  */
  else if (COMPARISON_P (x))
    return gen_rtx_fmt_ee (GET_CODE (x), omode, XEXP (x, 0), XEXP (x, 1));

and triggers via expand_field_assignment when the subreg is on the lhs
of the SET.  For it to work for a general subreg on the rhs, it probably
needs to be moved from here to simplify_subreg.

We'll need to be careful about the conditions under which it
happens though.  The above doesn't for example check whether
the new mode has the same number of elements as the original,
so could generate things like:

   (gt:V4SF (reg:V2DI X) (reg:V2DI Y))

for:

   (set (reg:V2DI res) (gt:V2DI (reg:V2DI X) (reg:V2DI Y)))
   (subreg:V4SF (reg:V2DI res) 0)

I think most code would expect the result of a comparison to have the
same number of elements as the inputs and could ICE if they don't,
so I don't think it's up to the target to decide whether mismatches
are OK.  (But there again, I thought that last time. :-))

We'd also need to check the element sizes, since e.g. a lowpart
(subreg:V2HI ...) of a V2SI comparison isn't the same as a V2HI
comparison.

Thanks,
Richard

Re: [SVE] [fwprop] PR88833 - Redundant moves for WHILELO-based loops

Reply via email to