On Thu, Sep 3, 2015 at 10:59 PM, Wilco Dijkstra <wdijk...@arm.com> wrote:
>> Segher Boessenkool wrote:
>> On Thu, Sep 03, 2015 at 12:43:34PM +0100, Wilco Dijkstra wrote:
>> > > > Combine canonicalizes certain AND masks in a comparison with
>> > > > zero into extracts of the widest register type. During matching
>> > > > these are expanded into a very inefficient sequence that fails
>> > > > to match. For example (x & 2) == 0 is matched in combine like
>> > > > this:
>> > > >
>> > > > Failed to match this instruction:
>> > > > (set (reg:CC 66 cc)
>> > > >     (compare:CC (zero_extract:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
>> > > >             (const_int 1 [0x1])
>> > > >             (const_int 1 [0x1]))
>> > > >         (const_int 0 [0])))
>> > >
>> > > Yes. Some processors even have specific instructions to do this.
>> >
>> > However there are 2 issues with this, one is the spurious subreg,
>>
>> Combine didn't make that up out of thin air; something already used
>> DImode here. It could simplify it to SImode in this case, that is
>> true, don't know why it doesn't; it isn't necessarily faster code to
>> do so, it can be slower, it might not match, etc.
>
> The relevant RTL instructions on AArch64 are:
>
> (insn 8 3 25 2 (set (reg:SI 77 [ D.2705 ])
>         (and:SI (reg/v:SI 76 [ xD.2641 ])
>             (const_int 2 [0x2]))) tmp5.c:122 452 {andsi3}
>      (nil))
> (insn 26 25 27 2 (set (reg:CC 66 cc)
>         (compare:CC (reg:SI 77 [ D.2705 ])
>             (const_int 0 [0]))) tmp5.c:122 377 {*cmpsi}
>      (expr_list:REG_DEAD (reg:SI 77 [ D.2705 ])
>         (nil)))
>
> I don't see anything using DI...
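
For reference, a plausible reconstruction of the source shape behind
those two insns (the thread's tmp5.c itself isn't shown, so the
function below is made up):

int                     /* hypothetical reduction of tmp5.c:122 */
f (int x)
{
  return (x & 2) == 0;  /* expands to the andsi3/*cmpsi pair quoted
                           above; combine then merges the pair into the
                           DImode zero_extract form that fails to match */
}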
>> > the other is that only a subset of legal zero_extracts are tried
>> > (only single bit and the degenerate case of zero_extract with shift
>> > of 0 - which I think should not be a zero_extract). All other AND
>> > immediates remain as AND.
>>
>> Yes. I'm happy to see this weird special case "optimisation",
>> "canonicalisation" gone.
>>
>> > So to emit an AND on targets without such specific instructions, or
>> > where such instructions are more expensive than a simple AND (*),
>> > you now need at least 3 different backend patterns for any
>> > instruction that can emit AND immediate...
>>
>> It's only a problem for AND-and-compare, no?
>
> Yes, so it looks like some other backends match the odd pattern and
> then have another pattern change it back into the canonical AND/TST
> form during the split phase (maybe the subreg confuses register
> allocation or blocks other optimizations). This all seems a lot of
> unnecessary complexity for a few special immediates when there is a
> much simpler solution...
>
>> > (*) I think that is another issue in combine - when both
>> > alternatives match you want to select the lowest cost one, not the
>> > first one that matches.
>>
>> That's recog, not combine. And quite a few backends rely on "first
>> match wins", because it always has been that way. It also is very
>> easy to write such patterns accidentally (sometimes even the exact
>> same one twice in the same machine description, etc.)
>>
>> > > > Failed to match this instruction:
>> > > > (set (reg:CC 66 cc)
>> > > >     (compare:CC (and:DI (lshiftrt:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
>> > > >                 (const_int 1 [0x1]))
>> > > >             (const_int 1 [0x1]))
>> > > >         (const_int 0 [0])))
>> > >
>> > > This is after r223067. Combine tests only one "final" instruction;
>> > > that revision rewrites zero_ext* if it doesn't match and tries
>> > > again. This helps for processors that can do more generic masks
>> > > (like rs6000, and I believe also aarch64?): without it, you need
>> > > many more patterns to match all the zero_ext{ract,end} cases.
>> >
>> > But there are more efficient ways to emit single bit and mask tests
>> > that apply to most CPUs rather than doing something specific that
>> > works for just one target only. For example a single bit test is a
>> > simple shift into the carry flag or into the sign bit, and for mask
>> > tests, you shift out all the non-mask bits.
>>
>> Most of those are quite target-specific. Some others are already
>> done, and/or done by other passes.
>
> But what combine does here is even more target-specific. Shifts and
> AND setting flags are universally supported on targets with a
> condition code register. Bitfield test/extract instructions are more
> rare, and when they are supported, they may well be more expensive.
>
>> > So my question is, is it combine's job to try all possible
>> > permutations that constitute a bit or mask test?
>>
>> Combine converts the merged instructions to what it thinks is the
>> canonical or cheapest form, and uses that. It does not try multiple
>> options (the zero_ext* -> and+shift rewriting is not changing the
>> semantics of the pattern at all).
>
> But the change from AND to zero_extract is already changing semantics...
>
>> > Or would it be better to let each target decide on how to
>> > canonicalize bit tests and only try that alternative?
>>
>> The question is how to write the pattern to be most convenient for
>> all targets.
>
> The obvious choice is to try the 2 original instructions merged.
>
>> > > > Neither matches the AArch64 patterns for ANDS/TST (which is
>> > > > just compare and AND). If the immediate is not a power of 2 or
>> > > > a power of 2 - 1 then it matches correctly as expected.
>> > > >
>> > > > I don't understand how ((x >> 1) & 1) != 0 could be a useful expansion
>> > >
>> > > It is zero_extract(x,1,1) really. This is convenient for (old and
>> > > embedded) processors that have special bit-test instructions. If
>> > > we now want combine to not do this, we'll have to update all
>> > > backends that rely on it.
>> >
>> > Would any backend actually rely on this given it only does some
>> > specific masks, has a redundant shift with 0 for the mask case and
>> > the odd subreg as well?
>>
>> Such backends match the zero_extract patterns, of course. Random
>> example: the h8300 patterns for the "btst" instruction.
>>
>> > > They are common, and many processors had instructions for them,
>> > > which is (supposedly) why combine special-cased them.
>> >
>> > Yes, but that doesn't mean (x & C) != 0 shouldn't be tried as well...
>>
>> Combine does not try multiple options.
>
> I'm not following - combine tries zero_extract and shift+AND - that's
> 2 options. If that is feasible then adding a 3rd option should be
> possible.
>
>> > > > It's trivial to change change_zero_ext to expand extracts
>> > > > always into AND and remove the redundant subreg.
>> > >
>> > > Not really trivial... Think about comparisons...
>> > >
>> > > int32_t x;
>> > > ((x >> 31) & 1) > 0;
>> > > // is not the same as
>> > > (x & 0x80000000) > 0; // signed comparison
>> > >
>> > > (You do not easily know what the COMPARE is used for).
>> >
>> > Indeed if you had a zero_extract that included the sign-bit then
>> > you may have to adjust the compare condition. However it looks like
>> > that case can't happen - (x & 0x80000000) comparisons will have the
>> > AND optimized away much earlier.
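
Segher's point, rendered as runnable C (my sketch: the mails are about
RTL COMPARE semantics; the cast makes the signedness of the second
compare explicit, and ">> 31" on a negative value relies on GCC's
arithmetic right shift):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  int32_t x = -1;                 /* sign bit set */
  /* Bit extract: the result is 0 or 1, so the compare sees 1 > 0.  */
  printf ("%d\n", ((x >> 31) & 1) > 0);             /* prints 1 */
  /* AND form: the result keeps the sign bit; as a signed value it is
     negative, so the compare sees INT32_MIN > 0.  */
  printf ("%d\n", (int32_t) (x & 0x80000000) > 0);  /* prints 0 */
  return 0;
}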
However it looks like that case can't happen - (x & >> > 0x80000000) >> > comparisons with have the AND optimized away much earlier. >> >> A long time before combine, yes. But it is *possible* to give such insns >> to combine, and it should not do the wrong thing with them. > > Yes. So that suggests we'd need to block the canonicalization rather than > undo it. > >> > > > However wouldn't it make more sense to never special case certain AND >> > > > immediate in the >> first >> > > > place? >> > > >> > > Yes, but we need to make sure no targets regress (i.e. by rewriting >> > > patterns >> > > for those that do). We need to know the impact of such a change, at the >> > > least. >> > >> > The alternative would be let the target decide how to canonicalize bit >> > tests. >> > That seems like a better solution than trying multiple possibilities which >> > can never >> > match on most targets. >> >> You will end up with a *lot* of target hooks like this. It will also >> make testing harder (less coverage). I am not sure that is a good idea. > > We certainly need a lot more target hooks in general so GCC can do the right > thing > (rather than using costs inconsistently all over the place). But that's a > different > discussion...

This is the pattern which I have in my tree (I need someone here to
submit it still):

(define_insn "*and<mode>3_ze_nr_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (zero_extract:GPI  ; loc size pos
          (match_operand:GPI 0 "register_operand" "r")
          (match_operand:GPI 1 "const_int_operand" "n")
          (match_operand:GPI 2 "const_int_operand" "n"))
         (const_int 0)))]
  "aarch64_bitmask_imm ((((HOST_WIDE_INT_1U << UINTVAL (operands[1])) - 1)
                         << UINTVAL (operands[2])), <MODE>mode)"
  {
    unsigned HOST_WIDE_INT value
      = (((HOST_WIDE_INT_1U << UINTVAL (operands[1])) - 1)
         << UINTVAL (operands[2]));
    operands[1] = GEN_INT (value);
    return "tst\\t%<w>0, %<w>1";
  }
  [(set_attr "type" "logics_reg")]
)

--- CUT ---

And then under the COMPARE case of rtx_costs:

  /* CC_NZmode supports zero extract for free.  */
  if (GET_MODE (x) == CC_NZmode && GET_CODE (op0) == ZERO_EXTRACT)
    op0 = XEXP (op0, 0);

And in aarch64_select_cc_mode:

  if ((GET_MODE (x) == SImode || GET_MODE (x) == DImode)
      && y == const0_rtx
      && (code == EQ || code == NE || code == LT || code == GE)
      && (GET_CODE (x) == PLUS || GET_CODE (x) == MINUS
          || GET_CODE (x) == AND || GET_CODE (x) == NEG
          || GET_CODE (x) == ZERO_EXTRACT))
    return CC_NZmode;

Note I also have the following, to handle the case where a zero_extend
of the and result is needed (yes, this shows up mostly right after
returns from functions):

(define_insn "*andsi3_compare0_uxtw"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (and:SI (match_operand:SI 1 "register_operand" "%r,r")
                 (match_operand:SI 2 "aarch64_logical_operand" "r,K"))
         (const_int 0)))
   (set (match_operand:DI 0 "register_operand" "=r,r")
        (zero_extend:DI (and:SI (match_dup 1) (match_dup 2))))]
  ""
  "ands\\t%w0, %w1, %w2"
  [(set_attr "type" "logics_reg,logics_imm")]
)

Thanks,
Andrew Pinski

>
> Wilco
>
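
For reference, the mask arithmetic in the pattern's condition worked
through in plain C (ze_mask is an illustrative helper, not GCC code):

#include <stdio.h>

/* A zero_extract of SIZE bits at position POS corresponds to the AND
   mask ((1 << SIZE) - 1) << POS - the same computation the pattern's
   condition feeds to aarch64_bitmask_imm and its output turns back
   into an immediate with GEN_INT.  */
static unsigned long long
ze_mask (unsigned size, unsigned pos)
{
  return ((1ULL << size) - 1) << pos;
}

int
main (void)
{
  printf ("%#llx\n", ze_mask (1, 1)); /* 0x2: the (x & 2) == 0 case,
                                         giving "tst w0, 2" */
  printf ("%#llx\n", ze_mask (3, 0)); /* 0x7: a power-of-2-minus-1 mask,
                                         giving "tst w0, 7" */
  return 0;
}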