On Thu, Sep 3, 2015 at 10:59 PM, Wilco Dijkstra <wdijk...@arm.com> wrote:
>> Segher Boessenkool wrote:
>> On Thu, Sep 03, 2015 at 12:43:34PM +0100, Wilco Dijkstra wrote:
>> > > > Combine canonicalizes certain AND masks in a comparison with
>> > > > zero into extracts of the widest register type. During matching
>> > > > these are expanded into a very inefficient sequence that fails
>> > > > to match. For example (x & 2) == 0 is matched in combine like
>> > > > this:
>> > > >
>> > > > Failed to match this instruction:
>> > > > (set (reg:CC 66 cc)
>> > > >     (compare:CC (zero_extract:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
>> > > >             (const_int 1 [0x1])
>> > > >             (const_int 1 [0x1]))
>> > > >         (const_int 0 [0])))
>> > >
>> > > Yes. Some processors even have specific instructions to do this.
>> >
>> > However there are 2 issues with this, one is the spurious subreg,
>>
>> Combine didn't make that up out of thin air; something already used
>> DImode here. It could simplify it to SImode in this case, that is
>> true, don't know why it doesn't; it isn't necessarily faster code to
>> do so, it can be slower, it might not match, etc.
>
> The relevant RTL instructions on AArch64 are:
>
> (insn 8 3 25 2 (set (reg:SI 77 [ D.2705 ])
>         (and:SI (reg/v:SI 76 [ xD.2641 ])
>             (const_int 2 [0x2]))) tmp5.c:122 452 {andsi3}
>      (nil))
> (insn 26 25 27 2 (set (reg:CC 66 cc)
>         (compare:CC (reg:SI 77 [ D.2705 ])
>             (const_int 0 [0]))) tmp5.c:122 377 {*cmpsi}
>      (expr_list:REG_DEAD (reg:SI 77 [ D.2705 ])
>         (nil)))
>
> I don't see anything using DI...
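
For reference, a plausible reconstruction of the source shape behind
those two insns (the thread's tmp5.c itself isn't shown, so the
function below is made up):

int                     /* hypothetical reduction of tmp5.c:122 */
f (int x)
{
  return (x & 2) == 0;  /* expands to the andsi3/*cmpsi pair quoted
                           above; combine then merges the pair into the
                           DImode zero_extract form that fails to match */
}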
>> > the other is that only a subset of legal zero_extracts are tried
>> > (only single bit and the degenerate case of zero_extract with shift
>> > of 0 - which I think should not be a zero_extract). All other AND
>> > immediates remain as AND.
>>
>> Yes. I'm happy to see this weird special case "optimisation",
>> "canonicalisation" gone.
>>
>> > So to emit an AND on targets without such specific instructions, or
>> > where such instructions are more expensive than a simple AND (*),
>> > you now need at least 3 different backend patterns for any
>> > instruction that can emit AND immediate...
>>
>> It's only a problem for AND-and-compare, no?
>
> Yes, so it looks like some other backends match the odd pattern and
> then have another pattern change it back into the canonical AND/TST
> form during the split phase (maybe the subreg confuses register
> allocation or blocks other optimizations). This all seems a lot of
> unnecessary complexity for a few special immediates when there is a
> much simpler solution...
>
>> > (*) I think that is another issue in combine - when both
>> > alternatives match you want to select the lowest cost one, not the
>> > first one that matches.
>>
>> That's recog, not combine. And quite a few backends rely on "first
>> match wins", because it always has been that way. It also is very
>> easy to write such patterns accidentally (sometimes even the exact
>> same one twice in the same machine description, etc.)
>>
>> > > > Failed to match this instruction:
>> > > > (set (reg:CC 66 cc)
>> > > >     (compare:CC (and:DI (lshiftrt:DI (subreg:DI (reg/v:SI 76 [ xD.2641 ]) 0)
>> > > >                 (const_int 1 [0x1]))
>> > > >             (const_int 1 [0x1]))
>> > > >         (const_int 0 [0])))
>> > >
>> > > This is after r223067. Combine tests only one "final" instruction;
>> > > that revision rewrites zero_ext* if it doesn't match and tries
>> > > again. This helps for processors that can do more generic masks
>> > > (like rs6000, and I believe also aarch64?): without it, you need
>> > > many more patterns to match all the zero_ext{ract,end} cases.
>> >
>> > But there are more efficient ways to emit single bit and mask tests
>> > that apply to most CPUs rather than doing something specific that
>> > works for just one target only. For example a single bit test is a
>> > simple shift into the carry flag or into the sign bit, and for mask
>> > tests, you shift out all the non-mask bits.
>>
>> Most of those are quite target-specific. Some others are already
>> done, and/or done by other passes.
>
> But what combine does here is even more target-specific. Shifts and
> AND setting flags are universally supported on targets with a
> condition code register. Bitfield test/extract instructions are more
> rare, and when they are supported, they may well be more expensive.
>
>> > So my question is, is it combine's job to try all possible
>> > permutations that constitute a bit or mask test?
>>
>> Combine converts the merged instructions to what it thinks is the
>> canonical or cheapest form, and uses that. It does not try multiple
>> options (the zero_ext* -> and+shift rewriting is not changing the
>> semantics of the pattern at all).
>
> But the change from AND to zero_extract is already changing semantics...
>
>> > Or would it be better to let each target decide on how to
>> > canonicalize bit tests and only try that alternative?
>>
>> The question is how to write the pattern to be most convenient for
>> all targets.
>
> The obvious choice is to try the 2 original instructions merged.
>
>> > > > Neither matches the AArch64 patterns for ANDS/TST (which is
>> > > > just compare and AND). If the immediate is not a power of 2 or
>> > > > a power of 2 - 1 then it matches correctly as expected.
>> > > >
>> > > > I don't understand how ((x >> 1) & 1) != 0 could be a useful expansion
>> > >
>> > > It is zero_extract(x,1,1) really. This is convenient for (old and
>> > > embedded) processors that have special bit-test instructions. If
>> > > we now want combine to not do this, we'll have to update all
>> > > backends that rely on it.
>> >
>> > Would any backend actually rely on this given it only does some
>> > specific masks, has a redundant shift with 0 for the mask case and
>> > the odd subreg as well?
>>
>> Such backends match the zero_extract patterns, of course. Random
>> example: the h8300 patterns for the "btst" instruction.
>>
>> > > They are common, and many processors had instructions for them,
>> > > which is (supposedly) why combine special-cased them.
>> >
>> > Yes, but that doesn't mean (x & C) != 0 shouldn't be tried as well...
>>
>> Combine does not try multiple options.
>
> I'm not following - combine tries zero_extract and shift+AND - that's
> 2 options. If that is feasible then adding a 3rd option should be
> possible.
>
>> > > > It's trivial to change change_zero_ext to expand extracts
>> > > > always into AND and remove the redundant subreg.
>> > >
>> > > Not really trivial... Think about comparisons...
>> > >
>> > > int32_t x;
>> > > ((x >> 31) & 1) > 0;
>> > > // is not the same as
>> > > (x & 0x80000000) > 0; // signed comparison
>> > >
>> > > (You do not easily know what the COMPARE is used for).
>> >
>> > Indeed if you had a zero_extract that included the sign-bit then
>> > you may have to adjust the compare condition. However it looks like
>> > that case can't happen - (x & 0x80000000) comparisons will have the
>> > AND optimized away much earlier.
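
Segher's point, rendered as runnable C (my sketch: the mails are about
RTL COMPARE semantics; the cast makes the signedness of the second
compare explicit, and ">> 31" on a negative value relies on GCC's
arithmetic right shift):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  int32_t x = -1;                 /* sign bit set */
  /* Bit extract: the result is 0 or 1, so the compare sees 1 > 0.  */
  printf ("%d\n", ((x >> 31) & 1) > 0);             /* prints 1 */
  /* AND form: the result keeps the sign bit; as a signed value it is
     negative, so the compare sees INT32_MIN > 0.  */
  printf ("%d\n", (int32_t) (x & 0x80000000) > 0);  /* prints 0 */
  return 0;
}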
However it looks like that case can't happen - (x & >> > 0x80000000) >> > comparisons with have the AND optimized away much earlier. >> >> A long time before combine, yes. But it is *possible* to give such insns >> to combine, and it should not do the wrong thing with them. > > Yes. So that suggests we'd need to block the canonicalization rather than > undo it. > >> > > > However wouldn't it make more sense to never special case certain AND >> > > > immediate in the >> first >> > > > place? >> > > >> > > Yes, but we need to make sure no targets regress (i.e. by rewriting >> > > patterns >> > > for those that do). We need to know the impact of such a change, at the >> > > least. >> > >> > The alternative would be let the target decide how to canonicalize bit >> > tests. >> > That seems like a better solution than trying multiple possibilities which >> > can never >> > match on most targets. >> >> You will end up with a *lot* of target hooks like this. It will also >> make testing harder (less coverage). I am not sure that is a good idea. > > We certainly need a lot more target hooks in general so GCC can do the right > thing > (rather than using costs inconsistently all over the place). But that's a > different > discussion...

This is the pattern which I have in my tree (I need someone here to
submit it still):

(define_insn "*and<mode>3_ze_nr_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (zero_extract:GPI  ; loc size pos
          (match_operand:GPI 0 "register_operand" "r")
          (match_operand:GPI 1 "const_int_operand" "n")
          (match_operand:GPI 2 "const_int_operand" "n"))
         (const_int 0)))]
  "aarch64_bitmask_imm ((((HOST_WIDE_INT_1U << UINTVAL (operands[1])) - 1)
                         << UINTVAL (operands[2])), <MODE>mode)"
  {
    unsigned HOST_WIDE_INT value
      = (((HOST_WIDE_INT_1U << UINTVAL (operands[1])) - 1)
         << UINTVAL (operands[2]));
    operands[1] = GEN_INT (value);
    return "tst\\t%<w>0, %<w>1";
  }
  [(set_attr "type" "logics_reg")]
)

--- CUT ---

And then under the COMPARE case of rtx_costs:

  /* CC_NZmode supports zero extract for free.  */
  if (GET_MODE (x) == CC_NZmode && GET_CODE (op0) == ZERO_EXTRACT)
    op0 = XEXP (op0, 0);

And in aarch64_select_cc_mode:

  if ((GET_MODE (x) == SImode || GET_MODE (x) == DImode)
      && y == const0_rtx
      && (code == EQ || code == NE || code == LT || code == GE)
      && (GET_CODE (x) == PLUS || GET_CODE (x) == MINUS
          || GET_CODE (x) == AND || GET_CODE (x) == NEG
          || GET_CODE (x) == ZERO_EXTRACT))
    return CC_NZmode;

Note I also have the following, to handle the case where a zero_extend
of the and result is needed (yes, this shows up mostly right after
returns from functions):

(define_insn "*andsi3_compare0_uxtw"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (and:SI (match_operand:SI 1 "register_operand" "%r,r")
                 (match_operand:SI 2 "aarch64_logical_operand" "r,K"))
         (const_int 0)))
   (set (match_operand:DI 0 "register_operand" "=r,r")
        (zero_extend:DI (and:SI (match_dup 1) (match_dup 2))))]
  ""
  "ands\\t%w0, %w1, %w2"
  [(set_attr "type" "logics_reg,logics_imm")]
)

Thanks,
Andrew Pinski

>
> Wilco
>
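
For reference, the mask arithmetic in the pattern's condition worked
through in plain C (ze_mask is an illustrative helper, not GCC code):

#include <stdio.h>

/* A zero_extract of SIZE bits at position POS corresponds to the AND
   mask ((1 << SIZE) - 1) << POS - the same computation the pattern's
   condition feeds to aarch64_bitmask_imm and its output turns back
   into an immediate with GEN_INT.  */
static unsigned long long
ze_mask (unsigned size, unsigned pos)
{
  return ((1ULL << size) - 1) << pos;
}

int
main (void)
{
  printf ("%#llx\n", ze_mask (1, 1)); /* 0x2: the (x & 2) == 0 case,
                                         giving "tst w0, 2" */
  printf ("%#llx\n", ze_mask (3, 0)); /* 0x7: a power-of-2-minus-1 mask,
                                         giving "tst w0, 7" */
  return 0;
}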