On 10 June 2016 at 10:38, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote: > > On 09/06/16 13:14, Christophe Lyon wrote: >> >> On 8 June 2016 at 18:40, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> >> wrote: >>> >>> On 07/06/16 20:34, Christophe Lyon wrote: >>>> >>>> On 26 May 2016 at 11:53, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> >>>> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> In this PR we want to optimise: >>>>> int foo (int i) >>>>> { >>>>> return (i == 0) ? N : __builtin_clz (i); >>>>> } >>>>> >>>>> on targets where CLZ is defined at zero to the constant 'N'. >>>>> This is determined at the RTL level through the >>>>> CLZ_DEFINED_VALUE_AT_ZERO >>>>> macro. >>>>> The obvious place to implement this would be in combine through >>>>> simplify-rtx >>>>> where we'd >>>>> recognise an IF_THEN_ELSE of the form: >>>>> (set (reg:SI r1) >>>>> (if_then_else:SI (ne (reg:SI r2) >>>>> (const_int 0 [0])) >>>>> (clz:SI (reg:SI r2)) >>>>> (const_int 32))) >>>>> >>>>> and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd >>>>> simplify >>>>> it into >>>>> just (clz:SI (reg:SI r2)). >>>>> However, I found this doesn't quite happen for a couple of reasons: >>>>> 1) This depends on ifcvt or some other pass to have created a >>>>> conditional >>>>> move of the >>>>> two branches that provide the IF_THEN_ELSE to propagate the const_int >>>>> and >>>>> clz operation into. >>>>> >>>>> 2) Combine will refuse to propagate r2 from the above example into both >>>>> the >>>>> condition and the >>>>> CLZ at the same time, so the most we see is: >>>>> (set (reg:SI r1) >>>>> (if_then_else:SI (ne (reg:CC cc) >>>>> (const_int 0)) >>>>> (clz:SI (reg:SI r2)) >>>>> (const_int 32))) >>>>> >>>>> which is not enough information to perform the simplification. >>>>> >>>>> This patch implements the optimisation in ce1 using the noce ifcvt >>>>> framework. >>>>> During ifcvt noce_process_if_block can see that we're trying to >>>>> optimise >>>>> something >>>>> of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of >>>>> all >>>>> the information >>>>> needed to perform the transformation. >>>>> >>>>> The transformation is performed by adding a new noce_try* function that >>>>> tries to put the >>>>> condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and >>>>> try >>>>> to >>>>> simplify that >>>>> using the simplify-rtx machinery. That way, we can implement the >>>>> simplification logic in >>>>> simplify-rtx.c where it belongs. >>>>> >>>>> A similar transformation for CTZ is implemented as well. >>>>> So for code: >>>>> int foo (int i) >>>>> { >>>>> return (i == 0) ? 32 : __builtin_clz (i); >>>>> } >>>>> >>>>> On aarch64 we now emit: >>>>> foo: >>>>> clz w0, w0 >>>>> ret >>>>> >>>>> instead of: >>>>> foo: >>>>> mov w1, 32 >>>>> clz w2, w0 >>>>> cmp w0, 0 >>>>> csel w0, w2, w1, ne >>>>> ret >>>>> >>>>> and for arm similarly we generate: >>>>> foo: >>>>> clz r0, r0 >>>>> bx lr >>>>> >>>>> instead of: >>>>> foo: >>>>> cmp r0, #0 >>>>> clzne r0, r0 >>>>> moveq r0, #32 >>>>> bx lr >>>>> >>>>> >>>>> and for x86_64 with -O2 -mlzcnt we generate: >>>>> foo: >>>>> xorl %eax, %eax >>>>> lzcntl %edi, %eax >>>>> ret >>>>> >>>>> instead of: >>>>> foo: >>>>> xorl %eax, %eax >>>>> movl $32, %edx >>>>> lzcntl %edi, %eax >>>>> testl %edi, %edi >>>>> cmove %edx, %eax >>>>> ret >>>>> >>>>> >>>>> I tried getting this to work on other targets as well, but encountered >>>>> difficulties. >>>>> For example on powerpc the two arms of the condition seen during ifcvt >>>>> are: >>>>> >>>>> (insn 4 22 11 4 (set (reg:DI 156 [ <retval> ]) >>>>> (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64} >>>>> (nil)) >>>>> and >>>>> (insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ <retval> ]) 0) >>>>> (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132 >>>>> {clzsi2} >>>>> (expr_list:REG_DEAD (reg/v:DI 157 [ i ]) >>>>> (nil))) >>>>> >>>>> So the setup code in noce_process_if_block sees that the set >>>>> destination >>>>> is >>>>> not the same >>>>> ((reg:DI 156 [ <retval> ]) and (subreg/s/u:SI (reg:DI 156 [ <retval> ]) >>>>> 0)) >>>>> so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b)) >>>>> check. >>>>> I suppose that's a consequence of how SImode operations are represented >>>>> in >>>>> early RTL >>>>> on powerpc, I don't know what to do there. Perhaps that part of ivcvt >>>>> can >>>>> be >>>>> taught to handle >>>>> destinations that are subregs of one another, but that would be a >>>>> separate >>>>> patch. >>>>> >>>>> Anyway, is this patch ok for trunk? >>>>> >>>>> Bootstrapped and tested on arm-none-linux-gnueabihf, >>>>> aarch64-none-linux-gnu, >>>>> x86_64-pc-linux-gnu. >>>>> >>>>> Thanks, >>>>> Kyrill >>>>> >>>>> 2016-05-26 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >>>>> >>>>> PR middle-end/37780 >>>>> * ifcvt.c (noce_try_ifelse_collapse): New function. >>>>> Declare prototype. >>>>> (noce_process_if_block): Call noce_try_ifelse_collapse. >>>>> * simplify-rtx.c (simplify_cond_clz_ctz): New function. >>>>> (simplify_ternary_operation): Use the above to simplify >>>>> conditional CLZ/CTZ expressions. >>>>> >>>>> 2016-05-26 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >>>>> >>>>> PR middle-end/37780 >>>>> * gcc.c-torture/execute/pr37780.c: New test. >>>>> * gcc.target/aarch64/pr37780_1.c: Likewise. >>>>> * gcc.target/arm/pr37780_1.c: Likewise. >>>> >>>> Hi Kyrylo, >>> >>> >>> Hi Christophe, >>> >>>> I've noticed that gcc.target/arm/pr37780_1.c fails on >>>> arm if arch < v6. >>>> I first tried to fix the effective-target guard (IIRC, the doc >>>> says clz is available starting with v5t), but that isn't sufficient. >>>> >>>> When compiling for armv5-t, the scan-assembler directives >>>> fail. It seems to work with v6t2, so I am wondering whether >>>> it's just a matter of increasing the effective-target arch version, >>>> or if you really intended to make the test pass on these old >>>> architectures? >>> >>> >>> I've dug into it a bit. >>> I think the problem is that CLZ is available with ARMv5T but only in arm >>> mode. >>> In Thumb mode it's only available with ARMv6T2. >>> So if your armv5t gcc compiles for Thumb by default you'll get the >>> failure. >> >> Actually this gcc is configured with default cpu & mode, >> target=arm-none-eabi. >> >> So I think it's arm7tdmi, arm mode. >> >>> I think just bumping the effective to armv6t2 here is appropriate as I >>> intended >>> the test to just check the ifcvt transformation when the instruction is >>> available >>> rather than test that the CLZ instruction is available at one >>> architecture >>> level >>> or another. >>> >>> A patch to bump the effective target check is pre-approved if you want to >>> do >>> it. >>> Otherwise I'll get to it in a few days. >>> >> I was a about to commit the arm_arch_v5_ok -> arm_arch_v6t2_ok, but >> this would not be sufficient without adding >> dg-add-options arm_arch_v6t2 >> but then if gcc is configured with a higher default architecture, we'll >> end >> up testing if works for v6t2 only. > > > I'm okay with adding "dg-add-options arm_arch_v6t2". > We're not testing the availability of the instruction here, just the ifcvt > transformation when it *is* available, so I think we should just add > whatever > option guarantees that instruction is available. > > Kyrill >
OK, I've committed the attached patch. 2016-06-10 Christophe Lyon <christophe.l...@linaro.org> * gcc.target/arm/pr37780_1.c: Use arm_arch_v6t2 effective target and options. > >>> Thanks for spotting this, >>> Kyrill >>>> >>>> Thanks, >>>> >>>> Christophe. >>> >>> >
diff --git a/gcc/testsuite/gcc.target/arm/pr37780_1.c b/gcc/testsuite/gcc.target/arm/pr37780_1.c index b954fe5..8e06920 100644 --- a/gcc/testsuite/gcc.target/arm/pr37780_1.c +++ b/gcc/testsuite/gcc.target/arm/pr37780_1.c @@ -2,8 +2,9 @@ being defined at zero. */ /* { dg-do compile } */ -/* { dg-require-effective-target arm_arch_v5_ok } */ +/* { dg-require-effective-target arm_arch_v6t2_ok } */ /* { dg-options "-O2" } */ +/* { dg-add-options arm_arch_v6t2 } */ int fooctz (int i)