On 8 June 2016 at 18:40, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote: > > On 07/06/16 20:34, Christophe Lyon wrote: >> >> On 26 May 2016 at 11:53, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> >> wrote: >>> >>> Hi all, >>> >>> In this PR we want to optimise: >>> int foo (int i) >>> { >>> return (i == 0) ? N : __builtin_clz (i); >>> } >>> >>> on targets where CLZ is defined at zero to the constant 'N'. >>> This is determined at the RTL level through the CLZ_DEFINED_VALUE_AT_ZERO >>> macro. >>> The obvious place to implement this would be in combine through >>> simplify-rtx >>> where we'd >>> recognise an IF_THEN_ELSE of the form: >>> (set (reg:SI r1) >>> (if_then_else:SI (ne (reg:SI r2) >>> (const_int 0 [0])) >>> (clz:SI (reg:SI r2)) >>> (const_int 32))) >>> >>> and if CLZ_DEFINED_VALUE_AT_ZERO is defined to 32 for SImode we'd >>> simplify >>> it into >>> just (clz:SI (reg:SI r2)). >>> However, I found this doesn't quite happen for a couple of reasons: >>> 1) This depends on ifcvt or some other pass to have created a conditional >>> move of the >>> two branches that provide the IF_THEN_ELSE to propagate the const_int and >>> clz operation into. >>> >>> 2) Combine will refuse to propagate r2 from the above example into both >>> the >>> condition and the >>> CLZ at the same time, so the most we see is: >>> (set (reg:SI r1) >>> (if_then_else:SI (ne (reg:CC cc) >>> (const_int 0)) >>> (clz:SI (reg:SI r2)) >>> (const_int 32))) >>> >>> which is not enough information to perform the simplification. >>> >>> This patch implements the optimisation in ce1 using the noce ifcvt >>> framework. >>> During ifcvt noce_process_if_block can see that we're trying to optimise >>> something >>> of the form (x == 0 ? const_int : CLZ (x)) and so it has visibility of >>> all >>> the information >>> needed to perform the transformation. >>> >>> The transformation is performed by adding a new noce_try* function that >>> tries to put the >>> condition and the 'then' and 'else' arms into an IF_THEN_ELSE rtx and try >>> to >>> simplify that >>> using the simplify-rtx machinery. That way, we can implement the >>> simplification logic in >>> simplify-rtx.c where it belongs. >>> >>> A similar transformation for CTZ is implemented as well. >>> So for code: >>> int foo (int i) >>> { >>> return (i == 0) ? 32 : __builtin_clz (i); >>> } >>> >>> On aarch64 we now emit: >>> foo: >>> clz w0, w0 >>> ret >>> >>> instead of: >>> foo: >>> mov w1, 32 >>> clz w2, w0 >>> cmp w0, 0 >>> csel w0, w2, w1, ne >>> ret >>> >>> and for arm similarly we generate: >>> foo: >>> clz r0, r0 >>> bx lr >>> >>> instead of: >>> foo: >>> cmp r0, #0 >>> clzne r0, r0 >>> moveq r0, #32 >>> bx lr >>> >>> >>> and for x86_64 with -O2 -mlzcnt we generate: >>> foo: >>> xorl %eax, %eax >>> lzcntl %edi, %eax >>> ret >>> >>> instead of: >>> foo: >>> xorl %eax, %eax >>> movl $32, %edx >>> lzcntl %edi, %eax >>> testl %edi, %edi >>> cmove %edx, %eax >>> ret >>> >>> >>> I tried getting this to work on other targets as well, but encountered >>> difficulties. >>> For example on powerpc the two arms of the condition seen during ifcvt >>> are: >>> >>> (insn 4 22 11 4 (set (reg:DI 156 [ <retval> ]) >>> (const_int 32 [0x20])) clz.c:3 434 {*movdi_internal64} >>> (nil)) >>> and >>> (insn 10 9 23 3 (set (subreg/s/u:SI (reg:DI 156 [ <retval> ]) 0) >>> (clz:SI (subreg/u:SI (reg/v:DI 157 [ i ]) 0))) clz.c:3 132 >>> {clzsi2} >>> (expr_list:REG_DEAD (reg/v:DI 157 [ i ]) >>> (nil))) >>> >>> So the setup code in noce_process_if_block sees that the set destination >>> is >>> not the same >>> ((reg:DI 156 [ <retval> ]) and (subreg/s/u:SI (reg:DI 156 [ <retval> ]) >>> 0)) >>> so it bails out on the rtx_interchangeable_p (x, SET_DEST (set_b)) check. >>> I suppose that's a consequence of how SImode operations are represented >>> in >>> early RTL >>> on powerpc, I don't know what to do there. Perhaps that part of ivcvt can >>> be >>> taught to handle >>> destinations that are subregs of one another, but that would be a >>> separate >>> patch. >>> >>> Anyway, is this patch ok for trunk? >>> >>> Bootstrapped and tested on arm-none-linux-gnueabihf, >>> aarch64-none-linux-gnu, >>> x86_64-pc-linux-gnu. >>> >>> Thanks, >>> Kyrill >>> >>> 2016-05-26 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >>> >>> PR middle-end/37780 >>> * ifcvt.c (noce_try_ifelse_collapse): New function. >>> Declare prototype. >>> (noce_process_if_block): Call noce_try_ifelse_collapse. >>> * simplify-rtx.c (simplify_cond_clz_ctz): New function. >>> (simplify_ternary_operation): Use the above to simplify >>> conditional CLZ/CTZ expressions. >>> >>> 2016-05-26 Kyrylo Tkachov <kyrylo.tkac...@arm.com> >>> >>> PR middle-end/37780 >>> * gcc.c-torture/execute/pr37780.c: New test. >>> * gcc.target/aarch64/pr37780_1.c: Likewise. >>> * gcc.target/arm/pr37780_1.c: Likewise. >> >> Hi Kyrylo, > > > Hi Christophe, > >> I've noticed that gcc.target/arm/pr37780_1.c fails on >> arm if arch < v6. >> I first tried to fix the effective-target guard (IIRC, the doc >> says clz is available starting with v5t), but that isn't sufficient. >> >> When compiling for armv5-t, the scan-assembler directives >> fail. It seems to work with v6t2, so I am wondering whether >> it's just a matter of increasing the effective-target arch version, >> or if you really intended to make the test pass on these old >> architectures? > > > I've dug into it a bit. > I think the problem is that CLZ is available with ARMv5T but only in arm > mode. > In Thumb mode it's only available with ARMv6T2. > So if your armv5t gcc compiles for Thumb by default you'll get the failure.
Actually this gcc is configured with default cpu & mode, target=arm-none-eabi. So I think it's arm7tdmi, arm mode. > I think just bumping the effective to armv6t2 here is appropriate as I > intended > the test to just check the ifcvt transformation when the instruction is > available > rather than test that the CLZ instruction is available at one architecture > level > or another. > > A patch to bump the effective target check is pre-approved if you want to do > it. > Otherwise I'll get to it in a few days. > I was a about to commit the arm_arch_v5_ok -> arm_arch_v6t2_ok, but this would not be sufficient without adding dg-add-options arm_arch_v6t2 but then if gcc is configured with a higher default architecture, we'll end up testing if works for v6t2 only. > Thanks for spotting this, > Kyrill >> >> Thanks, >> >> Christophe. > >