Hi Christophe, On 09/05/17 10:45, Christophe Lyon wrote: > Hi Bernd, > > > On 4 September 2017 at 16:52, Kyrill Tkachov > <kyrylo.tkac...@foss.arm.com> wrote: >> >> On 29/04/17 18:45, Bernd Edlinger wrote: >>> >>> Ping... >>> >>> I attached a rebased version since there was a merge conflict in >>> the xordi3 pattern, otherwise the patch is still identical. >>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2 >>> early when the target has no neon or iwmmxt. >>> >>> >>> Thanks >>> Bernd. >>> >>> >>> >>> On 11/28/16 20:42, Bernd Edlinger wrote: >>>> >>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote: >>>>> >>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger >>>>> <bernd.edlin...@hotmail.de> wrote: >>>>>> >>>>>> Hi! >>>>>> >>>>>> This improves the stack usage on the sha512 test case for the case >>>>>> without hardware fpu and without iwmmxt by splitting all di-mode >>>>>> patterns right while expanding which is similar to what the >>>>>> shift-pattern >>>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well >>>>>> as >>>>>> thumb1. >>>>>> >>>>> I would go further and do this in the absence of Neon, the VFP unit >>>>> being there doesn't help with DImode operations i.e. we do not have 64 >>>>> bit integer arithmetic instructions without Neon. The main reason why >>>>> we have the DImode patterns split so late is to give a chance for >>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this >>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC >>>>> map down to these instructions. Doing this just for soft-float doesn't >>>>> improve the default case only. I don't usually test iwmmxt and I'm not >>>>> sure who has the ability to do so, thus keeping this restriction for >>>>> iwMMX is fine. >>>>> >>>>> >>>> Yes I understand, thanks for pointing that out. >>>> >>>> I was not aware what iwmmxt exists at all, but I noticed that most >>>> 64bit expansions work completely different, and would break if we split >>>> the pattern early. >>>> >>>> I can however only look at the assembler outout for iwmmxt, and make >>>> sure that the stack usage does not get worse. >>>> >>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as >>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack >>>> for the test cases, and vfp and soft-float at around 270 bytes stack >>>> usage. >>>> >>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!). >>>>>> >>>>>> Note this also splits many ldrd/strd instructions and therefore I will >>>>>> post a followup-patch that mitigates this effect by enabling the >>>>>> ldrd/strd >>>>>> peephole optimization after the necessary reg-testing. >>>>>> >>>>>> >>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf. >>>>> >>>>> What do you mean by arm-linux-gnueabihf - when folks say that I >>>>> interpret it as --with-arch=armv7-a --with-float=hard >>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon). >>>>> >>>>> If you've really bootstrapped and regtested it on armhf, doesn't this >>>>> patch as it stand have no effect there i.e. no change ? >>>>> arm-linux-gnueabihf usually means to me someone has configured with >>>>> --with-float=hard, so there are no regressions in the hard float ABI >>>>> case, >>>>> >>>> I know it proves little. When I say arm-linux-gnueabihf >>>> I do in fact mean --enable-languages=all,ada,go,obj-c++ >>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 >>>> --with-float=hard. >>>> >>>> My main interest in the stack usage is of course not because of linux, >>>> but because of eCos where we have very small task stacks and in fact >>>> no fpu support by the O/S at all, so that patch is exactly what we need. >>>> >>>> >>>> Bootstrapped and reg-tested on arm-linux-gnueabihf >>>> Is it OK for trunk? >> >> >> The code is ok. >> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as >> the splits >> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and >> TARGET_IWMMXT2 >> is irrelevant here). >> >> So this is ok for trunk. >> Thanks, and sorry again for the delay. >> Kyrill >> > > This patch (r251663) causes a regression on armeb-none-linux-gnueabihf > --with-mode arm > --with-cpu cortex-a9 > --with-fpu vfpv3-d16-fp16 > FAIL: gcc.dg/vect/vect-singleton_1.c (internal compiler error) > FAIL: gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects > (internal compiler error) > > the test passes if gcc is configured --with-fpu neon-fp16 >
Thank you very much for what you do! I am able to reproduce this. Combine creates an invalid insn out of these two insns: (insn 12 8 18 2 (parallel [ (set (reg:DI 122) (plus:DI (reg:DI 116 [ aD.5331 ]) (reg:DI 119 [ bD.5332 ]))) (clobber (reg:CC 100 cc)) ]) "vect-singleton_1.c":28 1 {*arm_adddi3} (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ]) (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ]) (expr_list:REG_UNUSED (reg:CC 100 cc) (nil))))) (insn 18 12 19 2 (set (reg/i:DI 16 s0) (reg:DI 122)) "vect-singleton_1.c":28 650 {*movdi_vfp} (expr_list:REG_DEAD (reg:DI 122) (nil))) => (insn 18 12 19 2 (parallel [ (set (reg/i:DI 16 s0) (plus:DI (reg:DI 116 [ aD.5331 ]) (reg:DI 119 [ bD.5332 ]))) (clobber (reg:CC 100 cc)) ]) "vect-singleton_1.c":28 1 {*arm_adddi3} (expr_list:REG_UNUSED (reg:CC 100 cc) (expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ]) (expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ]) (nil))))) But after reload this is fixed again, so that is no issue for a splitter that runs always after reload. But now the splitter runs into an assertion in gen_highpart(op0). The same problem should exist with the subdi3 as well, but not with the other patterns in this patch. I think the right way to fix this is use a non-vfp register predicate to prevent combine to create the un-splittable instruction in the first place. See attached my proposed fix for this defect. Christophe, You could do me a favor, and test it in your environment as I have nothing comparable. Kyrill, is this patch OK after bootstrap / reg-testing? Thanks Bernd.
2017-09-05 Bernd Edlinger <bernd.edlin...@hotmail.de> PR target/77308 * config/arm/predicates.md (s_register_operand_nv, arm_adddi_operand_nv): Create new non-vfp predicates. * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates. Index: gcc/config/arm/arm.md =================================================================== --- gcc/config/arm/arm.md (revision 251663) +++ gcc/config/arm/arm.md (working copy) @@ -457,14 +457,13 @@ ) (define_insn_and_split "*arm_adddi3" - [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r,&r,&r") - (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r") - (match_operand:DI 2 "arm_adddi_operand" "r, 0, r, Dd, Dd"))) + [(set (match_operand:DI 0 "s_register_operand_nv" "=&r,&r,&r,&r,&r") + (plus:DI (match_operand:DI 1 "s_register_operand_nv" "%0, 0, r, 0, r") + (match_operand:DI 2 "arm_adddi_operand_nv" "r, 0, r, Dd, Dd"))) (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" - "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed) - && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))" + "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)" [(parallel [(set (reg:CC_C CC_REGNUM) (compare:CC_C (plus:SI (match_dup 1) (match_dup 2)) (match_dup 1))) @@ -1263,9 +1262,9 @@ ) (define_insn_and_split "*arm_subdi3" - [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r") - (minus:DI (match_operand:DI 1 "s_register_operand" "0,r,0") - (match_operand:DI 2 "s_register_operand" "r,0,0"))) + [(set (match_operand:DI 0 "s_register_operand_nv" "=&r,&r,&r") + (minus:DI (match_operand:DI 1 "s_register_operand_nv" "0,r,0") + (match_operand:DI 2 "s_register_operand_nv" "r,0,0"))) (clobber (reg:CC CC_REGNUM))] "TARGET_32BIT && !TARGET_NEON" "#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2" Index: gcc/config/arm/predicates.md =================================================================== --- gcc/config/arm/predicates.md (revision 251663) +++ gcc/config/arm/predicates.md (working copy) @@ -653,3 +653,11 @@ (ior (and (match_code "symbol_ref") (match_test "!arm_is_long_call_p (SYMBOL_REF_DECL (op))")) (match_operand 0 "s_register_operand"))) + +(define_predicate "s_register_operand_nv" + (and (match_operand 0 "s_register_operand") + (not (match_operand 0 "vfp_hard_register_operand")))) + +(define_predicate "arm_adddi_operand_nv" + (and (match_operand 0 "arm_adddi_operand") + (not (match_operand 0 "vfp_hard_register_operand"))))