x86 gcc lacks simple optimization
Hi, Consider the code: int foo(char *t, char *v, int w) { int i; for (i = 1; i != w; ++i) { int x = i << 2; v[x + 4] = t[x + 4]; } return 0; } Compile it for x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: gcc -O2 -m32 -S test.c You will see a loop formed like: .L5: leal 0(,%eax,4), %edx addl $1, %eax movzbl 4(%edi,%edx), %ecx cmpl %ebx, %eax movb %cl, 4(%esi,%edx) jne .L5 But it can easily be simplified to something like this: .L5: addl $1, %eax movzbl (%esi,%eax,4), %edx cmpl %ecx, %eax movb %dl, (%ebx,%eax,4) jne .L5 (i.e. the left shift may be moved into the address). First, a question to the gcc-help mailing list. Maybe there are some options that I've missed, and there IS a way to explain to gcc my intention to do this? And a second question to the gcc developers mailing list. I am working on a private backend and want to add this optimization to my backend. What do you advise me to do -- a custom gimple pass, an rtl pass, modifying some existing pass, etc.? --- With best regards, Konstantin
RE: Controlling reloads of movsicc pattern
Hum, I can't change the gcc branch because I'm tied to gnat 7.1.2, based on gcc 4.7.3 (I saw that LRA was merged in 4.8). I will use a workaround for the moment (i.e. disable wide offset MEM on conditional moves). Does someone know if the gnat frontend will rebase on 4.8 soon :) ? (or maybe LRA will be merged in 4.7.4 ?) Thanks Selim -Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Wednesday, December 4, 2013 18:02 To: BELBACHIR Selim; gcc@gcc.gnu.org Subject: Re: Controlling reloads of movsicc pattern On 12/04/13 03:22, BELBACHIR Selim wrote: > Hi, > > My target has : > - 2 registers class to store SImode (like m68k, data $D & address $A). > - moves from wide offset MEM to $D or $A (ex: mov d($A1+50),$A2 or mov > d($A1+50),$D1) > - conditional moves from offset MEM to $D or $A but with a restriction : > offset MEM conditionally moved to $A has a limited offset of > 0 or 1 (ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ > d($A1,50),$D1) > > The predicate of movsicc pattern tells GCC that wide offset MEM is allowed > and constraints describe 2 alternatives for 'wide offset MEM -> $D ' and > 'restricted offset MEM -> $A" : > > (define_insn_and_split "movsicc_internal" >    [(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m") > (if_then_else:SI >    (match_operator 1 "prism_comparison_operator" > [(match_operand 4 "cc_register" "") (const_int 0)]) >    (match_operand:SI 2 "nonimmediate_operand" " v,m,r,0,0,0,v,m,r") > ;; "v" constraint is for restricted offset MEM >    (match_operand:SI 3 " nonimmediate_operand" " > 0,0,0,v,m,r,v,m,r")))] ;; the last 3 alternatives are split to match > the other alternatives > > > > I encountered : (on gcc4.7.3) > > core_main.c:354:1: error: insn does not satisfy its constraints: > (insn 1176 1175 337 26 (set (reg:SI 5 $A5) > (if_then_else:SI (ne (reg:CC 56 $CCI) > (const_int 0 [0])) > (mem/c:SI (plus:SI (reg/f:SI 0 $A0) > (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32]) > (const_int 1 [0x1]))) core_main.c:211:32 158 > {movsicc_internal} > > Due to reload pass (core_main.c.199r.reload). > > > How can I tune reload or write my movsicc pattern to prevent reload pass from > generating a conditional move from wide offset MEM to $A registers ?? If at all possible, I would recommend switching to LRA. There's an up-front cost, but it's definitely the direction all ports should be heading. Avoiding reload is, umm, good. jeff
Re: Truncate optimisation question
Eric Botcazou writes: >> Well, I think making the simplify-rtx code conditional on the target >> would be the wrong way to go. If we really can't live with it being >> unconditional then I think we should revert it. But like I say I think >> it would be better to make combine recognise the redundancy even with >> the new form. (Or as I say, longer term, not to rely on combine to >> eliminate redundant extensions.) But I don't have time to do that myself... > > It helps x86 so we won't revert it. My fear is that we'll need to add code > in > other places to RISCify back the result of this "simplification". Sorry, realised I didn't respond to this yesterday. I wasn't suggesting we just revert and walk away. ISTR the original suggestion was to patch combine instead of simplify-rtx.c, so we could go back to that. Thanks, Richard
Re: Dependency confusion in sched-deps
On 06-Dec-13 01:34 AM, Maxim Kuvyrkov wrote: On 6/12/2013, at 8:44 am, shmeel gutl wrote: On 05-Dec-13 02:39 AM, Maxim Kuvyrkov wrote: Dependency type plays a role for estimating costs and latencies between instructions (which affects performance), but using a wrong or imprecise dependency type does not affect correctness. On multi-issue architectures it does make a difference. Anti dependence permits the two instructions to be issued during the same cycle, whereas true dependency and output dependency would forbid this. Or am I misinterpreting your comment? On VLIW-flavoured machines without resource conflict checking -- "yes", it is critical not to use an anti dependency where an output or true dependency exists. This is the case, though, only because these machines do not follow sequential semantics for instruction execution (i.e., effects from previous instructions are not necessarily observed by subsequent instructions on the same/close cycles). On machines with internal resource conflict checking, having a wrong type on the dependency should not cause wrong behavior, but "only" suboptimal performance. Thank you, -- Maxim Kuvyrkov www.kugelworks.com Earlier in the thread you wrote: Output dependency is the right type (write after write). Anti dependency is write after read, and true dependency is read after write. Should the code be changed to accommodate VLIW machines? It has been there since the module was originally checked into trunk.
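To make the three dependence kinds concrete, here is a minimal C sketch (illustrative only; the variable names are invented for the example):

void deps (int b, int c, int e, int f)
{
  int a, d;
  a = b + c;   /* (1) writes a */
  d = a + 1;   /* (2) true dependence on (1): read after write of a */
  a = e + f;   /* (3) anti dependence on (2): write after read of a,
                  and output dependence on (1): write after write of a */
  (void) d;
  (void) a;
}

On a VLIW machine without resource conflict checking, (2) and (3) may be issued in the same cycle only because (3) is an anti dependence of (2); mislabelling a true or output dependence as anti would permit an incorrect schedule.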
Re: Controlling reloads of movsicc pattern
On Fri, Dec 6, 2013 at 12:41 AM, BELBACHIR Selim wrote: > Hum, I can't change gcc branch because I'm tighted to gnat 7.1.2 based on gcc > 4.7.3 (I saw that LRA was merged in 4.8). I will use a workaround for the > moment (i.e. disable wide offset MEM on conditional moves). > Does someone know if gnat frontend will rebase on 4.8 soon :) ? (or maybe LRA > will be merged in 4.7.4 ?) If this is the Ada front-end, then it is already part of 4.8 release. Or is this some other front-end? Thanks, Andrew Pinski > > Thanks > > Selim > > -Message d'origine- > De : Jeff Law [mailto:l...@redhat.com] > Envoyé : mercredi 4 décembre 2013 18:02 > À : BELBACHIR Selim; gcc@gcc.gnu.org > Objet : Re: Controling reloads of movsicc pattern > > On 12/04/13 03:22, BELBACHIR Selim wrote: >> Hi, >> >> My target has : >> - 2 registers class to store SImode (like m68k, data $D & address $A). >> - moves from wide offset MEM to $D or $A (ex: mov d($A1+50),$A2 or >> mov d($A1+50),$D1) >> - conditional moves from offset MEM to $D or $A but with a restriction : >> offset MEM conditionally moved to $A has a limited offset of >> 0 or 1 (ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ >> d($A1,50),$D1) >> >> The predicate of movsicc pattern tells GCC that wide offset MEM is allowed >> and constraints describe 2 alternatives for 'wide offset MEM -> $D ' and >> 'restricted offset MEM -> $A" : >> >> (define_insn_and_split "movsicc_internal" >>[(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m") >> (if_then_else:SI >>(match_operator 1 "prism_comparison_operator" >> [(match_operand 4 "cc_register" "") (const_int 0)]) >>(match_operand:SI 2 "nonimmediate_operand" " >> v,m,r,0,0,0,v,m,r") ;; "v" constraint is for restricted offset MEM >>(match_operand:SI 3 " nonimmediate_operand" " >> 0,0,0,v,m,r,v,m,r")))] ;; the last 3 alternatives are split to match >> the other alternatives >> >> >> >> I encountered : (on gcc4.7.3) >> >> core_main.c:354:1: error: insn does not satisfy its constraints: >> (insn 1176 1175 337 26 (set (reg:SI 5 $A5) >> (if_then_else:SI (ne (reg:CC 56 $CCI) >> (const_int 0 [0])) >> (mem/c:SI (plus:SI (reg/f:SI 0 $A0) >> (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32]) >> (const_int 1 [0x1]))) core_main.c:211:32 158 >> {movsicc_internal} >> >> Due to reload pass (core_main.c.199r.reload). >> >> >> How can I tune reload or write my movsicc pattern to prevent reload pass >> from generating a conditional move from wide offset MEM to $A registers ?? > If at all possible, I would recommend switching to LRA. There's an up-front > cost, but it's definitely the direction all ports should be heading. > Avoiding reload is, umm, good. > > jeff >
Re: x86 gcc lacks simple optimization
On 06/12/13 09:30, Konstantin Vladimirov wrote: > Hi, > > Consider code: > > int foo(char *t, char *v, int w) > { > int i; > > for (i = 1; i != w; ++i) > { > int x = i << 2; > v[x + 4] = t[x + 4]; > } > > return 0; > } > > Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: > > gcc -O2 -m32 -S test.c > > You will see loop, formed like: > > .L5: > leal 0(,%eax,4), %edx > addl $1, %eax > movzbl 4(%edi,%edx), %ecx > cmpl %ebx, %eax > movb %cl, 4(%esi,%edx) > jne .L5 > > But it can be easily simplified to something like this: > > .L5: > addl $1, %eax > movzbl (%esi,%eax,4), %edx > cmpl %ecx, %eax > movb %dl, (%ebx,%eax,4) > jne .L5 > > (i.e. left shift may be moved to address). > > First question to gcc-help maillist. May be there are some options, > that I've missed, and there IS a way to explain gcc my intention to do > this? > > And second question to gcc developers mail list. I am working on > private backend and want to add this optimization to my backend. What > do you advise me to do -- custom gimple pass, or rtl pass, or modify > some existent pass, etc? > Hi, Usually the gcc developers are not keen on emails going to both the help and development list - they prefer to keep them separate. My first thought when someone finds a "missed optimisation" issue, especially with the x86 target, is are you /sure/ this code is slower? x86 chips are immensely complex, and the interplay between different instructions, pipelines, superscaling, etc., means that code that might appear faster, can actually be slower. So please check your architecture flags (i.e., are you optimising for the "native" cpu, or any other specific cpu - optimised code can be different for different x86 cpus). Then /measure/ the speed of the code to see if there is a real difference. Regarding your "private backend" - is this a modification of the x86 backend, or a completely different target? If it is x86, then I think the answer is "don't do it - work with the mainline code". If it is something else, then an x86-specific optimisation is of little use anyway. mvh., David
RE: Controlling reloads of movsicc pattern
Any official gnat release? Maybe gnat 7.2 beta is based on 4.8; I'll try to get that one. -Original Message- From: Andrew Pinski [mailto:pins...@gmail.com] Sent: Friday, December 6, 2013 09:54 To: BELBACHIR Selim Cc: Jeff Law; gcc@gcc.gnu.org Subject: Re: Controlling reloads of movsicc pattern On Fri, Dec 6, 2013 at 12:41 AM, BELBACHIR Selim wrote: > Hum, I can't change gcc branch because I'm tighted to gnat 7.1.2 based on gcc > 4.7.3 (I saw that LRA was merged in 4.8). I will use a workaround for the > moment (i.e. disable wide offset MEM on conditional moves). > Does someone know if gnat frontend will rebase on 4.8 soon :) ? (or > maybe LRA will be merged in 4.7.4 ?) If this is the Ada front-end, then it is already part of 4.8 release. Or is this some other front-end? Thanks, Andrew Pinski > > Thanks > > Selim > > -Message d'origine- > De : Jeff Law [mailto:l...@redhat.com] > Envoyé : mercredi 4 décembre 2013 18:02 À : BELBACHIR Selim; > gcc@gcc.gnu.org Objet : Re: Controling reloads of movsicc pattern > > On 12/04/13 03:22, BELBACHIR Selim wrote: >> Hi, >> >> My target has : >> - 2 registers class to store SImode (like m68k, data $D & address $A). >> - moves from wide offset MEM to $D or $A (ex: mov d($A1+50),$A2 or >> mov d($A1+50),$D1) >> - conditional moves from offset MEM to $D or $A but with a restriction : >> offset MEM conditionally moved to $A has a limited offset >> of >> 0 or 1 (ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ >> d($A1,50),$D1) >> >> The predicate of movsicc pattern tells GCC that wide offset MEM is allowed >> and constraints describe 2 alternatives for 'wide offset MEM -> $D ' and >> 'restricted offset MEM -> $A" : >> >> (define_insn_and_split "movsicc_internal" >>    [(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m") >> (if_then_else:SI >>    (match_operator 1 "prism_comparison_operator" >> [(match_operand 4 "cc_register" "") (const_int 0)]) >>    (match_operand:SI 2 "nonimmediate_operand" " >> v,m,r,0,0,0,v,m,r") ;; "v" constraint is for restricted offset MEM >>    (match_operand:SI 3 " nonimmediate_operand" " >> 0,0,0,v,m,r,v,m,r")))] ;; the last 3 alternatives are split to match >> the other alternatives >> >> >> >> I encountered : (on gcc4.7.3) >> >> core_main.c:354:1: error: insn does not satisfy its constraints: >> (insn 1176 1175 337 26 (set (reg:SI 5 $A5) >> (if_then_else:SI (ne (reg:CC 56 $CCI) >> (const_int 0 [0])) >> (mem/c:SI (plus:SI (reg/f:SI 0 $A0) >> (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32]) >> (const_int 1 [0x1]))) core_main.c:211:32 158 >> {movsicc_internal} >> >> Due to reload pass (core_main.c.199r.reload). >> >> >> How can I tune reload or write my movsicc pattern to prevent reload pass >> from generating a conditional move from wide offset MEM to $A registers ?? > If at all possible, I would recommend switching to LRA. There's an up-front > cost, but it's definitely the direction all ports should be heading. > Avoiding reload is, umm, good. > > jeff >
Re: x86 gcc lacks simple optimization
Hi, The x86 example was only for ease of reproduction. I am pretty sure this is an architecture-independent issue. Say, on ARM: .L2: mov ip, r3, asl #2 add ip, ip, #4 add r3, r3, #1 ldrb r4, [r0, ip] @ zero_extendqisi2 cmp r3, r2 strb r4, [r1, ip] bne .L2 may be improved to: .L2: add r3, r3, #1 ldrb ip, [r0, r3, asl #2] @ zero_extendqisi2 cmp r3, r2 strb ip, [r1, r3, asl #2] bne .L2 And so on. I myself feel more comfortable with x86, but it is only a matter of taste. To get the improved version of the code, I just did by hand what the compiler is expected to do automatically, i.e. rewrote things as: int foo(char *t, char *v, int w) { int i; for (i = 1; i != w; ++i) { v[(i + 1) << 2] = t[(i + 1) << 2]; } return 0; } The private backend I am working on isn't a modification of any existing one; it is written from scratch. --- With best regards, Konstantin On Fri, Dec 6, 2013 at 1:27 PM, David Brown wrote: > On 06/12/13 09:30, Konstantin Vladimirov wrote: >> Hi, >> >> Consider code: >> >> int foo(char *t, char *v, int w) >> { >> int i; >> >> for (i = 1; i != w; ++i) >> { >> int x = i << 2; >> v[x + 4] = t[x + 4]; >> } >> >> return 0; >> } >> >> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: >> >> gcc -O2 -m32 -S test.c >> >> You will see loop, formed like: >> >> .L5: >> leal 0(,%eax,4), %edx >> addl $1, %eax >> movzbl 4(%edi,%edx), %ecx >> cmpl %ebx, %eax >> movb %cl, 4(%esi,%edx) >> jne .L5 >> >> But it can be easily simplified to something like this: >> >> .L5: >> addl $1, %eax >> movzbl (%esi,%eax,4), %edx >> cmpl %ecx, %eax >> movb %dl, (%ebx,%eax,4) >> jne .L5 >> >> (i.e. left shift may be moved to address). >> >> First question to gcc-help maillist. May be there are some options, >> that I've missed, and there IS a way to explain gcc my intention to do >> this? >> >> And second question to gcc developers mail list. I am working on >> private backend and want to add this optimization to my backend. What >> do you advise me to do -- custom gimple pass, or rtl pass, or modify >> some existent pass, etc? >> > > Hi, > > Usually the gcc developers are not keen on emails going to both the help > and development list - they prefer to keep them separate. > > My first thought when someone finds a "missed optimisation" issue, > especially with the x86 target, is are you /sure/ this code is slower? > x86 chips are immensely complex, and the interplay between different > instructions, pipelines, superscaling, etc., means that code that might > appear faster, can actually be slower. So please check your > architecture flags (i.e., are you optimising for the "native" cpu, or > any other specific cpu - optimised code can be different for different > x86 cpus). Then /measure/ the speed of the code to see if there is a > real difference. > > > Regarding your "private backend" - is this a modification of the x86 > backend, or a completely different target? If it is x86, then I think > the answer is "don't do it - work with the mainline code". If it is > something else, then an x86-specific optimisation is of little use anyway. > > mvh., > > David > > >
Re: Hmmm, I think we've seen this problem before (lto build):
On Fri, Dec 6, 2013 at 5:47 AM, Trevor Saunders wrote: > On Mon, Dec 02, 2013 at 12:16:18PM +0100, Richard Biener wrote: >> On Sun, Dec 1, 2013 at 12:30 PM, Toon Moene wrote: >> > http://gcc.gnu.org/ml/gcc-testresults/2013-12/msg1.html >> > >> > FAILED: Bootstrap (build config: lto; languages: fortran; trunk revision >> > 205557) on x86_64-unknown-linux-gnu >> > >> > In function 'release', >> > inlined from 'release' at /home/toon/compilers/gcc/gcc/vec.h:1428:3, >> > inlined from '__base_dtor ' at >> > /home/toon/compilers/gcc/gcc/vec.h:1195:0, >> > inlined from 'compute_antic_aux' at >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2212:0, >> > inlined from 'compute_antic' at >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2493:0, >> > inlined from 'do_pre' at >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4738:23, >> > inlined from 'execute' at >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4818:0: >> > /home/toon/compilers/gcc/gcc/vec.h:312:3: error: attempt to free a non-heap >> > object 'worklist' [-Werror=free-nonheap-object] >> >::free (v); >> >^ >> > lto1: all warnings being treated as errors >> > make[4]: *** [/dev/shm/wd26755/cczzGuTZ.ltrans13.ltrans.o] Error 1 >> > make[4]: *** Waiting for unfinished jobs >> > lto-wrapper: make returned 2 exit status >> > /usr/bin/ld: lto-wrapper failed >> > collect2: error: ld returned 1 exit status >> >> Yes, I still see this - likely caused by IPA-CP / partial inlining and a >> "bogus" >> warning for unreachable code. > > I'm really sorry about long delay here, I took a week off for > thanksgiving then was pretty busy with other stuff :/ > > If I remove the now useless worklist.release (); on line 2211 of > tree-ssa-pre.c lto bootstrap gets passed this issue to a stage 2 / 3 > comparison failure. However doing that also causes these two test > failures in a normal bootstrap / regression test cycle > > Tests that now fail, but worked before: > > unix/-m32: 17_intro/headers/c++200x/stdc++.cc (test for excess errors) > unix/-m32: 17_intro/headers/c++200x/stdc++_multiple_inclusion.cc (test > for excess errors) > > both of these failures are because of this ICE This must be unrelated - please go ahead and install the patch removing the useless worklist.release () from tree-ssa-pre.c Thanks, Richard. > Executing on host: /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++ > -shared-libgcc -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++ > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/ > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/ > -isystem > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include > -isystem > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include > -m32 > -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0 > -ffunction-sections -fdata-sections -g -O2 -D_GNU_SOURCE -g -O2 > -D_GNU_SOURCE -DLOCALEDIR="."
-nostdinc++ > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include/x86_64-unknown-linux-gnu > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/libsupc++ > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/include/backward > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/util > /tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/17_intro/headers/c++200x/stdc++_multiple_inclusion.cc > -std=gnu++0x -S -m32 -o stdc++_multiple_inclusion.s(timeout = 600) > spawn /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++ -shared-libgcc > -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++ > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/ > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/ > -isystem > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include > -isystem > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include > -m32 > -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0 > -ffunction-sections -fdata-sections -g -O2 -D_GNU_SOURCE -g -O2 > -D_GNU_SOURCE -DLOCALEDIR="." -nostdinc++ > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include/x86_64-unknown-linux-gnu > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/libsup
Re: Truncate optimisation question
On Fri, Dec 6, 2013 at 9:42 AM, Richard Sandiford wrote: > Eric Botcazou writes: >>> Well, I think making the simplify-rtx code conditional on the target >>> would be the wrong way to go. If we really can't live with it being >>> unconditional then I think we should revert it. But like I say I think >>> it would be better to make combine recognise the redundancy even with >>> the new form. (Or as I say, longer term, not to rely on combine to >>> eliminate redundant extensions.) But I don't have time to do that myself... >> >> It helps x86 so we won't revert it. My fear is that we'll need to add code >> in >> other places to RISCify back the result of this "simplification". > > Sorry, realised I didn't respond to this yesterday. I wasn't suggesting > we just revert and walk away. ISTR the original suggestion was to patch > combine instead of simplify-rtx.c, so we could back to that. I think that looks most sensible. Richard. > Thanks, > Richard
Re: x86 gcc lacks simple optimization
On Fri, Dec 06, 2013 at 12:30:54PM +0400, Konstantin Vladimirov wrote: > Consider code: > > int foo(char *t, char *v, int w) > { > int i; > > for (i = 1; i != w; ++i) > { > int x = i << 2; > v[x + 4] = t[x + 4]; > } > > return 0; > } This is either the job of the ivopts pass (dunno why it doesn't consider turning those memory accesses into TARGET_MEM_REF with the *4 multiplication in there), or of the combiner (which is limited to a single use though, so if you do say v[x + 4] = t[i]; in the loop instead, it will for -m32 use the (...,4) addressing, but as you have two uses, it doesn't do it). As others said, the question is whether using the more complex addressing more than once is actually beneficial or not. Anyway, while on this testcase, I wonder why VRP doesn't derive ranges here. : # RANGE [-2147483648, 2147483647] NONZERO 0x0fffc x_5 = i_1 << 2; # RANGE ~[2147483648, 18446744071562067967] NONZERO 0x0fffc _6 = (sizetype) x_5; # RANGE ~[2147483652, 18446744071562067971] NONZERO 0x0fffc _7 = _6 + 4; # PT = nonlocal _9 = v_8(D) + _7; # PT = nonlocal _11 = t_10(D) + _7; _12 = *_11; *_9 = _12; i_14 = i_1 + 1; : # i_1 = PHI <1(2), i_14(3)> if (i_1 != w_4(D)) goto ; else goto ; As i is a signed integer with undefined overflow, and the loop starts with 1 and it is always only incremented, can't we derive # RANGE [1, 2147483647] # i_1 = PHI <1(2), i_14(3)> and similarly for i_14? We likely can't derive a similar range for x_5 because at least in C++ it isn't undefined behavior if it shifts into negative (or is it?), but at least with x = i * 4; instead we could. Jakub
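For reference, the single-use variant Jakub describes, where combine can fold the scaling into the address, looks like this (a sketch; the codegen claim is his, for -m32, and is not re-verified here):

int bar (char *t, char *v, int w)
{
  int i;
  for (i = 1; i != w; ++i)
    {
      int x = i << 2;
      v[x + 4] = t[i];   /* x now has a single use, so combine can form
                            the (...,4) scaled address for the store */
    }
  return 0;
}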
Re: x86 gcc lacks simple optimization
On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov wrote: > Hi, > > Consider code: > > int foo(char *t, char *v, int w) > { > int i; > > for (i = 1; i != w; ++i) > { > int x = i << 2; > v[x + 4] = t[x + 4]; > } > > return 0; > } > > Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: > > gcc -O2 -m32 -S test.c > > You will see loop, formed like: > > .L5: > leal 0(,%eax,4), %edx > addl $1, %eax > movzbl 4(%edi,%edx), %ecx > cmpl %ebx, %eax > movb %cl, 4(%esi,%edx) > jne .L5 > > But it can be easily simplified to something like this: > > .L5: > addl $1, %eax > movzbl (%esi,%eax,4), %edx > cmpl %ecx, %eax > movb %dl, (%ebx,%eax,4) > jne .L5 > > (i.e. left shift may be moved to address). > > First question to gcc-help maillist. May be there are some options, > that I've missed, and there IS a way to explain gcc my intention to do > this? > > And second question to gcc developers mail list. I am working on > private backend and want to add this optimization to my backend. What > do you advise me to do -- custom gimple pass, or rtl pass, or modify > some existent pass, etc? This looks like a deficiency in induction variable optimization. Note that i << 2 may overflow and this overflow does not invoke undefined behavior but is in the implementation defined behavior category. The issue in this case is likely that the SCEV infrastructure does not handle left-shifts. Richard. > --- > With best regards, Konstantin
Re: x86 gcc lacks simple optimization
Hi, nothing changes if everything is unsigned and we are guaranteed to not raise UB on overflow: unsigned foo(unsigned char *t, unsigned char *v, unsigned w) { unsigned i; for (i = 1; i != w; ++i) { unsigned x = i << 2; v[x + 4] = t[x + 4]; } return 0; } yields: .L5: leal 0(,%eax,4), %edx addl $1, %eax movzbl 4(%edi,%edx), %ecx cmpl %ebx, %eax movb %cl, 4(%esi,%edx) jne .L5 What is SCEV infrastructure (guessing scalar evolutions?) and what files/passes to look in? --- With best regards, Konstantin On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener wrote: > On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov > wrote: >> Hi, >> >> Consider code: >> >> int foo(char *t, char *v, int w) >> { >> int i; >> >> for (i = 1; i != w; ++i) >> { >> int x = i << 2; >> v[x + 4] = t[x + 4]; >> } >> >> return 0; >> } >> >> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: >> >> gcc -O2 -m32 -S test.c >> >> You will see loop, formed like: >> >> .L5: >> leal 0(,%eax,4), %edx >> addl $1, %eax >> movzbl 4(%edi,%edx), %ecx >> cmpl %ebx, %eax >> movb %cl, 4(%esi,%edx) >> jne .L5 >> >> But it can be easily simplified to something like this: >> >> .L5: >> addl $1, %eax >> movzbl (%esi,%eax,4), %edx >> cmpl %ecx, %eax >> movb %dl, (%ebx,%eax,4) >> jne .L5 >> >> (i.e. left shift may be moved to address). >> >> First question to gcc-help maillist. May be there are some options, >> that I've missed, and there IS a way to explain gcc my intention to do >> this? >> >> And second question to gcc developers mail list. I am working on >> private backend and want to add this optimization to my backend. What >> do you advise me to do -- custom gimple pass, or rtl pass, or modify >> some existent pass, etc? > > This looks like a deficiency in induction variable optimization. Note > that i << 2 may overflow and this overflow does not invoke undefined > behavior but is in the implementation defined behavior category. > > The issue in this case is likely that the SCEV infrastructure does not handle > left-shifts. > > Richard. > >> --- >> With best regards, Konstantin
Re: x86 gcc lacks simple optimization
On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov wrote: > Hi, > > nothing changes if everything is unsigned and we are guaranteed to not > raise UB on overflow: > > unsigned foo(unsigned char *t, unsigned char *v, unsigned w) > { > unsigned i; > > for (i = 1; i != w; ++i) > { > unsigned x = i << 2; > v[x + 4] = t[x + 4]; > } > > return 0; > } > > yields: > > .L5: > leal 0(,%eax,4), %edx > addl $1, %eax > movzbl 4(%edi,%edx), %ecx > cmpl %ebx, %eax > movb %cl, 4(%esi,%edx) > jne .L5 > > What is SCEV infrastructure (guessing scalar evolutions?) and what > files/passes to look in? tree-scalar-evolution.c, look at where it handles MULT_EXPR but lacks LSHIFT_EXPR support. Richard. > --- > With best regards, Konstantin > > On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener > wrote: >> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov >> wrote: >>> Hi, >>> >>> Consider code: >>> >>> int foo(char *t, char *v, int w) >>> { >>> int i; >>> >>> for (i = 1; i != w; ++i) >>> { >>> int x = i << 2; >>> v[x + 4] = t[x + 4]; >>> } >>> >>> return 0; >>> } >>> >>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: >>> >>> gcc -O2 -m32 -S test.c >>> >>> You will see loop, formed like: >>> >>> .L5: >>> leal 0(,%eax,4), %edx >>> addl $1, %eax >>> movzbl 4(%edi,%edx), %ecx >>> cmpl %ebx, %eax >>> movb %cl, 4(%esi,%edx) >>> jne .L5 >>> >>> But it can be easily simplified to something like this: >>> >>> .L5: >>> addl $1, %eax >>> movzbl (%esi,%eax,4), %edx >>> cmpl %ecx, %eax >>> movb %dl, (%ebx,%eax,4) >>> jne .L5 >>> >>> (i.e. left shift may be moved to address). >>> >>> First question to gcc-help maillist. May be there are some options, >>> that I've missed, and there IS a way to explain gcc my intention to do >>> this? >>> >>> And second question to gcc developers mail list. I am working on >>> private backend and want to add this optimization to my backend. What >>> do you advise me to do -- custom gimple pass, or rtl pass, or modify >>> some existent pass, etc? >> >> This looks like a deficiency in induction variable optimization. Note >> that i << 2 may overflow and this overflow does not invoke undefined >> behavior but is in the implementation defined behavior category. >> >> The issue in this case is likely that the SCEV infrastructure does not handle >> left-shifts. >> >> Richard. >> >>> --- >>> With best regards, Konstantin
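As a rough sketch of what such support might look like (an untested fragment for the switch in interpret_rhs_expr in tree-scalar-evolution.c; the variable names follow the existing MULT_EXPR case there, and the handling of non-constant or out-of-range shift counts is deliberately glossed over):

    case LSHIFT_EXPR:
      /* Sketch: treat A << C as A * (1 << C) so that the existing
         MULT_EXPR analysis applies.  Not a tested patch.  */
      chrec1 = analyze_scalar_evolution (loop, rhs1);
      chrec1 = chrec_convert (type, chrec1, at_stmt);
      res = chrec_fold_multiply (type, chrec1,
                                 fold_build2 (LSHIFT_EXPR, type,
                                              build_int_cst (type, 1),
                                              rhs2));
      break;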
Make SImode the default mode for INT type.
Hi all, We are retargeting gcc 4.8.1 to a 16-bit core, where word = int = short = pointer = 16 bits, char = 8 bits and long = 32 bits. We model the above requirement as #define BITS_PER_UNIT 8 #define BITS_PER_WORD 16 #define UNITS_PER_WORD 2 #define POINTER_SIZE 16 #define SHORT_TYPE_SIZE 16 #define INT_TYPE_SIZE 16 #define LONG_TYPE_SIZE 32 #define FLOAT_TYPE_SIZE 16 #define DOUBLE_TYPE_SIZE 32 Tried to compile the below sample with the retargeted compiler: int a = 10; int b = 10; int func() { return a + b; } The compiler is stating that a and b are globals with short type (HImode) of size 2 bytes, whereas we need the word mode to be SI, not HI. I do understand that the SI and HI modes are of the same size here, but still I would prefer to have SImode. Could somebody, or an expert in the group, share their thoughts on how we can achieve this? Thanks ~Umesh
[Warning] Signed mismatch for basic datatype.
Hi All, The below sample caught my attention, i.e. int a; unsigned int b; int func() { return a = b; } The compiler didn't warn me about the signed mismatch in the above case, whereas for int *a; unsigned int *b; int func() { a = b; return *a; } the compiler warns me: warning: pointer targets in assignment differ in signedness [-Wpointer-sign] I'm a bit confused, or I'm missing something here. Any thoughts? Thanks ~Umesh
Re: x86 gcc lacks simple optimization
On Fri, Dec 6, 2013 at 2:25 AM, Richard Biener wrote: > On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov > wrote: >> Hi, >> >> nothing changes if everything is unsigned and we are guaranteed to not >> raise UB on overflow: >> >> unsigned foo(unsigned char *t, unsigned char *v, unsigned w) >> { >> unsigned i; >> >> for (i = 1; i != w; ++i) >> { >> unsigned x = i << 2; >> v[x + 4] = t[x + 4]; >> } >> >> return 0; >> } >> >> yields: >> >> .L5: >> leal 0(,%eax,4), %edx >> addl $1, %eax >> movzbl 4(%edi,%edx), %ecx >> cmpl %ebx, %eax >> movb %cl, 4(%esi,%edx) >> jne .L5 >> >> What is SCEV infrastructure (guessing scalar evolutions?) and what >> files/passes to look in? > > tree-scalar-evolution.c, look at where it handles MULT_EXPR but > lacks LSHIFT_EXPR support. For -- int foo(char *t, char *v, int w) { int i; for (i = 1; i != w; ++i) { int x = i * 2; v[x + 4] = t[x + 4]; } return 0; } --- -O2 gives: .L6: movzbl 4(%esi,%eax,2), %edx movb %dl, 4(%ebx,%eax,2) addl $1, %eax cmpl %ecx, %eax jne .L6 -- H.J.
Re: [RFC] Vectorization of indexed elements
On Wed, 4 Dec 2013, Vidya Praveen wrote: > Hi Richi, > > Apologies for the late response. I was on vacation. > > On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote: > > > void > > > foo (int *__restrict__ a, > > > int *__restrict__ b, > > > int c) > > > { > > > int i; > > > > > > for (i = 0; i < 8; i++) > > > a[i] = b[i] * c; > > > } > > > > Both cases can be handled by patterns that match > > > > (mul:VXSI (reg:VXSI > > (vec_duplicate:VXSI reg:SI))) > > How do I arrive at this pattern in the first place? Assuming vec_init with > uniform values are expanded as vec_duplicate, it will still be two > expressions. > > That is, > > (set reg:VXSI (vec_duplicate:VXSI (reg:SI))) > (set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI))) Yes, but then combine comes along and creates (set reg:VXSI (mul:VXSI (reg:VXSI) (vec_duplicate:VXSI (reg:SI)))) which matches one of your define_insn[_and_split]s. > > You'd then "consume" the vec_duplicate and implement it as > > load scalar into element zero of the vector and use index mult > > with index zero. > > If I understand this correctly, you are suggesting to leave the scalar > load from memory as it is but treat the > > (mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI))) > > as > > load reg:VXSI[0], reg:SI > mul reg:VXSI, reg:VXSI, reg:VXSI[0] // by reusing the destination register > perhaps > > either by generating instructions directly or by using define_split. Am I > right? Possibly. Or allow memory as operand 2 for your pattern (so, not reg:SI but mem:SI). Combine should be happy with that, too. > If I'm right, then my concern is that it may be possible to simplify this > further > by loading directly to a indexed vector register from memory but it's too > late at > this point for such simplification to be possible. > > Please let me know what am I not understanding. Not sure. Did you try it? Richard.
Re: Make SImode the default mode for INT type.
Umesh Kalappa writes: > Tried to compile the below sample by retargeted compiler > > int a =10; > > int b =10; > > > int func() > > { > > return a+ b; > > } > > the compiler is stating that the a and b are global with short type(HI > mode) of size 2 bytes. Yeah, HImode sounds right in this case. > where as we need the word mode as SI not HI ,I do understand that the > SI and HI modes are of same size but till I insist better to have SI > mode. For a 16-bit target, word_mode should be HImode rather than SImode. The way the modes are defined is that QImode is always one "unit" (byte), HImode is always twice the size of QImode, SImode is always twice the size of HImode, and so on. Other modes like word_mode, Pmode and ptr_mode are defined relative to that. Thanks, Richard
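To make the size relationships concrete, here is a small check that could be compiled with the retargeted compiler (hedged: _Static_assert requires C11, and the expected values assume the configuration quoted above):

/* QImode = 1 unit, HImode = 2 units, SImode = 4 units.  */
_Static_assert (sizeof (char)   == 1, "char is QImode");
_Static_assert (sizeof (short)  == 2, "short is HImode");
_Static_assert (sizeof (int)    == 2, "int is HImode, not SImode");
_Static_assert (sizeof (long)   == 4, "long is SImode");
_Static_assert (sizeof (void *) == 2, "pointers are HImode");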
Re: x86 gcc lacks simple optimization
On Fri, 6 Dec 2013, Konstantin Vladimirov wrote: Consider code: int foo(char *t, char *v, int w) { int i; for (i = 1; i != w; ++i) { int x = i << 2; A side note, but something too few people seem to be aware of: writing i<<2 can pessimize code compared to i*4 (and it is never faster). That is because, at a high level, signed multiplication overflow is undefined behavior while shift isn't. At a low level, gcc knows it can implement *4 as a shift anyway. v[x + 4] = t[x + 4]; } return 0; } -- Marc Glisse
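Concretely, Marc's suggestion applied to the original testcase looks like this (a sketch; the claim that gcc emits the same shift either way is his, not re-measured here):

int foo_mul (char *t, char *v, int w)
{
  int i;
  for (i = 1; i != w; ++i)
    {
      int x = i * 4;   /* same machine code as i << 2, but overflow of the
                          signed multiply is undefined behavior, which the
                          optimizers can exploit */
      v[x + 4] = t[x + 4];
    }
  return 0;
}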
Re: x86 gcc lacks simple optimization
Hi, Richard, I tried adding an LSHIFT_EXPR case to tree-scalar-evolution.c and now it yields code like this (x86 again): .L5: movzbl 4(%esi,%eax,4), %edx movb %dl, 4(%ebx,%eax,4) addl $1, %eax cmpl %ecx, %eax jne .L5 So the excessive lea is gone. It is great, thank you so much. But I wonder what else I can do to move the add earlier to simplify the memory accesses (I am guessing this is some arithmetical re-association, but I am still not sure where to look). For the architecture I am working on, it is important. What would you advise? --- With best regards, Konstantin On Fri, Dec 6, 2013 at 2:25 PM, Richard Biener wrote: > On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov > wrote: >> Hi, >> >> nothing changes if everything is unsigned and we are guaranteed to not >> raise UB on overflow: >> >> unsigned foo(unsigned char *t, unsigned char *v, unsigned w) >> { >> unsigned i; >> >> for (i = 1; i != w; ++i) >> { >> unsigned x = i << 2; >> v[x + 4] = t[x + 4]; >> } >> >> return 0; >> } >> >> yields: >> >> .L5: >> leal 0(,%eax,4), %edx >> addl $1, %eax >> movzbl 4(%edi,%edx), %ecx >> cmpl %ebx, %eax >> movb %cl, 4(%esi,%edx) >> jne .L5 >> >> What is SCEV infrastructure (guessing scalar evolutions?) and what >> files/passes to look in? > > tree-scalar-evolution.c, look at where it handles MULT_EXPR but > lacks LSHIFT_EXPR support. > > Richard. > >> --- >> With best regards, Konstantin >> >> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener >> wrote: >>> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov >>> wrote: Hi, Consider code: int foo(char *t, char *v, int w) { int i; for (i = 1; i != w; ++i) { int x = i << 2; v[x + 4] = t[x + 4]; } return 0; } Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: gcc -O2 -m32 -S test.c You will see loop, formed like: .L5: leal 0(,%eax,4), %edx addl $1, %eax movzbl 4(%edi,%edx), %ecx cmpl %ebx, %eax movb %cl, 4(%esi,%edx) jne .L5 But it can be easily simplified to something like this: .L5: addl $1, %eax movzbl (%esi,%eax,4), %edx cmpl %ecx, %eax movb %dl, (%ebx,%eax,4) jne .L5 (i.e. left shift may be moved to address). First question to gcc-help maillist. May be there are some options, that I've missed, and there IS a way to explain gcc my intention to do this? And second question to gcc developers mail list. I am working on private backend and want to add this optimization to my backend. What do you advise me to do -- custom gimple pass, or rtl pass, or modify some existent pass, etc? >>> >>> This looks like a deficiency in induction variable optimization. Note >>> that i << 2 may overflow and this overflow does not invoke undefined >>> behavior but is in the implementation defined behavior category. >>> >>> The issue in this case is likely that the SCEV infrastructure does not >>> handle >>> left-shifts. >>> >>> Richard. >>> --- With best regards, Konstantin
Oleg Endo appointed co-maintainer of SH port
I am pleased to announce that the GCC Steering Committee has appointed Oleg Endo as co-maintainer of the SH port. Please join me in congratulating Oleg on his new role. Oleg, please update your listing in the MAINTAINERS file. Happy hacking! David
Re: x86 gcc lacks simple optimization
On Fri, Dec 6, 2013 at 2:52 PM, Konstantin Vladimirov wrote: > Hi, > > Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c > and now it yields code like (x86 again): > > .L5: > movzbl 4(%esi,%eax,4), %edx > movb %dl, 4(%ebx,%eax,4) > addl $1, %eax > cmpl %ecx, %eax > jne .L5 > > So, excessive lea is gone. It is great, thank you so much. But I > wonder what else can I do to move add upper to simplify memory > accesses (I am guessing, this is some arithmetical re-associations, > still not sure where to look). For architecture, I am working on, it > is important. What would you advise? You need to look at IVOPTs and how it arrives at the choice of induction variables. Richard. > --- > With best regards, Konstantin > > On Fri, Dec 6, 2013 at 2:25 PM, Richard Biener > wrote: >> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov >> wrote: >>> Hi, >>> >>> nothing changes if everything is unsigned and we are guaranteed to not >>> raise UB on overflow: >>> >>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w) >>> { >>> unsigned i; >>> >>> for (i = 1; i != w; ++i) >>> { >>> unsigned x = i << 2; >>> v[x + 4] = t[x + 4]; >>> } >>> >>> return 0; >>> } >>> >>> yields: >>> >>> .L5: >>> leal 0(,%eax,4), %edx >>> addl $1, %eax >>> movzbl 4(%edi,%edx), %ecx >>> cmpl %ebx, %eax >>> movb %cl, 4(%esi,%edx) >>> jne .L5 >>> >>> What is SCEV infrastructure (guessing scalar evolutions?) and what >>> files/passes to look in? >> >> tree-scalar-evolution.c, look at where it handles MULT_EXPR but >> lacks LSHIFT_EXPR support. >> >> Richard. >> >>> --- >>> With best regards, Konstantin >>> >>> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener >>> wrote: On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov wrote: > Hi, > > Consider code: > > int foo(char *t, char *v, int w) > { > int i; > > for (i = 1; i != w; ++i) > { > int x = i << 2; > v[x + 4] = t[x + 4]; > } > > return 0; > } > > Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options: > > gcc -O2 -m32 -S test.c > > You will see loop, formed like: > > .L5: > leal 0(,%eax,4), %edx > addl $1, %eax > movzbl 4(%edi,%edx), %ecx > cmpl %ebx, %eax > movb %cl, 4(%esi,%edx) > jne .L5 > > But it can be easily simplified to something like this: > > .L5: > addl $1, %eax > movzbl (%esi,%eax,4), %edx > cmpl %ecx, %eax > movb %dl, (%ebx,%eax,4) > jne .L5 > > (i.e. left shift may be moved to address). > > First question to gcc-help maillist. May be there are some options, > that I've missed, and there IS a way to explain gcc my intention to do > this? > > And second question to gcc developers mail list. I am working on > private backend and want to add this optimization to my backend. What > do you advise me to do -- custom gimple pass, or rtl pass, or modify > some existent pass, etc? This looks like a deficiency in induction variable optimization. Note that i << 2 may overflow and this overflow does not invoke undefined behavior but is in the implementation defined behavior category. The issue in this case is likely that the SCEV infrastructure does not handle left-shifts. Richard. > --- > With best regards, Konstantin
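For illustration, the induction-variable choice ivopts could make here corresponds roughly to this hand-rewritten source with a single pointer IV (a sketch only, assuming w >= 1 and in-bounds accesses; the wrap-around corner cases of the original i != w exit condition are not reproduced exactly):

int foo_ptr (char *t, char *v, int w)
{
  char *ts = t + 8;               /* first accessed element, i == 1 */
  char *vs = v + 8;
  char *te = ts + 4 * (w - 1);    /* one past the last accessed element */
  for (; ts != te; ts += 4, vs += 4)
    *vs = *ts;
  return 0;
}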
Re: [Warning] Signed mismatch for basic datatype.
On 12/06/2013 10:41 AM, Umesh Kalappa wrote: > I'm a bit confused, or I'm missing something here. The first of these is implementation-defined behaviour, and the second is (potentially) undefined behaviour. This is more of a generic C question than a GCC question. Andrew.
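A minimal sketch of the distinction (illustrative only, not from the original mail):

int demo (void)
{
  unsigned int u = 4294967295u; /* UINT_MAX with 32-bit int */
  int a = u;        /* value conversion: the result is implementation-defined
                       (commonly -1 on two's-complement targets), so no
                       diagnostic is required */
  unsigned int *q = 0;
  int *p = q;       /* pointer types with differently signed targets are
                       not compatible: a constraint violation, hence the
                       -Wpointer-sign warning */
  (void) p;
  return a;
}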
Re: Make SImode the default mode for INT type.
On Dec 6, 2013, at 5:40 AM, Umesh Kalappa wrote: > Hi all, > > We are re-targeting the gcc 4.8.1 to the 16 bit core ,where word =int > = short = pointer= 16 , char = 8 bit and long =32 bit. > > We model the above requirement as > > #define BITS_PER_UNIT 8 > > #define BITS_PER_WORD 16 > > #define UNITS_PER_WORD 2 > > #define POINTER_SIZE16 > > #define SHORT_TYPE_SIZE 16 > > #define INT_TYPE_SIZE 16 > > #define LONG_TYPE_SIZE 32 > > #define FLOAT_TYPE_SIZE 16 > > #define DOUBLE_TYPE_SIZE32 > > Tried to compile the below sample by retargeted compiler > > int a =10; > > int b =10; > > > int func() > > { > > return a+ b; > > } > > the compiler is stating that the a and b are global with short type(HI > mode) of size 2 bytes. > > where as we need the word mode as SI not HI ,I do understand that the > SI and HI modes are of same size but till I insist better to have SI > mode. > > Please somebody or expert in the group share their thought on the > same like how do we can achieve this ? > > Thanks > ~Umesh As Richard mentioned, SImode is not "the mode for int" but rather "the mode for the type that's 4x the size of QImode". So in your case, that would be the mode for "long" and HImode is the mode for "int". Apart from the float and double sizes, what you describe is just like the pdp11 target. And indeed that target has int == HImode as expected. paul
Re: Hmmm, I think we've seen this problem before (lto build):
On Fri, Dec 06, 2013 at 10:47:00AM +0100, Richard Biener wrote: > On Fri, Dec 6, 2013 at 5:47 AM, Trevor Saunders wrote: > > On Mon, Dec 02, 2013 at 12:16:18PM +0100, Richard Biener wrote: > >> On Sun, Dec 1, 2013 at 12:30 PM, Toon Moene wrote: > >> > http://gcc.gnu.org/ml/gcc-testresults/2013-12/msg1.html > >> > > >> > FAILED: Bootstrap (build config: lto; languages: fortran; trunk revision > >> > 205557) on x86_64-unknown-linux-gnu > >> > > >> > In function 'release', > >> > inlined from 'release' at /home/toon/compilers/gcc/gcc/vec.h:1428:3, > >> > inlined from '__base_dtor ' at > >> > /home/toon/compilers/gcc/gcc/vec.h:1195:0, > >> > inlined from 'compute_antic_aux' at > >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2212:0, > >> > inlined from 'compute_antic' at > >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2493:0, > >> > inlined from 'do_pre' at > >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4738:23, > >> > inlined from 'execute' at > >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4818:0: > >> > /home/toon/compilers/gcc/gcc/vec.h:312:3: error: attempt to free a > >> > non-heap > >> > object 'worklist' [-Werror=free-nonheap-object] > >> >::free (v); > >> >^ > >> > lto1: all warnings being treated as errors > >> > make[4]: *** [/dev/shm/wd26755/cczzGuTZ.ltrans13.ltrans.o] Error 1 > >> > make[4]: *** Waiting for unfinished jobs > >> > lto-wrapper: make returned 2 exit status > >> > /usr/bin/ld: lto-wrapper failed > >> > collect2: error: ld returned 1 exit status > >> > >> Yes, I still see this - likely caused by IPA-CP / partial inlining and a > >> "bogus" > >> warning for unreachable code. > > > > I'm really sorry about long delay here, I took a week off for > > thanksgiving then was pretty busy with other stuff :/ > > > > If I remove the now useless worklist.release (); on line 2211 of > > tree-ssa-pre.c lto bootstrap gets passed this issue to a stage 2 / 3 > > comparison failure. However doing that also causes these two test > > failures in a normal bootstrap / regression test cycle > > > > Tests that now fail, but worked before: > > > > unix/-m32: 17_intro/headers/c++200x/stdc++.cc (test for excess errors) > > unix/-m32: 17_intro/headers/c++200x/stdc++_multiple_inclusion.cc (test > > for excess errors) > > > > both of these failures are because of this ICE > > This must be unrelated - please go ahead and install the patch removing > the useless worklist.release () from tree-ssa-pre.c done, r205750. Sorry about the breakage. Trev > > Thanks, > Richard. > > > Executing on host: /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++ > > -shared-libgcc -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++ > > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src > > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs > > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/ > > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/ > > -isystem > > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include > > -isystem > > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include > > -m32 > > -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > > -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0 > > -ffunction-sections -fdata-sections -g -O2 -D_GNU_SOURCE -g -O2 > > -D_GNU_SOURCE -DLOCALEDIR="."
-nostdinc++ > > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include/x86_64-unknown-linux-gnu > > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include > > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/libsupc++ > > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/include/backward > > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/util > > /tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/17_intro/headers/c++200x/stdc++_multiple_inclusion.cc > > -std=gnu++0x -S -m32 -o stdc++_multiple_inclusion.s(timeout = 600) > > spawn /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++ -shared-libgcc > > -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++ > > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src > > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs > > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/ > > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/ > > -isystem > > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include > > -isystem > > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include > > -m32 > > -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs > > -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0 > > -ffunction-sections -fdata-sections -
Re: Oleg Endo appointed co-maintainer of SH port
On Fri, 2013-12-06 at 09:05 -0500, David Edelsohn wrote: > I am pleased to announce that the GCC Steering Committee has > appointed Oleg Endo as co-maintainer of the SH port. > > Please join me in congratulating Oleg on his new role. > Oleg, please update your listing in the MAINTAINERS file. Thank you. I've just committed the following. Index: MAINTAINERS === --- MAINTAINERS (revision 205756) +++ MAINTAINERS (working copy) @@ -102,6 +102,7 @@ score port Chen Liqin liqin@gmail.com sh port Alexandre Oliva aol...@redhat.com sh port Kaz Kojima kkoj...@gcc.gnu.org +sh port Oleg Endo olege...@gcc.gnu.org sparc port Richard Henderson r...@redhat.com sparc port David S. Miller da...@redhat.com sparc port Eric Botcazou ebotca...@libertysurf.fr @@ -364,7 +365,6 @@ Bernd Edlinger bernd.edlin...@hotmail.de Phil Edwards p...@gcc.gnu.org Mohan Embar gnust...@thisiscool.com -Oleg Endo olege...@gcc.gnu.org Revital Eres e...@il.ibm.com Marc Espie es...@cvs.openbsd.org Rafael Ávila de Espíndola espind...@google.com
Re: x86 gcc lacks simple optimization
On 12/06/13 07:17, Richard Biener wrote: On Fri, Dec 6, 2013 at 2:52 PM, Konstantin Vladimirov wrote: Hi, Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c and now it yields code like (x86 again): .L5: movzbl 4(%esi,%eax,4), %edx movb %dl, 4(%ebx,%eax,4) addl $1, %eax cmpl %ecx, %eax jne .L5 So, excessive lea is gone. It is great, thank you so much. But I wonder what else can I do to move add upper to simplify memory accesses (I am guessing, this is some arithmetical re-associations, still not sure where to look). For architecture, I am working on, it is important. What would you advise? You need to look at IVOPTs and how it arrives at the choice of induction variables. Konstantin, You might want to work with Ben Cheng from ARM; he's already poking in this code and may have some ideas. jeff
Re: [RFC, LRA] Repeated looping over subreg reloads.
On 12/5/2013, 9:35 AM, Tejas Belagod wrote: Vladimir Makarov wrote: On 12/4/2013, 6:15 AM, Tejas Belagod wrote: Hi, I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all mode changes on FP_REGS as aarch64 does not have register-packing, but I'm running into an LRA ICE. A test case generates an RTL subreg of the following form (set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8)) LRA has to reload the subreg because the subreg is not representable as a full register. When LRA reloads this in lra-constraints.c:simplify_operand_subreg (), it seems to reload SUBREG_REG() and leave the byte offset alone, i.e. (set (reg:V2DF 100) (reg:V2DF 95)) (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8)) The code in lra-constraints.c is this conditional: /* Force a reload of the SUBREG_REG if this is a constant or PLUS or if there may be a problem accessing OPERAND in the outer mode. */ if ((REG_P (reg) insert_move_for_subreg (insert_before ? &before : NULL, insert_after ? &after : NULL, reg, new_reg); } What happens subsequently is that LRA keeps looping over this RTL and keeps reloading the SUBREG_REG() till the limit of constraint passes is reached. (set (reg:V2DF 100) (reg:V2DF 95)) (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8)) I can't see any place where this subreg is resolved (e.g. into an equivalent memref) before the next iteration comes around for reloading the inputs and outputs of curr_insn. Or am I missing some part of the code that tries reloading the subreg with different alternatives or reg classes? I guess this behaviour is wrong. We could spill the V2DF pseudo or put it into another class reg. But it is not implemented. This code is actually a modified version of the one in the reload pass. We could implement alternative strategies and a check for potential looping (such code exists in process_alt_operands). Could you send me the macro change and the test. I'll look at it and figure out what we can do. Hi, Thanks for looking at this. The macro change is in this patch http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03638.html. The test is gcc.c-torture/compile/simd-3.c and when compiled with -O1 for aarch64, it ICEs: gcc/testsuite/gcc.c-torture/compile/simd-3.c:22:1: internal compiler error: Maximum number of LRA constraint passes is achieved (30) Also, I'm curious to know - is it possible to use vec_extract for vector-mode subregs and zero/sign extract for scalars, with spilling as the last resort if neither of these is possible? As you say, a non-zero SUBREG_BYTE offset could also be resolved using a different regclass where the sub-mode could just be a full register. Here is the patch which solves the problem. Right now it is only spilling, but that is the best that can be done for this case. I'll submit the patch next week after better testing on different platforms. Vec_extract is interesting, but it is a rare case that needs a lot of code to implement. I think we need a more general approach called bitwidth-aware RA (putting several pseudo values into regs, e.g. vec regs). Although I don't know whether it will help for arm64 cpus. Last time I manually checked bitwidth-aware RA for Intel CPUs, it made code bigger and slower. If there is a mainstream processor for which it can improve performance, I'd put it higher on my priority list. Index: lra-constraints.c === --- lra-constraints.c (revision 205449) +++ lra-constraints.c (working copy) @@ -1237,9 +1237,20 @@ simplify_operand_subreg (int nop, enum m && ! LRA_SUBREG_P (operand)) || CONSTANT_P (reg) || GET_CODE (reg) == PLUS || MEM_P (reg)) { - /* The class will be defined later in curr_insn_transform. */ - enum reg_class rclass - = (enum reg_class) targetm.preferred_reload_class (reg, ALL_REGS); + enum reg_class rclass; + + if (REG_P (reg) + && curr_insn_set != NULL_RTX + && (REG_P (SET_SRC (curr_insn_set)) + || GET_CODE (SET_SRC (curr_insn_set)) == SUBREG)) + /* There is big probability that we will get the same class + for the new pseudo and we will get the same insn which + means infinite looping. So spill the new pseudo. */ + rclass = NO_REGS; + else + /* The class will be defined later in curr_insn_transform. */ + rclass + = (enum reg_class) targetm.preferred_reload_class (reg, ALL_REGS); if (get_reload_reg (curr_static_id->operand[nop].type, reg_mode, reg, rclass, "subreg reg", &new_reg))