x86 gcc lacks simple optimization

2013-12-06 Thread Konstantin Vladimirov
Hi,

Consider code:

int foo(char *t, char *v, int w)
{
int i;

for (i = 1; i != w; ++i)
{
int x = i << 2;
v[x + 4] = t[x + 4];
}

return 0;
}

Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:

gcc -O2 -m32 -S test.c

You will see a loop formed like:

.L5:
leal 0(,%eax,4), %edx
addl $1, %eax
movzbl 4(%edi,%edx), %ecx
cmpl %ebx, %eax
movb %cl, 4(%esi,%edx)
jne .L5

But it can be easily simplified to something like this:

.L5:
addl $1, %eax
movzbl (%esi,%eax,4), %edx
cmpl %ecx, %eax
movb %dl, (%ebx,%eax,4)
jne .L5

(i.e. the left shift may be moved into the address).

First, a question for the gcc-help mailing list. Maybe there are some
options that I've missed, and there IS a way to explain to gcc my
intention to do this?

And a second question for the gcc developers mailing list. I am working
on a private backend and want to add this optimization to it. What would
you advise -- a custom gimple pass, an rtl pass, modifying some existing
pass, etc.?

---
With best regards, Konstantin


RE: Controlling reloads of movsicc pattern

2013-12-06 Thread BELBACHIR Selim
Hmm, I can't change the gcc branch because I'm tied to gnat 7.1.2, based on gcc 
4.7.3 (I saw that LRA was merged in 4.8). I will use a workaround for the 
moment (i.e. disable wide offset MEM on conditional moves).
Does someone know if the gnat frontend will rebase on 4.8 soon :) ? (or maybe LRA 
will be merged in 4.7.4?)

Thanks

Selim

-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Wednesday, December 4, 2013 18:02
To: BELBACHIR Selim; gcc@gcc.gnu.org
Subject: Re: Controlling reloads of movsicc pattern

On 12/04/13 03:22, BELBACHIR Selim wrote:
> Hi,
>
> My target has :
> - 2 registers class to store SImode (like m68k, data $D & address $A).
> - moves from wide offset MEM to $D or $A   (ex: mov d($A1+50),$A2   ormov 
> d($A1+50),$D1)
> - conditional moves from offset MEM to $D or $A but with a restriction :
>   offset MEM conditionally moved to $A has a limited offset of 
> 0 or 1 (ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ 
> d($A1,50),$D1)
>
> The predicate of movsicc pattern tells GCC that wide offset MEM is allowed 
> and constraints describe 2 alternatives for 'wide offset MEM -> $D ' and 
> 'restricted offset MEM -> $A" :
>
> (define_insn_and_split "movsicc_internal"
>[(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m")
>  (if_then_else:SI
>(match_operator 1 "prism_comparison_operator"
> [(match_operand 4 "cc_register" "") (const_int 0)])
>(match_operand:SI 2 "nonimmediate_operand"   " v,m,r,0,0,0,v,m,r") 
> ;; "v" constraint is for restricted offset MEM
>(match_operand:SI 3 " nonimmediate_operand" " 
> 0,0,0,v,m,r,v,m,r")))] ;; the last 3 alternatives are split to match 
> the other alternatives
>
>
>
> I encountered : (on gcc4.7.3)
>
> core_main.c:354:1: error: insn does not satisfy its constraints:
> (insn 1176 1175 337 26 (set (reg:SI 5 $A5)
>  (if_then_else:SI (ne (reg:CC 56 $CCI)
>  (const_int 0 [0]))
>  (mem/c:SI (plus:SI (reg/f:SI 0 $A0)
>  (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32])
>  (const_int 1 [0x1]))) core_main.c:211:32 158 
> {movsicc_internal}
>
> Due to reload pass (core_main.c.199r.reload).
>
>
> How can I tune reload or write my movsicc pattern to prevent reload pass from 
> generating a conditional move from wide offset MEM to $A registers ??
If at all possible, I would recommend switching to LRA.  There's an up-front 
cost, but it's definitely the direction all ports should be heading.  Avoiding 
reload is, umm, good.

jeff



Re: Truncate optimisation question

2013-12-06 Thread Richard Sandiford
Eric Botcazou  writes:
>> Well, I think making the simplify-rtx code conditional on the target
>> would be the wrong way to go.  If we really can't live with it being
>> unconditional then I think we should revert it.  But like I say I think
>> it would be better to make combine recognise the redundancy even with
>> the new form.  (Or as I say, longer term, not to rely on combine to
>> eliminate redundant extensions.)  But I don't have time to do that myself...
>
> It helps x86 so we won't revert it.  My fear is that we'll need to add code 
> in 
> other places to RISCify back the result of this "simplification".

Sorry, realised I didn't respond to this yesterday.  I wasn't suggesting
we just revert and walk away.  ISTR the original suggestion was to patch
combine instead of simplify-rtx.c, so we could go back to that.

Thanks,
Richard


Re: Dependency confusion in sched-deps

2013-12-06 Thread shmeel gutl

On 06-Dec-13 01:34 AM, Maxim Kuvyrkov wrote:

On 6/12/2013, at 8:44 am, shmeel gutl  wrote:


On 05-Dec-13 02:39 AM, Maxim Kuvyrkov wrote:

Dependency type plays a role for estimating costs and latencies between 
instructions (which affects performance), but using wrong or imprecise 
dependency type does not affect correctness.

On multi-issue architectures it does make a difference. Anti dependence permits 
the two instructions to be issued during the same cycle whereas true dependency 
and output dependency would forbid this.

Or am I misinterpreting your comment?

On VLIW-flavoured machines without resource conflict checking -- "yes", it is 
critical not to use an anti dependency where an output or true dependency exists.  This is 
the case, though, only because these machines do not follow sequential semantics for 
instruction execution (i.e., effects from previous instructions are not necessarily 
observed by subsequent instructions on the same/close cycles).

On machines with internal resource conflict checking, having a wrong type on the 
dependency should not cause wrong behavior, but "only" suboptimal performance.
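
A minimal illustration of the three kinds (hypothetical registers, one
instruction per slot):

  true (read after write):    r1 = r2 + r3 ;  r4 = r1 + 1
  output (write after write): r1 = r2 + r3 ;  r1 = r4 + 1
  anti (write after read):    r4 = r1 + 1  ;  r1 = r2 + r3

Under sequential read semantics, only the anti-dependent pair may issue
in the same cycle, because the read of r1 still observes the old value;
mislabeling a true or output dependency as anti would wrongly permit
dual issue on a VLIW without conflict checking.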

Thank you,

--
Maxim Kuvyrkov
www.kugelworks.com


Earlier in the thread you wrote

Output dependency is the right type (write after write).  Anti dependency is 
write after read, and true dependency is read after write.
Should the code be changed to accommodate VLIW machines?  It has been 
there since the module was originally checked into trunk.




Re: Controlling reloads of movsicc pattern

2013-12-06 Thread Andrew Pinski
On Fri, Dec 6, 2013 at 12:41 AM, BELBACHIR Selim
 wrote:
> Hum, I can't change gcc branch because I'm tighted to gnat 7.1.2 based on gcc 
> 4.7.3 (I saw that LRA was merged in 4.8). I will use a workaround for the 
> moment (i.e. disable wide offset MEM on conditional moves).
> Does someone know if gnat frontend will rebase on 4.8 soon :) ? (or maybe LRA 
> will be merged in 4.7.4 ?)

If this is the Ada front-end, then it is already part of 4.8 release.
Or is this some other front-end?

Thanks,
Andrew Pinski


>
> Thanks
>
> Selim
>
> -----Original Message-----
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Wednesday, December 4, 2013 18:02
> To: BELBACHIR Selim; gcc@gcc.gnu.org
> Subject: Re: Controlling reloads of movsicc pattern
>
> On 12/04/13 03:22, BELBACHIR Selim wrote:
>> Hi,
>>
>> My target has :
>> - 2 registers class to store SImode (like m68k, data $D & address $A).
>> - moves from wide offset MEM to $D or $A   (ex: mov d($A1+50),$A2   or
>> mov d($A1+50),$D1)
>> - conditional moves from offset MEM to $D or $A but with a restriction :
>>   offset MEM conditionally moved to $A has a limited offset of
>> 0 or 1 (ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ
>> d($A1,50),$D1)
>>
>> The predicate of movsicc pattern tells GCC that wide offset MEM is allowed 
>> and constraints describe 2 alternatives for 'wide offset MEM -> $D ' and 
>> 'restricted offset MEM -> $A" :
>>
>> (define_insn_and_split "movsicc_internal"
>>[(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m")
>>  (if_then_else:SI
>>(match_operator 1 "prism_comparison_operator"
>> [(match_operand 4 "cc_register" "") (const_int 0)])
>>(match_operand:SI 2 "nonimmediate_operand"   " 
>> v,m,r,0,0,0,v,m,r") ;; "v" constraint is for restricted offset MEM
>>(match_operand:SI 3 " nonimmediate_operand" "
>> 0,0,0,v,m,r,v,m,r")))] ;; the last 3 alternatives are split to match
>> the other alternatives
>>
>>
>>
>> I encountered : (on gcc4.7.3)
>>
>> core_main.c:354:1: error: insn does not satisfy its constraints:
>> (insn 1176 1175 337 26 (set (reg:SI 5 $A5)
>>  (if_then_else:SI (ne (reg:CC 56 $CCI)
>>  (const_int 0 [0]))
>>  (mem/c:SI (plus:SI (reg/f:SI 0 $A0)
>>  (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32])
>>  (const_int 1 [0x1]))) core_main.c:211:32 158
>> {movsicc_internal}
>>
>> Due to reload pass (core_main.c.199r.reload).
>>
>>
>> How can I tune reload or write my movsicc pattern to prevent reload pass 
>> from generating a conditional move from wide offset MEM to $A registers ??
> If at all possible, I would recommend switching to LRA.  There's an up-front 
> cost, but it's definitely the direction all ports should be heading.  
> Avoiding reload is, umm, good.
>
> jeff
>


Re: x86 gcc lacks simple optimization

2013-12-06 Thread David Brown
On 06/12/13 09:30, Konstantin Vladimirov wrote:
> Hi,
> 
> Consider code:
> 
> int foo(char *t, char *v, int w)
> {
> int i;
> 
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
> 
> return 0;
> }
> 
> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
> 
> gcc -O2 -m32 -S test.c
> 
> You will see loop, formed like:
> 
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
> 
> But it can be easily simplified to something like this:
> 
> .L5:
> addl $1, %eax
> movzbl (%esi,%eax,4), %edx
> cmpl %ecx, %eax
> movb %dl, (%ebx,%eax,4)
> jne .L5
> 
> (i.e. left shift may be moved to address).
> 
> First question to gcc-help maillist. May be there are some options,
> that I've missed, and there IS a way to explain gcc my intention to do
> this?
> 
> And second question to gcc developers mail list. I am working on
> private backend and want to add this optimization to my backend. What
> do you advise me to do -- custom gimple pass, or rtl pass, or modify
> some existent pass, etc?
> 

Hi,

Usually the gcc developers are not keen on emails going to both the help
and development list - they prefer to keep them separate.

My first thought when someone finds a "missed optimisation" issue,
especially with the x86 target, is: are you /sure/ this code is slower?
x86 chips are immensely complex, and the interplay between different
instructions, pipelines, superscalar execution, etc., means that code
that might appear faster can actually be slower.  So please check your
architecture flags (i.e., are you optimising for the "native" cpu, or
any other specific cpu - optimised code can be different for different
x86 cpus).  Then /measure/ the speed of the code to see if there is a
real difference.
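
For instance, a minimal harness of the sort one might use (a sketch
assuming POSIX clock_gettime; on older glibc, link with -lrt):

  #include <stdio.h>
  #include <time.h>

  extern int foo(char *t, char *v, int w);   /* the function under test */

  int main(void)
  {
      static char t[1 << 20], v[1 << 20];    /* big enough for w below */
      struct timespec a, b;
      int r;

      clock_gettime(CLOCK_MONOTONIC, &a);
      for (r = 0; r < 1000; ++r)
          foo(t, v, (1 << 18) - 1);          /* max index stays in bounds */
      clock_gettime(CLOCK_MONOTONIC, &b);
      printf("%.3f ms\n", (b.tv_sec - a.tv_sec) * 1e3
                          + (b.tv_nsec - a.tv_nsec) / 1e6);
      return 0;
  }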


Regarding your "private backend" - is this a modification of the x86
backend, or a completely different target?  If it is x86, then I think
the answer is "don't do it - work with the mainline code".  If it is
something else, then an x86-specific optimisation is of little use anyway.

mvh.,

David





RE: Controlling reloads of movsicc pattern

2013-12-06 Thread BELBACHIR Selim
Is there any official gnat release? Maybe gnat 7.2 beta is based on 4.8; I'll try to 
get that one.


-----Original Message-----
From: Andrew Pinski [mailto:pins...@gmail.com]
Sent: Friday, December 6, 2013 09:54
To: BELBACHIR Selim
Cc: Jeff Law; gcc@gcc.gnu.org
Subject: Re: Controlling reloads of movsicc pattern

On Fri, Dec 6, 2013 at 12:41 AM, BELBACHIR Selim 
 wrote:
> Hum, I can't change gcc branch because I'm tighted to gnat 7.1.2 based on gcc 
> 4.7.3 (I saw that LRA was merged in 4.8). I will use a workaround for the 
> moment (i.e. disable wide offset MEM on conditional moves).
> Does someone know if gnat frontend will rebase on 4.8 soon :) ? (or 
> maybe LRA will be merged in 4.7.4 ?)

If this is the Ada front-end, then it is already part of 4.8 release.
Or is this some other front-end?

Thanks,
Andrew Pinski


>
> Thanks
>
> Selim
>
> -----Original Message-----
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Wednesday, December 4, 2013 18:02
> To: BELBACHIR Selim; gcc@gcc.gnu.org
> Subject: Re: Controlling reloads of movsicc pattern
>
> On 12/04/13 03:22, BELBACHIR Selim wrote:
>> Hi,
>>
>> My target has :
>> - 2 registers class to store SImode (like m68k, data $D & address $A).
>> - moves from wide offset MEM to $D or $A   (ex: mov d($A1+50),$A2   or
>> mov d($A1+50),$D1)
>> - conditional moves from offset MEM to $D or $A but with a restriction :
>>   offset MEM conditionally moved to $A has a limited offset 
>> of
>> 0 or 1 (ex: mov.ifEQ d($A1,1),$A1 whereas we can still do mov.ifEQ
>> d($A1,50),$D1)
>>
>> The predicate of movsicc pattern tells GCC that wide offset MEM is allowed 
>> and constraints describe 2 alternatives for 'wide offset MEM -> $D ' and 
>> 'restricted offset MEM -> $A" :
>>
>> (define_insn_and_split "movsicc_internal"
>>[(set (match_operand:SI 0 "register_operand" "=a,d,m,a,d,m,a,d,m")
>>  (if_then_else:SI
>>(match_operator 1 "prism_comparison_operator"
>> [(match_operand 4 "cc_register" "") (const_int 0)])
>>(match_operand:SI 2 "nonimmediate_operand"   " 
>> v,m,r,0,0,0,v,m,r") ;; "v" constraint is for restricted offset MEM
>>(match_operand:SI 3 " nonimmediate_operand" "
>> 0,0,0,v,m,r,v,m,r")))] ;; the last 3 alternatives are split to match 
>> the other alternatives
>>
>>
>>
>> I encountered : (on gcc4.7.3)
>>
>> core_main.c:354:1: error: insn does not satisfy its constraints:
>> (insn 1176 1175 337 26 (set (reg:SI 5 $A5)
>>  (if_then_else:SI (ne (reg:CC 56 $CCI)
>>  (const_int 0 [0]))
>>  (mem/c:SI (plus:SI (reg/f:SI 0 $A0)
>>  (const_int 2104 [0x838])) [9 %sfp+2104 S4 A32])
>>  (const_int 1 [0x1]))) core_main.c:211:32 158 
>> {movsicc_internal}
>>
>> Due to reload pass (core_main.c.199r.reload).
>>
>>
>> How can I tune reload or write my movsicc pattern to prevent reload pass 
>> from generating a conditional move from wide offset MEM to $A registers ??
> If at all possible, I would recommend switching to LRA.  There's an up-front 
> cost, but it's definitely the direction all ports should be heading.  
> Avoiding reload is, umm, good.
>
> jeff
>


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Konstantin Vladimirov
Hi,

The x86 example was only for ease of reproduction. I am pretty
sure this is an architecture-independent issue. Say, on ARM:

.L2:
mov ip, r3, asl #2
add ip, ip, #4
add r3, r3, #1
ldrb r4, [r0, ip] @ zero_extendqisi2
cmp r3, r2
strb r4, [r1, ip]
bne .L2

May be improved to:

.L2:
add r3, r3, #1
ldrb ip, [r0, r3, asl #2] @ zero_extendqisi2
cmp r3, r2
strb ip, [r1, r3, asl #2]
bne .L2

And so on. I myself feel more comfortable with x86, but it is only
a matter of taste.

To get the improved version of the code, I just did by hand what the
compiler is expected to do automatically, i.e. rewrote things as:

int foo(char *t, char *v, int w)
{
int i;

for (i = 1; i != w; ++i)
{
v[(i + 1) << 2] = t[(i + 1) << 2];
}

return 0;
}

The private backend I am working on isn't a modification of any existing
one; it is a private backend written from scratch.

---
With best regards, Konstantin

On Fri, Dec 6, 2013 at 1:27 PM, David Brown  wrote:
> On 06/12/13 09:30, Konstantin Vladimirov wrote:
>> Hi,
>>
>> Consider code:
>>
>> int foo(char *t, char *v, int w)
>> {
>> int i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> int x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>
>> gcc -O2 -m32 -S test.c
>>
>> You will see loop, formed like:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> But it can be easily simplified to something like this:
>>
>> .L5:
>> addl $1, %eax
>> movzbl (%esi,%eax,4), %edx
>> cmpl %ecx, %eax
>> movb %dl, (%ebx,%eax,4)
>> jne .L5
>>
>> (i.e. left shift may be moved to address).
>>
>> First question to gcc-help maillist. May be there are some options,
>> that I've missed, and there IS a way to explain gcc my intention to do
>> this?
>>
>> And second question to gcc developers mail list. I am working on
>> private backend and want to add this optimization to my backend. What
>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>> some existent pass, etc?
>>
>
> Hi,
>
> Usually the gcc developers are not keen on emails going to both the help
> and development list - they prefer to keep them separate.
>
> My first thought when someone finds a "missed optimisation" issue,
> especially with the x86 target, is are you /sure/ this code is slower?
> x86 chips are immensely complex, and the interplay between different
> instructions, pipelines, superscaling, etc., means that code that might
> appear faster, can actually be slower.  So please check your
> architecture flags (i.e., are you optimising for the "native" cpu, or
> any other specific cpu - optimised code can be different for different
> x86 cpus).  Then /measure/ the speed of the code to see if there is a
> real difference.
>
>
> Regarding your "private backend" - is this a modification of the x86
> backend, or a completely different target?  If it is x86, then I think
> the answer is "don't do it - work with the mainline code".  If it is
> something else, then an x86-specific optimisation is of little use anyway.
>
> mvh.,
>
> David
>
>
>


Re: Hmmm, I think we've seen this problem before (lto build):

2013-12-06 Thread Richard Biener
On Fri, Dec 6, 2013 at 5:47 AM, Trevor Saunders  wrote:
> On Mon, Dec 02, 2013 at 12:16:18PM +0100, Richard Biener wrote:
>> On Sun, Dec 1, 2013 at 12:30 PM, Toon Moene  wrote:
>> > http://gcc.gnu.org/ml/gcc-testresults/2013-12/msg1.html
>> >
>> > FAILED: Bootstrap (build config: lto; languages: fortran; trunk revision
>> > 205557) on x86_64-unknown-linux-gnu
>> >
>> > In function 'release',
>> > inlined from 'release' at /home/toon/compilers/gcc/gcc/vec.h:1428:3,
>> > inlined from '__base_dtor ' at
>> > /home/toon/compilers/gcc/gcc/vec.h:1195:0,
>> > inlined from 'compute_antic_aux' at
>> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2212:0,
>> > inlined from 'compute_antic' at
>> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2493:0,
>> > inlined from 'do_pre' at
>> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4738:23,
>> > inlined from 'execute' at
>> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4818:0:
>> > /home/toon/compilers/gcc/gcc/vec.h:312:3: error: attempt to free a non-heap
>> > object 'worklist' [-Werror=free-nonheap-object]
>> >::free (v);
>> >^
>> > lto1: all warnings being treated as errors
>> > make[4]: *** [/dev/shm/wd26755/cczzGuTZ.ltrans13.ltrans.o] Error 1
>> > make[4]: *** Waiting for unfinished jobs
>> > lto-wrapper: make returned 2 exit status
>> > /usr/bin/ld: lto-wrapper failed
>> > collect2: error: ld returned 1 exit status
>>
>> Yes, I still see this - likely caused by IPA-CP / partial inlining and a 
>> "bogus"
>> warning for unreachable code.
>
> I'm really sorry about long delay here, I took a week off for
> thanksgiving then was pretty busy with other stuff :/
>
> If I remove the now useless  worklist.release (); on line 2211 of
> tree-ssa-pre.c lto bootstrap gets passed this issue to a stage 2 / 3
> comparison failure.  However doing that also causes these two test
> failures in a normal bootstrap / regression test cycle
>
> Tests that now fail, but worked before:
>
> unix/-m32: 17_intro/headers/c++200x/stdc++.cc (test for excess errors)
> unix/-m32: 17_intro/headers/c++200x/stdc++_multiple_inclusion.cc (test
> for excess errors)
>
> both of these failures are because of this ICE

This must be unrelated - please go ahead and install the patch removing
the useless worklist.release () from tree-ssa-pre.c

Thanks,
Richard.

> Executing on host: /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++
> -shared-libgcc -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++
> -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src
> -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs
> -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/
> -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/
> -isystem
> /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include
> -isystem
> /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include
> -m32
> -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0
> -ffunction-sections -fdata-sections -g -O2 -D_GNU_SOURCE -g -O2
> -D_GNU_SOURCE -DLOCALEDIR="." -nostdinc++
> -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include/x86_64-unknown-linux-gnu
> -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include
> -I/tmp/tmp.rsz07gSDni/libstdc++-v3/libsupc++
> -I/tmp/tmp.rsz07gSDni/libstdc++-v3/include/backward
> -I/tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/util
> /tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/17_intro/headers/c++200x/stdc++_multiple_inclusion.cc
> -std=gnu++0x -S  -m32 -o stdc++_multiple_inclusion.s(timeout = 600)
> spawn /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++ -shared-libgcc
> -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++
> -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src
> -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs
> -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/
> -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/
> -isystem
> /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include
> -isystem
> /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include
> -m32
> -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0
> -ffunction-sections -fdata-sections -g -O2 -D_GNU_SOURCE -g -O2
> -D_GNU_SOURCE -DLOCALEDIR="." -nostdinc++
> -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include/x86_64-unknown-linux-gnu
> -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include
> -I/tmp/tmp.rsz07gSDni/libstdc++-v3/libsup

Re: Truncate optimisation question

2013-12-06 Thread Richard Biener
On Fri, Dec 6, 2013 at 9:42 AM, Richard Sandiford
 wrote:
> Eric Botcazou  writes:
>>> Well, I think making the simplify-rtx code conditional on the target
>>> would be the wrong way to go.  If we really can't live with it being
>>> unconditional then I think we should revert it.  But like I say I think
>>> it would be better to make combine recognise the redundancy even with
>>> the new form.  (Or as I say, longer term, not to rely on combine to
>>> eliminate redundant extensions.)  But I don't have time to do that myself...
>>
>> It helps x86 so we won't revert it.  My fear is that we'll need to add code 
>> in
>> other places to RISCify back the result of this "simplification".
>
> Sorry, realised I didn't respond to this yesterday.  I wasn't suggesting
> we just revert and walk away.  ISTR the original suggestion was to patch
> combine instead of simplify-rtx.c, so we could back to that.

I think that looks most sensible.

Richard.

> Thanks,
> Richard


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Jakub Jelinek
On Fri, Dec 06, 2013 at 12:30:54PM +0400, Konstantin Vladimirov wrote:
> Consider code:
> 
> int foo(char *t, char *v, int w)
> {
> int i;
> 
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
> 
> return 0;
> }

This is either the job of the ivopts pass (dunno why it doesn't consider
turning those memory accesses into TARGET_MEM_REF with the *4 multiplication
in there) or of the combiner (which is limited to a single use though, so if you
do say v[x + 4] = t[i]; in the loop instead, it will for -m32 use
the (...,4) addressing, but as you have two uses, it doesn't do it).
As others said, the question is if using the more complex addressing
more than once is actually beneficial or not.
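
The single-use variant described above, as a sketch:

  for (i = 1; i != w; ++i)
    {
      int x = i << 2;
      v[x + 4] = t[i];  /* x now has a single use, so combine can fold
                           the *4 scaling into the store's address */
    }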

Anyway, while we're on this testcase, I wonder why VRP doesn't derive ranges
here.

  <bb 3>:
  # RANGE [-2147483648, 2147483647] NONZERO 0x0fffc
  x_5 = i_1 << 2;
  # RANGE ~[2147483648, 18446744071562067967] NONZERO 0x0fffc
  _6 = (sizetype) x_5;
  # RANGE ~[2147483652, 18446744071562067971] NONZERO 0x0fffc
  _7 = _6 + 4;
  # PT = nonlocal
  _9 = v_8(D) + _7;
  # PT = nonlocal
  _11 = t_10(D) + _7;
  _12 = *_11;
  *_9 = _12;
  i_14 = i_1 + 1;

  <bb 4>:
  # i_1 = PHI <1(2), i_14(3)>
  if (i_1 != w_4(D))
    goto <bb 3>;
  else
    goto <bb 5>;

As i is a signed integer with undefined overflow, and
the loop starts at 1 and is only ever incremented, can't
we derive
  # RANGE [1, 2147483647]
  # i_1 = PHI <1(2), i_14(3)>
and similarly for i_14?  We likely can't derive a similar range for
x_5 because at least in C++ it isn't undefined behavior if it
shifts into the negative (or is it?), but at least with
x = i * 4;
instead we could.

Jakub


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Richard Biener
On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
 wrote:
> Hi,
>
> Consider code:
>
> int foo(char *t, char *v, int w)
> {
> int i;
>
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
>
> return 0;
> }
>
> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>
> gcc -O2 -m32 -S test.c
>
> You will see loop, formed like:
>
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
>
> But it can be easily simplified to something like this:
>
> .L5:
> addl $1, %eax
> movzbl (%esi,%eax,4), %edx
> cmpl %ecx, %eax
> movb %dl, (%ebx,%eax,4)
> jne .L5
>
> (i.e. left shift may be moved to address).
>
> First question to gcc-help maillist. May be there are some options,
> that I've missed, and there IS a way to explain gcc my intention to do
> this?
>
> And second question to gcc developers mail list. I am working on
> private backend and want to add this optimization to my backend. What
> do you advise me to do -- custom gimple pass, or rtl pass, or modify
> some existent pass, etc?

This looks like a deficiency in induction variable optimization.  Note
that i << 2 may overflow and this overflow does not invoke undefined
behavior but is in the implementation defined behavior category.

The issue in this case is likely that the SCEV infrastructure does not handle
left-shifts.

Richard.

> ---
> With best regards, Konstantin


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Konstantin Vladimirov
Hi,

nothing changes if everything is unsigned and we are guaranteed not to
raise UB on overflow:

unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
{
unsigned i;

for (i = 1; i != w; ++i)
{
unsigned x = i << 2;
v[x + 4] = t[x + 4];
}

return 0;
}

yields:

.L5:
leal 0(,%eax,4), %edx
addl $1, %eax
movzbl 4(%edi,%edx), %ecx
cmpl %ebx, %eax
movb %cl, 4(%esi,%edx)
jne .L5

What is the SCEV infrastructure (scalar evolutions, I'm guessing?), and what
files/passes should I look in?

---
With best regards, Konstantin

On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
 wrote:
> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
>  wrote:
>> Hi,
>>
>> Consider code:
>>
>> int foo(char *t, char *v, int w)
>> {
>> int i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> int x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>
>> gcc -O2 -m32 -S test.c
>>
>> You will see loop, formed like:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> But it can be easily simplified to something like this:
>>
>> .L5:
>> addl $1, %eax
>> movzbl (%esi,%eax,4), %edx
>> cmpl %ecx, %eax
>> movb %dl, (%ebx,%eax,4)
>> jne .L5
>>
>> (i.e. left shift may be moved to address).
>>
>> First question to gcc-help maillist. May be there are some options,
>> that I've missed, and there IS a way to explain gcc my intention to do
>> this?
>>
>> And second question to gcc developers mail list. I am working on
>> private backend and want to add this optimization to my backend. What
>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>> some existent pass, etc?
>
> This looks like a deficiency in induction variable optimization.  Note
> that i << 2 may overflow and this overflow does not invoke undefined
> behavior but is in the implementation defined behavior category.
>
> The issue in this case is likely that the SCEV infrastructure does not handle
> left-shifts.
>
> Richard.
>
>> ---
>> With best regards, Konstantin


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Richard Biener
On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
 wrote:
> Hi,
>
> nothing changes if everything is unsigned and we are guaranteed to not
> raise UB on overflow:
>
> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
> {
> unsigned i;
>
> for (i = 1; i != w; ++i)
> {
> unsigned x = i << 2;
> v[x + 4] = t[x + 4];
> }
>
> return 0;
> }
>
> yields:
>
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
>
> What is SCEV infrastructure (guessing scalar evolutions?) and what
> files/passes to look in?

tree-scalar-evolution.c, look at where it handles MULT_EXPR but
lacks LSHIFT_EXPR support.

Richard.

> ---
> With best regards, Konstantin
>
> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
>  wrote:
>> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
>>  wrote:
>>> Hi,
>>>
>>> Consider code:
>>>
>>> int foo(char *t, char *v, int w)
>>> {
>>> int i;
>>>
>>> for (i = 1; i != w; ++i)
>>> {
>>> int x = i << 2;
>>> v[x + 4] = t[x + 4];
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>>>
>>> gcc -O2 -m32 -S test.c
>>>
>>> You will see loop, formed like:
>>>
>>> .L5:
>>> leal 0(,%eax,4), %edx
>>> addl $1, %eax
>>> movzbl 4(%edi,%edx), %ecx
>>> cmpl %ebx, %eax
>>> movb %cl, 4(%esi,%edx)
>>> jne .L5
>>>
>>> But it can be easily simplified to something like this:
>>>
>>> .L5:
>>> addl $1, %eax
>>> movzbl (%esi,%eax,4), %edx
>>> cmpl %ecx, %eax
>>> movb %dl, (%ebx,%eax,4)
>>> jne .L5
>>>
>>> (i.e. left shift may be moved to address).
>>>
>>> First question to gcc-help maillist. May be there are some options,
>>> that I've missed, and there IS a way to explain gcc my intention to do
>>> this?
>>>
>>> And second question to gcc developers mail list. I am working on
>>> private backend and want to add this optimization to my backend. What
>>> do you advise me to do -- custom gimple pass, or rtl pass, or modify
>>> some existent pass, etc?
>>
>> This looks like a deficiency in induction variable optimization.  Note
>> that i << 2 may overflow and this overflow does not invoke undefined
>> behavior but is in the implementation defined behavior category.
>>
>> The issue in this case is likely that the SCEV infrastructure does not handle
>> left-shifts.
>>
>> Richard.
>>
>>> ---
>>> With best regards, Konstantin


Make SImode the default mode for INT type.

2013-12-06 Thread Umesh Kalappa
Hi all,

We are re-targeting gcc 4.8.1 to a 16-bit core, where word = int
= short = pointer = 16 bits, char = 8 bits, and long = 32 bits.

We model the above requirement as

#define BITS_PER_UNIT   8

#define BITS_PER_WORD   16

#define UNITS_PER_WORD  2

#define POINTER_SIZE16

#define SHORT_TYPE_SIZE 16

#define INT_TYPE_SIZE   16

#define LONG_TYPE_SIZE  32

#define FLOAT_TYPE_SIZE 16

#define DOUBLE_TYPE_SIZE32

I tried to compile the sample below with the retargeted compiler:

int a =10;

int b =10;


int func()

{

 return a+ b;

}

the compiler treats a and b as globals with short type (HImode), of size
2 bytes, whereas we need the word mode to be SI, not HI.

I do understand that SImode and HImode are the same size, but still I
insist it is better to have SImode.

Could somebody in the group please share their thoughts on how we can
achieve this?

Thanks
~Umesh


[Warning] Signed mismatch for basic datatype.

2013-12-06 Thread Umesh Kalappa
Hi All ,

The sample below caught my attention, i.e.

int a ;
unsigned int  b;
int func()
{
return a =b;
}
the compiler didn't warn me about the signedness mismatch in the above
case, whereas

int *a ;
unsigned int  *b;
int func()
{
a =b;
return *a;
}
compiler warns me as

warning: pointer targets in assignment differ in signedness [-Wpointer-sign]

I'm a bit confused, or I'm missing something here.

Any thoughts?

Thanks
~Umesh


Re: x86 gcc lacks simple optimization

2013-12-06 Thread H.J. Lu
On Fri, Dec 6, 2013 at 2:25 AM, Richard Biener
 wrote:
> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
>  wrote:
>> Hi,
>>
>> nothing changes if everything is unsigned and we are guaranteed to not
>> raise UB on overflow:
>>
>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
>> {
>> unsigned i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> unsigned x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> yields:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> What is SCEV infrastructure (guessing scalar evolutions?) and what
>> files/passes to look in?
>
> tree-scalar-evolution.c, look at where it handles MULT_EXPR but
> lacks LSHIFT_EXPR support.
>

For
--
int foo(char *t, char *v, int w)
{
int i;

for (i = 1; i != w; ++i)
{
int x = i * 2;
v[x + 4] = t[x + 4];
}

return 0;
}
---

-O2 gives:

.L6:
movzbl4(%esi,%eax,2), %edx
movb%dl, 4(%ebx,%eax,2)
addl$1, %eax
cmpl%ecx, %eax
jne.L6


-- 
H.J.


Re: [RFC] Vectorization of indexed elements

2013-12-06 Thread Richard Biener
On Wed, 4 Dec 2013, Vidya Praveen wrote:

> Hi Richi,
> 
> Apologies for the late response. I was on vacation.
> 
> On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> > > void
> > > foo (int *__restrict__ a,
> > >  int *__restrict__ b,
> > >  int c)
> > > {
> > >   int i;
> > > 
> > >   for (i = 0; i < 8; i++)
> > > a[i] = b[i] * c;
> > > }
> > 
> > Both cases can be handled by patterns that match
> > 
> >   (mul:VXSI (reg:VXSI
> >  (vec_duplicate:VXSI reg:SI)))
> 
> How do I arrive at this pattern in the first place? Assuming vec_init with
> uniform values are expanded as vec_duplicate, it will still be two 
> expressions.
> 
> That is,
> 
> (set reg:VXSI (vec_duplicate:VXSI (reg:SI)))
> (set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI)))

Yes, but then combine comes along and creates

 (set reg:VXSI (mul:VXSI (reg:VXSI) (vec_duplicate:VXSI (reg:SI))))

which matches one of your define_insn[_and_split]s.
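
For instance, a pattern of roughly this shape (a sketch only: constraint
letters and the output template are hypothetical, following the
"mul reg:VXSI, reg:VXSI, reg:VXSI[0]" idea quoted below):

  (define_insn "*mulvxsi3_dup"
    [(set (match_operand:VXSI 0 "register_operand" "=w")
          (mult:VXSI
            (match_operand:VXSI 1 "register_operand" "w")
            (vec_duplicate:VXSI
              (match_operand:SI 2 "register_operand" "w"))))]
    ""
    "mul\t%0, %1, %2[0]")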

> > You'd then "consume" the vec_duplicate and implement it as
> > load scalar into element zero of the vector and use index mult
> > with index zero.
> 
> If I understand this correctly, you are suggesting to leave the scalar
> load from memory as it is but treat the 
> 
> (mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI)))
> 
> as 
> 
> load reg:VXSI[0], reg:SI
> mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register 
> perhaps
> 
> either by generating instructions directly or by using define_split. Am I 
> right?

Possibly.  Or allow memory as operand 2 for your pattern (so, not
reg:SI but mem:SI).  Combine should be happy with that, too.
 
> If I'm right, then my concern is that it may be possible to simplify this 
> further
> by loading directly to a indexed vector register from memory but it's too 
> late at
> this point for such simplification to be possible.
> 
> Please let me know what am I not understanding.

Not sure.  Did you try it?

Richard.


Re: Make SImode the default mode for INT type.

2013-12-06 Thread Richard Sandiford
Umesh Kalappa  writes:
> Tried to compile the below sample by retargeted compiler
>
> int a =10;
>
> int b =10;
>
>
> int func()
>
> {
>
>  return a+ b;
>
> }
>
> the compiler is stating that the a and b are global with short type(HI
> mode) of size 2 bytes.

Yeah, HImode sounds right in this case.

> where as we  need the word mode as SI not HI ,I do understand that the
> SI and HI modes are of  same size but till I insist  better to have SI
> mode.

For a 16-bit target, word_mode should be HImode rather than SImode.

The way the modes are defined is that QImode is always one "unit"
(byte), HImode is always twice the size of QImode, SImode is always
twice the size of HImode, and so on.  Other modes like word_mode,
Pmode and ptr_mode are defined relative to that.
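
For the 16-bit configuration quoted above, that works out to:

  QImode = 1 unit  =  8 bits  -> char
  HImode = 2 units = 16 bits  -> short, int, pointers (word_mode, Pmode)
  SImode = 4 units = 32 bits  -> long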

Thanks,
Richard


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Marc Glisse

On Fri, 6 Dec 2013, Konstantin Vladimirov wrote:


Consider code:

int foo(char *t, char *v, int w)
{
int i;

for (i = 1; i != w; ++i)
{
int x = i << 2;


A side note, but something too few people seem to be aware of: writing 
i<<2 can pessimize code compared to i*4 (and it is never faster). That is 
because, at a high level, signed multiplication overflow is undefined 
behavior while shift isn't. At a low level, gcc knows it can implement *4 
as a shift anyway.



v[x + 4] = t[x + 4];
}

return 0;
}
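
Applied to the quoted loop, that rewrite is just:

  int x = i * 4;   /* gcc emits the same shift, but the multiply form
                      leaves signed overflow undefined, which the
                      optimizers can exploit */
  v[x + 4] = t[x + 4];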


--
Marc Glisse


Re: x86 gcc lacks simple optimization

2013-12-06 Thread Konstantin Vladimirov
Hi,

Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c
and now it yields code like (x86 again):

.L5:
movzbl 4(%esi,%eax,4), %edx
movb %dl, 4(%ebx,%eax,4)
addl $1, %eax
cmpl %ecx, %eax
jne .L5

So the excessive lea is gone. That is great, thank you so much. But I
wonder what else I can do to move the add earlier to simplify the memory
accesses (I am guessing this is some arithmetical re-association;
still not sure where to look). For the architecture I am working on, it
is important. What would you advise?
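
(For reference, the shape such a case can take, modeled on the existing
MULT_EXPR handling in interpret_rhs_expr -- a sketch, not the actual
change, with placement and variable names assumed: treat A << B as
A * (1 << B).)

    case LSHIFT_EXPR:
      /* Sketch: handle A << B as A * (1 << B).  */
      chrec1 = analyze_scalar_evolution (loop, rhs1);
      chrec1 = chrec_convert (type, chrec1, at_stmt);
      res = chrec_fold_multiply (type, chrec1,
                                 fold_build2 (LSHIFT_EXPR, type,
                                              build_int_cst (type, 1),
                                              rhs2));
      break;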

---
With best regards, Konstantin

On Fri, Dec 6, 2013 at 2:25 PM, Richard Biener
 wrote:
> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
>  wrote:
>> Hi,
>>
>> nothing changes if everything is unsigned and we are guaranteed to not
>> raise UB on overflow:
>>
>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
>> {
>> unsigned i;
>>
>> for (i = 1; i != w; ++i)
>> {
>> unsigned x = i << 2;
>> v[x + 4] = t[x + 4];
>> }
>>
>> return 0;
>> }
>>
>> yields:
>>
>> .L5:
>> leal 0(,%eax,4), %edx
>> addl $1, %eax
>> movzbl 4(%edi,%edx), %ecx
>> cmpl %ebx, %eax
>> movb %cl, 4(%esi,%edx)
>> jne .L5
>>
>> What is SCEV infrastructure (guessing scalar evolutions?) and what
>> files/passes to look in?
>
> tree-scalar-evolution.c, look at where it handles MULT_EXPR but
> lacks LSHIFT_EXPR support.
>
> Richard.
>
>> ---
>> With best regards, Konstantin
>>
>> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
>>  wrote:
>>> On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
>>>  wrote:
 Hi,

 Consider code:

 int foo(char *t, char *v, int w)
 {
 int i;

 for (i = 1; i != w; ++i)
 {
 int x = i << 2;
 v[x + 4] = t[x + 4];
 }

 return 0;
 }

 Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:

 gcc -O2 -m32 -S test.c

 You will see loop, formed like:

 .L5:
 leal 0(,%eax,4), %edx
 addl $1, %eax
 movzbl 4(%edi,%edx), %ecx
 cmpl %ebx, %eax
 movb %cl, 4(%esi,%edx)
 jne .L5

 But it can be easily simplified to something like this:

 .L5:
 addl $1, %eax
 movzbl (%esi,%eax,4), %edx
 cmpl %ecx, %eax
 movb %dl, (%ebx,%eax,4)
 jne .L5

 (i.e. left shift may be moved to address).

 First question to gcc-help maillist. May be there are some options,
 that I've missed, and there IS a way to explain gcc my intention to do
 this?

 And second question to gcc developers mail list. I am working on
 private backend and want to add this optimization to my backend. What
 do you advise me to do -- custom gimple pass, or rtl pass, or modify
 some existent pass, etc?
>>>
>>> This looks like a deficiency in induction variable optimization.  Note
>>> that i << 2 may overflow and this overflow does not invoke undefined
>>> behavior but is in the implementation defined behavior category.
>>>
>>> The issue in this case is likely that the SCEV infrastructure does not 
>>> handle
>>> left-shifts.
>>>
>>> Richard.
>>>
 ---
 With best regards, Konstantin


Oleg Endo appointed co-maintainer of SH port

2013-12-06 Thread David Edelsohn
I am pleased to announce that the GCC Steering Committee has
appointed Oleg Endo as co-maintainer of the SH port.

Please join me in congratulating Oleg on his new role.
Oleg, please update your listing in the MAINTAINERS file.

Happy hacking!
David



Re: x86 gcc lacks simple optimization

2013-12-06 Thread Richard Biener
On Fri, Dec 6, 2013 at 2:52 PM, Konstantin Vladimirov
 wrote:
> Hi,
>
> Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c
> and now it yields code like (x86 again):
>
> .L5:
> movzbl 4(%esi,%eax,4), %edx
> movb %dl, 4(%ebx,%eax,4)
> addl $1, %eax
> cmpl %ecx, %eax
> jne .L5
>
> So, excessive lea is gone. It is great, thank you so much. But I
> wonder what else can I do to move add upper to simplify memory
> accesses (I am guessing, this is some arithmetical re-associations,
> still not sure where to look). For architecture, I am working on, it
> is important. What would you advise?

You need to look at IVOPTs and how it arrives at the choice of
induction variables.

Richard.

> ---
> With best regards, Konstantin
>
> On Fri, Dec 6, 2013 at 2:25 PM, Richard Biener
>  wrote:
>> On Fri, Dec 6, 2013 at 11:19 AM, Konstantin Vladimirov
>>  wrote:
>>> Hi,
>>>
>>> nothing changes if everything is unsigned and we are guaranteed to not
>>> raise UB on overflow:
>>>
>>> unsigned foo(unsigned char *t, unsigned char *v, unsigned w)
>>> {
>>> unsigned i;
>>>
>>> for (i = 1; i != w; ++i)
>>> {
>>> unsigned x = i << 2;
>>> v[x + 4] = t[x + 4];
>>> }
>>>
>>> return 0;
>>> }
>>>
>>> yields:
>>>
>>> .L5:
>>> leal 0(,%eax,4), %edx
>>> addl $1, %eax
>>> movzbl 4(%edi,%edx), %ecx
>>> cmpl %ebx, %eax
>>> movb %cl, 4(%esi,%edx)
>>> jne .L5
>>>
>>> What is SCEV infrastructure (guessing scalar evolutions?) and what
>>> files/passes to look in?
>>
>> tree-scalar-evolution.c, look at where it handles MULT_EXPR but
>> lacks LSHIFT_EXPR support.
>>
>> Richard.
>>
>>> ---
>>> With best regards, Konstantin
>>>
>>> On Fri, Dec 6, 2013 at 2:10 PM, Richard Biener
>>>  wrote:
 On Fri, Dec 6, 2013 at 9:30 AM, Konstantin Vladimirov
  wrote:
> Hi,
>
> Consider code:
>
> int foo(char *t, char *v, int w)
> {
> int i;
>
> for (i = 1; i != w; ++i)
> {
> int x = i << 2;
> v[x + 4] = t[x + 4];
> }
>
> return 0;
> }
>
> Compile it to x86 (I used both gcc 4.7.2 and gcc 4.8.1) with options:
>
> gcc -O2 -m32 -S test.c
>
> You will see loop, formed like:
>
> .L5:
> leal 0(,%eax,4), %edx
> addl $1, %eax
> movzbl 4(%edi,%edx), %ecx
> cmpl %ebx, %eax
> movb %cl, 4(%esi,%edx)
> jne .L5
>
> But it can be easily simplified to something like this:
>
> .L5:
> addl $1, %eax
> movzbl (%esi,%eax,4), %edx
> cmpl %ecx, %eax
> movb %dl, (%ebx,%eax,4)
> jne .L5
>
> (i.e. left shift may be moved to address).
>
> First question to gcc-help maillist. May be there are some options,
> that I've missed, and there IS a way to explain gcc my intention to do
> this?
>
> And second question to gcc developers mail list. I am working on
> private backend and want to add this optimization to my backend. What
> do you advise me to do -- custom gimple pass, or rtl pass, or modify
> some existent pass, etc?

 This looks like a deficiency in induction variable optimization.  Note
 that i << 2 may overflow and this overflow does not invoke undefined
 behavior but is in the implementation defined behavior category.

 The issue in this case is likely that the SCEV infrastructure does not 
 handle
 left-shifts.

 Richard.

> ---
> With best regards, Konstantin


Re: [Warning] Signed mismatch for basic datatype.

2013-12-06 Thread Andrew Haley
On 12/06/2013 10:41 AM, Umesh Kalappa wrote:
> I’m bit confused or i'm missing something here .

The first of these is implementation-defined behaviour, and the second
is (potentially) undefined behaviour.

This is more of a generic C question than a GCC question.
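
To spell out the distinction with a sketch (assuming 32-bit int):

  int a;
  unsigned int b = 4294967295u;   /* UINT_MAX */

  a = b;      /* out-of-range value conversion: the result is
                 implementation-defined (commonly -1), so gcc only warns
                 with -Wsign-conversion, which is off by default */

  int *p;
  unsigned int *q = &b;
  p = q;      /* pointee signedness differs: -Wpointer-sign, because
                 later accesses through the mismatched pointer are the
                 (potentially) undefined case */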

Andrew.



Re: Make SImode the default mode for INT type.

2013-12-06 Thread Paul_Koning

On Dec 6, 2013, at 5:40 AM, Umesh Kalappa  wrote:

> Hi all,
> 
> We are re-targeting the gcc 4.8.1 to the 16 bit core ,where word =int
> = short = pointer= 16 , char = 8 bit  and long  =32 bit.
> 
> We model the above requirement as
> 
> #define BITS_PER_UNIT   8
> 
> #define BITS_PER_WORD   16
> 
> #define UNITS_PER_WORD  2
> 
> #define POINTER_SIZE16
> 
> #define SHORT_TYPE_SIZE 16
> 
> #define INT_TYPE_SIZE   16
> 
> #define LONG_TYPE_SIZE  32
> 
> #define FLOAT_TYPE_SIZE 16
> 
> #define DOUBLE_TYPE_SIZE32
> 
> Tried to compile the below sample by retargeted compiler
> 
> int a =10;
> 
> int b =10;
> 
> 
> int func()
> 
> {
> 
> return a+ b;
> 
> }
> 
> the compiler is stating that the a and b are global with short type(HI
> mode) of size 2 bytes.
> 
> where as we  need the word mode as SI not HI ,I do understand that the
> SI and HI modes are of  same size but till I insist  better to have SI
> mode.
> 
> Please somebody or expert in the  group  share their thought on the
> same  like how do we can achieve this ?
> 
> Thanks
> ~Umesh

As Richard mentioned, SImode is not "the mode for int" but rather "the mode for 
the type that's 4x the size of QImode".  So in your case, that would be the 
mode for "long" and HImode is the mode for "int".

Apart from the float and double sizes, what you describe is just like the pdp11 
target.  And indeed that target has int == HImode as expected.

paul


Re: Hmmm, I think we've seen this problem before (lto build):

2013-12-06 Thread Trevor Saunders
On Fri, Dec 06, 2013 at 10:47:00AM +0100, Richard Biener wrote:
> On Fri, Dec 6, 2013 at 5:47 AM, Trevor Saunders  wrote:
> > On Mon, Dec 02, 2013 at 12:16:18PM +0100, Richard Biener wrote:
> >> On Sun, Dec 1, 2013 at 12:30 PM, Toon Moene  wrote:
> >> > http://gcc.gnu.org/ml/gcc-testresults/2013-12/msg1.html
> >> >
> >> > FAILED: Bootstrap (build config: lto; languages: fortran; trunk revision
> >> > 205557) on x86_64-unknown-linux-gnu
> >> >
> >> > In function 'release',
> >> > inlined from 'release' at /home/toon/compilers/gcc/gcc/vec.h:1428:3,
> >> > inlined from '__base_dtor ' at
> >> > /home/toon/compilers/gcc/gcc/vec.h:1195:0,
> >> > inlined from 'compute_antic_aux' at
> >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2212:0,
> >> > inlined from 'compute_antic' at
> >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:2493:0,
> >> > inlined from 'do_pre' at
> >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4738:23,
> >> > inlined from 'execute' at
> >> > /home/toon/compilers/gcc/gcc/tree-ssa-pre.c:4818:0:
> >> > /home/toon/compilers/gcc/gcc/vec.h:312:3: error: attempt to free a 
> >> > non-heap
> >> > object 'worklist' [-Werror=free-nonheap-object]
> >> >::free (v);
> >> >^
> >> > lto1: all warnings being treated as errors
> >> > make[4]: *** [/dev/shm/wd26755/cczzGuTZ.ltrans13.ltrans.o] Error 1
> >> > make[4]: *** Waiting for unfinished jobs
> >> > lto-wrapper: make returned 2 exit status
> >> > /usr/bin/ld: lto-wrapper failed
> >> > collect2: error: ld returned 1 exit status
> >>
> >> Yes, I still see this - likely caused by IPA-CP / partial inlining and a 
> >> "bogus"
> >> warning for unreachable code.
> >
> > I'm really sorry about long delay here, I took a week off for
> > thanksgiving then was pretty busy with other stuff :/
> >
> > If I remove the now useless  worklist.release (); on line 2211 of
> > tree-ssa-pre.c lto bootstrap gets passed this issue to a stage 2 / 3
> > comparison failure.  However doing that also causes these two test
> > failures in a normal bootstrap / regression test cycle
> >
> > Tests that now fail, but worked before:
> >
> > unix/-m32: 17_intro/headers/c++200x/stdc++.cc (test for excess errors)
> > unix/-m32: 17_intro/headers/c++200x/stdc++_multiple_inclusion.cc (test
> > for excess errors)
> >
> > both of these failures are because of this ICE
> 
> This must be unrelated - please go ahead and install the patch removing
> the useless worklist.release () from tree-ssa-pre.c

Done in r205750; sorry about the breakage.

Trev

> 
> Thanks,
> Richard.
> 
> > Executing on host: /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++
> > -shared-libgcc -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++
> > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src
> > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs
> > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/
> > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/
> > -isystem
> > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include
> > -isystem
> > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include
> > -m32
> > -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> > -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0
> > -ffunction-sections -fdata-sections -g -O2 -D_GNU_SOURCE -g -O2
> > -D_GNU_SOURCE -DLOCALEDIR="." -nostdinc++
> > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include/x86_64-unknown-linux-gnu
> > -I/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/include
> > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/libsupc++
> > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/include/backward
> > -I/tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/util
> > /tmp/tmp.rsz07gSDni/libstdc++-v3/testsuite/17_intro/headers/c++200x/stdc++_multiple_inclusion.cc
> > -std=gnu++0x -S  -m32 -o stdc++_multiple_inclusion.s(timeout = 600)
> > spawn /tmp/tmp.rsz07gSDni/test-objdir/./gcc/xg++ -shared-libgcc
> > -B/tmp/tmp.rsz07gSDni/test-objdir/./gcc -nostdinc++
> > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src
> > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> > -L/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/libsupc++/.libs
> > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/bin/
> > -B/tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/lib/
> > -isystem
> > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/include
> > -isystem
> > /tmp/tmp.rsz07gSDni/test-install/x86_64-unknown-linux-gnu/sys-include
> > -m32
> > -B/tmp/tmp.rsz07gSDni/test-objdir/x86_64-unknown-linux-gnu/32/libstdc++-v3/src/.libs
> > -fdiagnostics-color=never -D_GLIBCXX_ASSERT -fmessage-length=0
> > -ffunction-sections -fdata-sections -

Re: Oleg Endo appointed co-maintainer of SH port

2013-12-06 Thread Oleg Endo
On Fri, 2013-12-06 at 09:05 -0500, David Edelsohn wrote:
>   I am pleased to announce that the GCC Steering Committee has
> appointed Oleg Endo as co-maintainer of the SH port.
> 
>   Please join me in congratulating Oleg on his new role.
> Oleg, please update your listing in the MAINTAINERS file.

Thank you.

I've just committed the following.

Index: MAINTAINERS
===
--- MAINTAINERS (revision 205756)
+++ MAINTAINERS (working copy)
@@ -102,6 +102,7 @@
 score port Chen Liqin  liqin@gmail.com
 sh portAlexandre Oliva aol...@redhat.com
 sh portKaz Kojima  kkoj...@gcc.gnu.org
+sh portOleg Endo   olege...@gcc.gnu.org
 sparc port Richard Henderson   r...@redhat.com
 sparc port David S. Miller da...@redhat.com
 sparc port Eric Botcazou   ebotca...@libertysurf.fr
@@ -364,7 +365,6 @@
 Bernd Edlinger bernd.edlin...@hotmail.de
 Phil Edwards   p...@gcc.gnu.org
Mohan Embar gnust...@thisiscool.com
-Oleg Endo  olege...@gcc.gnu.org
 Revital Eres   e...@il.ibm.com
 Marc Espie es...@cvs.openbsd.org
Rafael Ávila de Espíndola  espind...@google.com



Re: x86 gcc lacks simple optimization

2013-12-06 Thread Jeff Law

On 12/06/13 07:17, Richard Biener wrote:

On Fri, Dec 6, 2013 at 2:52 PM, Konstantin Vladimirov
 wrote:

Hi,

Richard, I tried to add LSHIFT_EXPR case to tree-scalar-evolution.c
and now it yields code like (x86 again):

.L5:
movzbl 4(%esi,%eax,4), %edx
movb %dl, 4(%ebx,%eax,4)
addl $1, %eax
cmpl %ecx, %eax
jne .L5

So, excessive lea is gone. It is great, thank you so much. But I
wonder what else can I do to move add upper to simplify memory
accesses (I am guessing, this is some arithmetical re-associations,
still not sure where to look). For architecture, I am working on, it
is important. What would you advise?


You need to look at IVOPTs and how it arrives at the choice of
induction variables.

Konstantin,

You might want to work with Ben Cheng from ARM; he's already poking in 
this code and may have some ideas.


jeff



Re: [RFC, LRA] Repeated looping over subreg reloads.

2013-12-06 Thread Vladimir Makarov

On 12/5/2013, 9:35 AM, Tejas Belagod wrote:

Vladimir Makarov wrote:

On 12/4/2013, 6:15 AM, Tejas Belagod wrote:

Hi,

I'm trying to relax CANNOT_CHANGE_MODE_CLASS for aarch64 to allow all
mode changes on FP_REGS as aarch64 does not have register-packing, but
I'm running into an LRA ICE. A test case generates an RTL subreg of the
following form

(set (reg:DF 97) (subreg:DF (reg:V2DF 95) 8))

LRA has to reload the subreg because the subreg is not representable as
a full register. When LRA reloads this in
lra-constraints.c:simplify_operand_subreg (), it seems to reload
SUBREG_REG() and leave the byte offset alone.

i.e.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

The code in lra-constraints.c is this conditional:

   /* Force a reload of the SUBREG_REG if this is a constant or PLUS or
      if there may be a problem accessing OPERAND in the outer
      mode.  */
   if ((REG_P (reg)
        [...])
     {
       [...]
       insert_move_for_subreg (insert_before ? &before : NULL,
                               insert_after ? &after : NULL,
                               reg, new_reg);
     }

What happens subsequently is that LRA keeps looping over this RTL and
keeps reloading the SUBREG_REG() till the limit of constraint passes is
reached.

  (set (reg:V2DF 100) (reg:V2DF 95))
  (set (reg:DF 97) (subreg:DF (reg:V2DF 100) 8))

I can't see any place where this subreg is resolved (e.g. into an
equivalent memref) before the next iteration comes around for reloading
the inputs and outputs of curr_insn. Or am I missing some part of the
code that tries reloading the subreg with different alternatives or reg
classes?



I guess this behaviour is wrong.  We could spill the V2DF pseudo or
put it into a reg of another class, but that is not implemented.  This
code is actually a modified version of the one in the reload pass.  We
could implement alternative strategies and a check for a potential loop
(such code exists in process_alt_operands).

Could you send me the macro change and the test?  I'll look at it and
figure out what we can do.


Hi,

Thanks for looking at this.

The macro change is in this patch
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg03638.html. The test is
gcc.c-torture/compile/simd-3.c and when compiled with -O1 for aarch64,
ICEs:

gcc/testsuite/gcc.c-torture/compile/simd-3.c:22:1: internal compiler
error: Maximum number of LRA constraint passes is achieved (30)

Also, I'm curious to know - is it possible to vec_extract for vector
mode subregs and zero/sign extract for scalars, with spilling as the last
resort if neither of these is possible? As you say, a non-zero
SUBREG_BYTE offset could also be resolved using a different regclass
where the sub-mode could just be a full register.



Here is the patch which solves the problem.  Right now it only 
spills, but that is the best that can be done for this case.  I'll submit 
the patch next week after better testing on different platforms.


Vec_extract is interesting, but it is a rare case which needs a lot of 
code to implement.  I think we need a more general approach called 
bitwidth-aware RA (putting several pseudo values into one reg, e.g. vec 
regs), although I don't know whether it would help for arm64 cpus.  The 
last time I checked bitwidth-aware RA manually for intel cpus, it made 
the code bigger and slower.


If there is a mainstream processor for which it can improve performance, 
I'd put it on my higher-priority list to do.





Index: lra-constraints.c
===
--- lra-constraints.c   (revision 205449)
+++ lra-constraints.c   (working copy)
@@ -1237,9 +1237,20 @@ simplify_operand_subreg (int nop, enum m
&& ! LRA_SUBREG_P (operand))
   || CONSTANT_P (reg) || GET_CODE (reg) == PLUS || MEM_P (reg))
 {
-  /* The class will be defined later in curr_insn_transform.  */
-  enum reg_class rclass
-   = (enum reg_class) targetm.preferred_reload_class (reg, ALL_REGS);
+  enum reg_class rclass;
+
+  if (REG_P (reg)
+ && curr_insn_set != NULL_RTX
+ && (REG_P (SET_SRC (curr_insn_set))
+ || GET_CODE (SET_SRC (curr_insn_set)) == SUBREG))
+   /* There is big probability that we will get the same class
+  for the new pseudo and we will get the same insn which
+  means infinite looping.  So spill the new pseudo.  */
+   rclass = NO_REGS;
+  else
+   /* The class will be defined later in curr_insn_transform.  */
+   rclass
+ = (enum reg_class) targetm.preferred_reload_class (reg, ALL_REGS);
 
   if (get_reload_reg (curr_static_id->operand[nop].type, reg_mode, reg,
  rclass, "subreg reg", &new_reg))