[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog

2009-05-04 Thread carrot at google dot com


--- Comment #8 from carrot at google dot com  2009-05-04 10:08 ---
Sorry for my ignorance of gcc. What types of instructions will reload add?
Spilling and loading registers? Anything more?

By reading the implementation of thumb_far_jump_used_p(), I conclude that if
reload thinks there is a far jump, later passes won't change this decision.
But if reload thinks there is no far jump, later passes still need to re-check
for far jumps and may change this decision. So if reload occasionally makes a
wrong decision, later passes should correct it. Is that right?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570



[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog

2009-05-05 Thread carrot at google dot com


--- Comment #10 from carrot at google dot com  2009-05-05 15:32 ---
(In reply to comment #9)
> (In reply to comment #8)
> > Sorry for my ignorance to gcc. What types of instructions reload will add?
> > Spilling and loading registers? and more?
> > 
> That's pretty much it, but...
Before register spilling, it must have used up all physical registers,
including callee saved registers. Any saving of callee saved register should
already have disabled this optimization.

> 
> > By reading the the implementation of thumb_far_jump_used_p() I can get the
> > conclusion that if reload thinks there is a far jump, later pass won't 
> > change
> > this decision. But if reload thinks there is no far jump, later pass still 
> > need
> > to re-check the far jump existence and may change this decision. So if 
> > reload
> > occasionally makes a wrong decision later pass should correct it, is it 
> > right?
> > 
> 
> 
> Once reload has completed we can't change the decision as to whether or not LR
> gets saved onto the stack or not.  Unfortunately, that doesn't play well with
> constant pools.  We sometimes need to inline these, and that might cause
> branches to be pushed out of range.  Since we don't inline the pools until
> after reload has completed, that's a major stumbling block.  The current code
> just isn't aware of these issues.
> 

It looks like a bug in the current code, and my patch tries to exploit it. We
should fix it by checking for a far jump (or thumb_force_lr_save) in the reload
pass only, and simply reading this computed value in later passes.

Computing the exact limit looks very difficult, if not impossible. Could we
simply use a predefined constant that is much smaller than the far jump
threshold as the limit? For example, use the constant 256, which is only 1/8 of
the far jump threshold. I don't expect a larger function to have any chance of
satisfying the other conditions: being a leaf function and not using any
callee-saved registers.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570



[Bug rtl-optimization/40314] New: inefficient address calculation of fields of large struct

2009-05-30 Thread carrot at google dot com
Given a structure and 3 field accesses:

typedef struct network
{
   char inputfile[400];
   int* nodes;
   int* arcs;
   int* stop_arcs;
} network_t;

   int *arc;
   int *node = net->nodes;   <--- A
   void *stop = (void *)net->stop_arcs;  <--- B
   for( arc = net->arcs; arc != (int *)stop; arc++ ) <--- C

GCC generates the following instruction sequence in thumb mode with options
-O2 -Os; it needs 9 instructions to load 3 fields:

   mov r2, #200      <---  A1
   lsl r1, r2, #1    <---  A2
   .loc 1 14 0
   mov r4, #204      <---  B1
   lsl r3, r4, #1    <---  B2
   .loc 1 13 0
   ldr r2, [r0, r1]  <---  A3
   .loc 1 15 0
   mov r1, #202      <---  C1
   .loc 1 14 0
   ldr r4, [r0, r3]  <---  B3
   .loc 1 15 0
   lsl r3, r1, #1    <---  C2
   ldr r3, [r0, r3]  <---  C3

A better method is to first compute an adjusted base address that is near all
3 fields we will access. Then we can use ldr dest, [base, offset] to load each
field with only 1 instruction.

Although this opportunity was found on the ARM target, it should also be
applicable to other architectures whose (base + offset) addressing mode has a
limited offset range.
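
A minimal sketch of the arithmetic behind this suggestion. It assumes that the
Thumb-1 word load ldr Rd, [Rn, #imm] only accepts word-aligned immediates from
0 to 124, and that pointers are 4 bytes, so the three fields sit at offsets
400, 404 and 408 (consistent with the constants 200, 202 and 204 shifted left
by one in the listing above). The helper names are hypothetical and are not
GCC functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed Thumb-1 limit for "ldr Rd, [Rn, #imm]": imm5 scaled by 4. */
#define THUMB1_LDR_MAX_OFFSET 124

/* True if a word load at this offset fits the immediate form of ldr. */
static bool fits_ldr_imm(int offset)
{
    return offset >= 0 && offset <= THUMB1_LDR_MAX_OFFSET && offset % 4 == 0;
}

/* Rebase every offset against the smallest one.  *new_base receives the
   amount to add to the original base register, rebased[] the new offsets.
   Returns true only if every rebased offset fits the immediate form. */
static bool rebase_offsets(const int *offsets, size_t n,
                           int *new_base, int *rebased)
{
    if (n == 0)
        return false;

    int base = offsets[0];
    for (size_t i = 1; i < n; i++)
        if (offsets[i] < base)
            base = offsets[i];

    bool all_fit = true;
    for (size_t i = 0; i < n; i++) {
        rebased[i] = offsets[i] - base;
        if (!fits_ldr_imm(rebased[i]))
            all_fit = false;
    }
    *new_base = base;
    return all_fit;
}
```

With the offsets 400, 408 and 404 the new base is 400 and the rebased offsets
are 0, 8 and 4, all of which fit the assumed immediate range.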


-- 
   Summary: inefficient address calculation of fields of large
struct
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314



[Bug rtl-optimization/40314] inefficient address calculation of fields of large struct

2009-05-30 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-05-31 02:42 ---
Created an attachment (id=17940)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17940&action=view)
test case to show the opportunity


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314



[Bug rtl-optimization/40314] inefficient address calculation of fields of large struct

2009-05-30 Thread carrot at google dot com


--- Comment #2 from carrot at google dot com  2009-05-31 02:51 ---
There are a lot of such opportunities in mcf from SPEC CPU 2006.

One possible implementation is to add a pass before cse. In the new pass it
should detect insn patterns like:

(set r200 400) # 400 is offset of field1
(set r201 (mem (plus r100 r200)))  # r100 contains struct base
...
(set r300 404) # 404 is offset of field2
(set r301 (mem (plus r100 r300)))  # r100 contains struct base

And rewrite them as:

(set r200 400) # keep the original insn
(set r250 (plus r100 400)) # r250 is new base
(set r201 (mem (plus r250 0)))
...
(set r300 404)
(set r251 (plus r100 400)) # r251 contains same value as r250
(set r301 (mem (plus r251 4)))

We can let dce and cse remove the redundant code; the final result should look
like:

(set r101 (plus r100 400))
(set r201 (mem (plus r101 0)))
...
(set r301 (mem (plus r101 4)))
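
One way the detection step could pick the new bases, as a rough sketch only:
sort the constant offsets used with the same struct base, then open a new base
whenever the next offset falls outside the immediate range of the current one.
The function name, the greedy strategy and the max_imm parameter are all
assumptions for illustration; none of this is existing GCC code.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparison for ints in ascending order. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts offsets[] in place, then assigns each offset to a base such that
   (offset - base) lies in [0, max_imm].  The chosen bases are written to
   bases[] (capacity n).  Returns the number of bases needed. */
static size_t choose_bases(int *offsets, size_t n, int max_imm, int *bases)
{
    size_t count = 0;

    qsort(offsets, n, sizeof offsets[0], cmp_int);
    for (size_t i = 0; i < n; i++) {
        /* Open a new base when the current one cannot reach this offset. */
        if (count == 0 || offsets[i] - bases[count - 1] > max_imm)
            bases[count++] = offsets[i];
    }
    return count;
}
```

For the offsets 400 and 404 above and a range of 124, a single base at 400 is
enough, which matches the single (plus r100 400) in the final result.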


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314



[Bug rtl-optimization/40314] inefficient address calculation of fields of large struct

2009-05-31 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-05-31 08:05 ---
(In reply to comment #3)
> I think we have enough passes already and should try to stuff this in cse.c 
> and
> fwprop.c.  See PR middle-end/33699 for related issues.
> 

It looks like that patch solved some similar issues. But there are still
several differences:

1. PR/33699 can only handle constant addresses, while in my case the addresses
are not constants. And I believe non-constant cases (memory accesses through a
pointer) occur more frequently than constant addresses (embedded systems
only?).

2. That patch is only applicable to a known base address. In my case, the known
base address of the memory accesses is the pointer to the struct; there is no
known nearby base address, so we need to create a new nearby base address.

3. That patch works on superblocks, but it looks better to optimize the memory
accesses over the whole function body. It is quite common to access memory
through the same pointer in different basic blocks, as shown in mcf.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314



[Bug rtl-optimization/40327] New: Use less instructions to add some constants to register

2009-06-03 Thread carrot at google dot com
Compiling this simple function in thumb mode:

int add_const(int x)
{
 return x+400;
}

I got:

   mov r1, #200
   lsl r3, r1, #1
   add r0, r0, r3

A better code sequence should be:

add r0, r0, 200
add r0, r0, 200

In order to apply this optimization, the constant should be at most twice the
largest immediate value in the target ISA. So this optimization should also be
useful to other architectures with a limited immediate operand range.

It can also be applied to the sub instruction.
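
A small sketch of the splitting rule, assuming the Thumb-1 add Rd, #imm8 form
with a largest immediate of 255. The function name is hypothetical. It splits
as 255 plus the remainder rather than into two equal halves; either choice
gives two add instructions.

```c
#include <stdbool.h>

/* Assumed largest immediate of the Thumb-1 "add Rd, #imm8" form. */
#define THUMB1_ADD_MAX_IMM 255

/* Split constant c into two immediates for two consecutive adds.
   Returns false when a single add is enough, or when two adds cannot
   reach the constant. */
static bool split_add_constant(int c, int *first, int *second)
{
    if (c <= THUMB1_ADD_MAX_IMM || c > 2 * THUMB1_ADD_MAX_IMM)
        return false;

    *first = THUMB1_ADD_MAX_IMM;
    *second = c - THUMB1_ADD_MAX_IMM;
    return true;
}
```

For x+400 this yields 255 and 145, so the addition still takes two
instructions instead of three.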


-- 
   Summary: Use less instructions to add some constants to register
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40327



[Bug target/40375] New: redundant register move with -mthumb

2009-06-07 Thread carrot at google dot com
Compile the following code with -mthumb -O2 -Os,

extern void foo(int*, const char*, int);
void test(const char name[], int style)
{
   foo(0, name, style);
}

I got:

push {r4, lr}
mov r3, r0  //  A
mov r2, r1  //  B
mov r0, #0  //  C
mov r1, r3  //  D
bl  foo
pop {r4, pc}

Instructions A and D together move register r0 to r1. They can be replaced
with 1 instruction
mov  r1, r0
placed between B and C.
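
To check that the shorter sequence leaves the same argument registers, here is
a toy simulation of both sequences on a 4-entry register file. The struct and
function names are made up for this sketch; only r0, r1 and r2 matter because
they carry the arguments of foo.

```c
/* Simulated register file: r[0]..r[3] stand for r0..r3. */
typedef struct {
    int r[4];
} regs_t;

/* The sequence gcc emitted: A, B, C, D. */
static regs_t emitted_moves(regs_t s)
{
    s.r[3] = s.r[0];    /* A: mov r3, r0 */
    s.r[2] = s.r[1];    /* B: mov r2, r1 */
    s.r[0] = 0;         /* C: mov r0, #0 */
    s.r[1] = s.r[3];    /* D: mov r1, r3 */
    return s;
}

/* The suggested sequence: B, then mov r1, r0, then C. */
static regs_t suggested_moves(regs_t s)
{
    s.r[2] = s.r[1];    /* B: mov r2, r1 */
    s.r[1] = s.r[0];    /*    mov r1, r0 */
    s.r[0] = 0;         /* C: mov r0, #0 */
    return s;
}
```

Both functions produce the same r0, r1 and r2 for any input. They differ only
in r3, which is just a temporary in the emitted code.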


-- 
   Summary: redundant register move with -mthumb
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375



[Bug target/40375] redundant register move with -mthumb

2009-06-07 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-06-08 03:23 ---
Created an attachment (id=17962)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17962&action=view)
test case shows the redundant register move

This problem occurs quite frequently if both caller and callee have multiple
parameters.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375



[Bug target/40375] redundant register move with -mthumb

2009-06-08 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-06-09 03:46 ---
Thank you, Steven.

(In reply to comment #3)
> "might be" is such a useless statement.
> 
> Carrot, you are aware of the -fdump-rtl-all and -dAP options, I assume?  Then
> you should have no trouble finding out:
> 1) Where the move comes from

This is the rtl dump before RA; everything is in a normal state:

cat obj/reg.c.173r.asmcons
;; Function test (test)

(note 1 0 5 NOTE_INSN_DELETED)

(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)

(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 133 [ name ])
(reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn} (expr_list:REG_DEAD
(reg:SI 0 r0 [ name ])
(nil)))

(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 134 [ style ])
(reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn} (expr_list:REG_DEAD
(reg:SI 1 r1 [ style ])
(nil)))

(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)

(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
(const_int 0 [0x0])) 168 {*thumb1_movsi_insn} (nil))

(insn 8 7 9 2 src/./reg.c:4 (set (reg:SI 1 r1)
(reg/v/f:SI 133 [ name ])) 168 {*thumb1_movsi_insn} (expr_list:REG_DEAD
(reg/v/f:SI 133 [ name ])
(nil)))

(insn 9 8 10 2 src/./reg.c:4 (set (reg:SI 2 r2)
(reg/v:SI 134 [ style ])) 168 {*thumb1_movsi_insn} (expr_list:REG_DEAD
(reg/v:SI 134 [ style ])
(nil)))

(call_insn 10 9 0 2 src/./reg.c:4 (parallel [
(call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] ) [0 S4 A32])
(const_int 0 [0x0]))
(use (const_int 0 [0x0]))
(clobber (reg:SI 14 lr))
]) 256 {*call_insn} (expr_list:REG_DEAD (reg:SI 2 r2)
(expr_list:REG_DEAD (reg:SI 1 r1)
(expr_list:REG_DEAD (reg:SI 0 r0)
(nil))))
(expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
(expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
(expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
(nil)))))

Here is rtl dump after RA, quite straightforward but inefficient:

(note 1 0 5 NOTE_INSN_DELETED)

(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)

(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 3 r3 [orig:133 name ] [133])
(reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn} (nil))

(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 2 r2 [orig:134 style ] [134])
(reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn} (nil))

(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)

(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
(const_int 0 [0x0])) 168 {*thumb1_movsi_insn} (nil))

(insn 8 7 10 2 src/./reg.c:4 (set (reg:SI 1 r1)
(reg/v/f:SI 3 r3 [orig:133 name ] [133])) 168 {*thumb1_movsi_insn}
(nil))

(call_insn 10 8 18 2 src/./reg.c:4 (parallel [
(call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] ) [0 S4 A32])
(const_int 0 [0x0]))
(use (const_int 0 [0x0]))
(clobber (reg:SI 14 lr))
]) 256 {*call_insn} (nil)
(expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
(expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
(expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
(nil)))))

(note 18 10 0 NOTE_INSN_DELETED)

> 2) Why postreload (the post-reload CSE pass) does not eliminate the redundant
> move
> 
It seems the post-reload CSE pass can't handle this case, because r0 is killed
at instruction C, so instruction D can't use r0. To make it work for this case,
we must first move instruction D before C.

As Andrew said, we need to improve scheduling before RA to handle this.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375



[Bug c++/40382] New: Useless instructions in destructor

2009-06-09 Thread carrot at google dot com
Compile following simple class with -O2 -Os -mthumb -fpic

class base {
  virtual ~base();
};

base::~base()
{
}

The destructor of this class should do nothing; a plain return is enough. But
gcc generates the following code for the D1 version of the destructor:

ldr r3, .L3
ldr r1, .L3+4
add r3, pc
ldr r2, [r3, r1]
add r2, r2, #8
str r2, [r0]
bx  lr
.L3:
.word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
.word   _ZTV4base(GOT)


-- 
   Summary: Useless instructions in destructor
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40382



[Bug c++/40382] Useless instructions in destructor

2009-06-09 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-06-09 07:35 ---
Created an attachment (id=17969)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17969&action=view)
simple class with empty virtual destructor

Some tree dump result

1. The tree dump of early stage:
cat test_class.cpp.003t.original

;; Function virtual base::~base() (null)
;; enabled by -tree-original

{
 <_vptr.base = &_ZTV4base + 8) >>>
>>;
}
:;
if ((bool) (__in_chrg & 1))
 {
   <>>
>>;
 }
return this;

2. The tree dump of late stage, the reset of vptr is redundant.

cat test_class.cpp.130t.final_cleanup

;; Function base::~base() (_ZN4baseD2Ev)

base::~base() (struct base * const this)
{
:
 this->_vptr.base = &_ZTV4base[2];
 return this;

}

;; Function virtual base::~base() (_ZN4baseD1Ev)

virtual base::~base() (struct base * const this)
{
:
 this->_vptr.base = &_ZTV4base[2];
 return this;

}

;; Function virtual base::~base() (_ZN4baseD0Ev)

virtual base::~base() (struct base * const this)
{
:
 this->_vptr.base = &_ZTV4base[2];
 operator delete (this);
 return this;

}


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40382



[Bug target/40375] redundant register move with -mthumb

2009-06-09 Thread carrot at google dot com


--- Comment #6 from carrot at google dot com  2009-06-09 13:52 ---
(In reply to comment #5)
> Hmm, I was under the impression that postreload-cse could move instructions
> too, but that was just wishful thinking.
> 
I will look into postreload-cse.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375



[Bug target/40416] New: unnecessary register spill

2009-06-11 Thread carrot at google dot com
Compile the attached source code with options -O2 -Os -mthumb -fpic, and we get
an unnecessary register spill.


-- 
   Summary: unnecessary register spill
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416



[Bug target/40416] unnecessary register spill

2009-06-11 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-06-11 14:34 ---
Created an attachment (id=17983)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17983&action=view)
test case

The spill occurs around the first loop:

push {r4, r5, r6, r7, lr}
sub sp, sp, #12
.loc 1 5 0
str r2, [sp, #4]  // A
.loc 1 6 0
add r6, r1, r2
mov r4, r0
.loc 1 8 0
b   .L2
.L5:
.loc 1 10 0
mov r7, #0
ldrsh   r5, [r4, r7]
.loc 1 12 0
cmp r2, r5
bge .L3
.loc 1 14 0
ldrb r7, [r1]
strb r7, [r1, r2]
.loc 1 15 0
strh r2, [r4]
.loc 1 16 0
lsl r1, r2, #1
sub r2, r5, r2
strh r2, [r1, r4]
.L6:
.loc 1 5 0
ldr r5, [sp, #4] //   B
lsl r4, r5, #1
add r0, r0, r4
b   .L4
.L3:
.loc 1 19 0
lsl r7, r5, #1
mov ip, r7
add r4, r4, ip
.loc 1 20 0
add r1, r1, r5
.loc 1 21 0
sub r2, r2, r5
.L2:
.loc 1 8 0
cmp r2, #0
bgt .L5
b   .L6
.L4:
.loc 1 30 0
mov r1, #0


The spill happens at instruction A and the reload at instruction B.

The spilled value is x. The source code computes next_runs and next_alpha
before the while loop and preserves them through the loop body. But the
generated code preserves next_alpha, the original runs and the original x
through the loop body and computes next_runs after the loop. This takes an
extra register and results in a register spill.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416



[Bug target/40416] unnecessary register spill

2009-06-14 Thread carrot at google dot com


--- Comment #3 from carrot at google dot com  2009-06-15 02:26 ---
Created an attachment (id=17998)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17998&action=view)
preprocessed test case

A possible code sequence without spilling is:

push {r4, r5, r6, r7, lr}
add r6, r1, r2
mov r4, r0
lsl r7, r2, 1   // New
add r0, r0, r7  // New
.loc 1 8 0
b   .L2
.L5:
.loc 1 10 0
mov r7, #0
ldrsh   r5, [r4, r7]
.loc 1 12 0
cmp r2, r5
bge .L3
.loc 1 14 0
ldrb r7, [r1]
strb r7, [r1, r2]
.loc 1 15 0
strh r2, [r4]
.loc 1 16 0
lsl r1, r2, #1
sub r2, r5, r2
strh r2, [r1, r4]
.L6:
.loc 1 5 0
b   .L4
.L3:
.loc 1 19 0
lsl r7, r5, #1
mov ip, r7
add r4, r4, ip
.loc 1 20 0
add r1, r1, r5
.loc 1 21 0
sub r2, r2, r5
.L2:
.loc 1 8 0
cmp r2, #0
bgt .L5
b   .L6
.L4:
.loc 1 30 0
mov r1, #0


-- 

carrot at google dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #17983|0                           |1
        is obsolete|                            |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416



[Bug target/40416] unnecessary register spill

2009-06-14 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-06-15 02:32 ---
In the source code, only two extra variables next_runs and next_alpha need to
be preserved through the while loop.

But in the gcc generated code, three variables are kept through the first loop.
They are next_alpha, the original runs and the original x. The expression
(next_runs = runs + x) is moved after the loop. This keeps an extra variable
live through the loop and results in register spilling.

The expression is moved by the tree-ssa-sink pass. Daniel Berlin has confirmed
it is a bug in this pass.

***** From Daniel *****
This looks like a bug, i think i know what causes it.
When I wrote this pass, i forgot to make this check:

 /* It doesn't make sense to move to a dominator that post-dominates
frombb, because it means we've just moved it into a path that always
executes if frombb executes, instead of reducing the number of
executions .  */

 if (dominated_by_p (CDI_POST_DOMINATORS, frombb, commondom))

happen regardless of whether it is a single use statement or not.
So it will sink single use statements even if it's just moving them to
places that aren't executed less frequently.

Add that check (changing commondom to sinkbb) and it should stop moving it.
***** End From Daniel *****

I will send the patch later.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416



[Bug target/40457] New: use stm and ldm to access consecutive memory words

2009-06-16 Thread carrot at google dot com
Current gcc can't make use of stm and ldm to reduce code size.


-- 
   Summary: use stm and ldm to access consecutive memory words
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40457



[Bug target/40457] use stm and ldm to access consecutive memory words

2009-06-16 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-06-16 09:11 ---
Created an attachment (id=18005)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18005&action=view)
test case

For this function

void foo(int* p)
{
  p[0] = 1;
  p[1] = 2;
}

gcc generates:

mov r1, #1
mov r3, #2
str r1, [r0]
str r3, [r0, #4]
bx  lr

We could use one stm instruction to replace the two str instructions.

For the second case:

int bar(int* p)
{
  int x = p[0] + p[1];
  return x;
}

gcc generates:

ldr r2, [r0, #4]
ldr r3, [r0]
add r0, r2, r3
bx  lr

In this case we can use one ldm to replace the two ldr instructions.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40457



[Bug target/40457] use stm and ldm to access consecutive memory words

2009-06-17 Thread carrot at google dot com


--- Comment #7 from carrot at google dot com  2009-06-17 09:30 ---
My command line options are -O2 -Os -mthumb.

The compiler didn't run into load_multiple_sequence and
store_multiple_sequence. The peephole rules specify that they apply to
TARGET_ARM only. Is there any special reason we didn't enable them in thumb
mode?

As for the ascending register numbers, do we have any code to rename a set of
registers to make them ascending? In the generated code for the second
function, the register numbers are in a different order from the memory
offsets.

ldr r2, [r0, #4]
ldr r3, [r0]
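
The ordering constraint can be written down as a small check. This is only a
sketch: it assumes the loads are already sorted by offset, it models just the
"consecutive words, ascending register numbers" rule of ldm, and it ignores
every other restriction (base register writeback, register classes). The type
and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* One word load: destination register number and byte offset from the base. */
typedef struct {
    int reg;
    int offset;
} word_load_t;

/* loads[] must be sorted by ascending offset.  Returns true if the loads
   touch consecutive words and use strictly ascending register numbers,
   which is the order in which ldm fills its register list. */
static bool can_merge_into_ldm(const word_load_t *loads, size_t n)
{
    if (n < 2)
        return false;

    for (size_t i = 1; i < n; i++) {
        if (loads[i].offset != loads[i - 1].offset + 4)
            return false;
        if (loads[i].reg <= loads[i - 1].reg)
            return false;
    }
    return true;
}
```

The two loads above are r3 at offset 0 and r2 at offset 4, so the check fails.
After swapping the two destination registers it would pass.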


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40457



[Bug target/40482] New: shift a small constant to get larger one

2009-06-18 Thread carrot at google dot com
One example is 0xff000000; we can get it by

  mov r1, 255
  lsl r1, r1, 24

Gcc generates following code:

  ldr r1, .L2
  ...
.L2:
  .word   -16777216
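
A sketch of how such constants could be recognized, assuming the Thumb-1 mov
immediate covers 0 to 255 and the lsl immediate covers 0 to 31. The function
name is hypothetical and this is not how GCC's constant synthesis is written.

```c
#include <stdbool.h>
#include <stdint.h>

/* Look for imm8 (0..255) and shift (0..31) with (imm8 << shift) == value.
   Returns true and fills *imm8 and *shift with the smallest working shift,
   or returns false if the constant has no such form. */
static bool find_shifted_imm8(uint32_t value, unsigned *imm8, unsigned *shift)
{
    for (unsigned s = 0; s < 32; s++) {
        uint32_t candidate = value >> s;
        /* The shifted-out low bits must all be zero. */
        if (candidate <= 255u && (candidate << s) == value) {
            *imm8 = candidate;
            *shift = s;
            return true;
        }
    }
    return false;
}
```

For 0xff000000 (the same bit pattern as the .word -16777216 above) this finds
imm8 = 255 and shift = 24, matching the mov/lsl pair.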


-- 
   Summary: shift a small constant to get larger one
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40482



[Bug target/40482] shift a small constant to get larger one

2009-06-18 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-06-18 07:34 ---
Created an attachment (id=18018)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18018&action=view)
test case

command line option is -O2 -Os -mthumb


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40482



[Bug target/40499] New: [missed optimization] branch to return

2009-06-19 Thread carrot at google dot com
If the function epilogue has only one return instruction, then the branch to
return can be replaced by the return instruction directly.


-- 
   Summary: [missed optimization] branch to return
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40499



[Bug target/40499] [missed optimization] branch to return

2009-06-19 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-06-20 03:56 ---
Created an attachment (id=18027)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18027&action=view)
test case

The command line options are: -march=armv5te -mthumb -Os

At the end of the function we can see

b   .L3  // This one can be replaced by pop {pc}
.L5:
mov r0, #1
.L3:
@ sp needed for prologue
pop {pc}

With option -O2 we can get similar result.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40499



[Bug target/40499] [missed optimization] branch to return not threaded on thumb

2009-06-22 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-06-22 08:00 ---
Sorry I didn't make it clear. It is a performance bug, not a code size issue.
If the epilogue is a simple return instruction, the branch to return can be
replaced by the return instruction. So we can execute one less instruction at
run time without any code size penalty.

It looks like the code at function.c:5078 can't be applied to thumb.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40499



[Bug target/40525] New: missed optimization in conditional expression

2009-06-23 Thread carrot at google dot com
For a simple conditional expression like (flag == 1 ? 2 : 0), gcc generates
code that is suboptimal in terms of both code size and performance.


-- 
   Summary: missed optimization in conditional expression
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40525



[Bug target/40525] missed optimization in conditional expression

2009-06-23 Thread carrot at google dot com


--- Comment #2 from carrot at google dot com  2009-06-23 09:09 ---
Created an attachment (id=18053)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18053&action=view)
test case

Compile the attached code with options -mthumb -march=armv5te -Os, gcc
generates

push {lr}
cmp r1, #1
bne .L3
mov r3, #2
b   .L2
.L3:
mov r3, #0
.L2:
add r0, r3, r0
pop {pc}

A better code sequence can be:

push {lr}
mov r3, 0
cmp r1, #1
bne .L3
mov r3, #2
.L3:
add r0, r3, r0
pop {pc}

With this optimization, we save 1 instruction of code size. In both the equal
and the not-equal case, the number of executed instructions is the same as
before. But in the equal case one branch instruction is replaced by a move
instruction, so it is also a win for performance.
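
At the source level the two code shapes correspond roughly to the two
functions below. The attached test case is not reproduced here, so the
function names and the (x, flag) signature are guesses based on the assembly;
a compiler is free to generate the same code for both. The sketch only shows
that the transformation preserves the result.

```c
/* Shape of the emitted code: each path assigns, then both join. */
static int add_select_with_else(int x, int flag)
{
    int r;

    if (flag == 1)
        r = 2;
    else
        r = 0;
    return r + x;
}

/* Shape of the suggested code: assign the default first, then
   conditionally overwrite it, so only one branch is needed. */
static int add_select_with_default(int x, int flag)
{
    int r = 0;

    if (flag == 1)
        r = 2;
    return r + x;
}
```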

In which pass should this optimization be done? The jump pass or the bb reorder
pass?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40525



[Bug target/40416] unnecessary register spill

2009-06-30 Thread carrot at google dot com


--- Comment #6 from carrot at google dot com  2009-06-30 07:42 ---
http://gcc.gnu.org/ml/gcc-cvs/2009-06/msg01067.html


-- 

carrot at google dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416



[Bug target/40603] New: unnecessary conversion from unsigned byte load to signed byte load

2009-06-30 Thread carrot at google dot com
Compile the following function with options -Os -mthumb -march=armv5te

int ldrb(unsigned char* p)
{
  if (p[8] <= 0x7F)
    return 2;
  else
    return 5;
}

Gcc generates the following code:

push {lr}
mov r3, #8
ldrsb   r3, [r0, r3]
mov r0, #2
cmp r3, #0
bge .L2
mov r0, #5
.L2:
@ sp needed for prologue
pop {pc}

The source code if (p[8] <= 0x7F) is translated to:

mov r3, #8
ldrsb   r3, [r0, r3]
cmp r3, #0

A better code sequence should be:

ldrb r3, [r0, 8]
cmp r3, 0x7F

This can save one instruction.

The tree dump shows in a very early pass (ldrb.c.003t.original) the comparison
was transformed to
   if ((signed char) *(p + 8) >= 0)

I guess gcc thinks comparing with 0 is much cheaper than comparing with other
numbers. Am I right?

Unfortunately, in thumb mode, loading a signed byte costs more than loading an
unsigned byte, and comparing with 0 has the same cost as comparing with 0x7F.
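
The two comparisons are interchangeable because they agree on every byte
value. A brute-force check, as a sketch: the function names are made up, and
the cast to signed char relies on the usual two's-complement wrap-around,
which is implementation-defined in older C standards but is what gcc does.

```c
#include <stdbool.h>

/* The comparison as written in the source. */
static bool test_unsigned(unsigned char b)
{
    return b <= 0x7F;
}

/* The comparison after the early transformation seen in the tree dump.
   Assumes the conversion to signed char wraps modulo 256. */
static bool test_signed(unsigned char b)
{
    return (signed char)b >= 0;
}

/* Counts the byte values on which the two forms disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;

    for (int v = 0; v <= 255; v++)
        if (test_unsigned((unsigned char)v) != test_signed((unsigned char)v))
            mismatches++;
    return mismatches;
}
```

Since both forms give the same answer, the choice between them can be made
purely on the cost of the load and the compare on the target.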


-- 
   Summary: unnecessary conversion from unsigned byte load to signed
byte load
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40603



[Bug target/40603] unnecessary conversion from unsigned byte load to signed byte load

2009-06-30 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-01 06:56 ---
Created an attachment (id=18105)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18105&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40603



[Bug target/40603] unnecessary conversion from unsigned byte load to signed byte load

2009-07-01 Thread carrot at google dot com


--- Comment #3 from carrot at google dot com  2009-07-01 10:24 ---
(In reply to comment #2)
> Subject: Re:   New: unnecessary conversion from unsigned
> byte load to signed byte load
> 
> 
> > Unfortunately in thumb mode, loading a signed byte costs more than loading 
> > an
> > unsigned byte and comparing with 0 has same cost as comparing with 0x7F.
> 
> I don't know of any core where loading a signed byte is more expensive
> than unsigned byte in thumb mode. What did you have in mind ?
> 
> I suspect what you mean is that the sign extension here is not required
> and we could get away with ldrb.
> 
In thumb1, the ldrb instruction has an addressing mode of Rn + imm5, but ldrsb
has only the addressing mode Rn + Rm. So loading an unsigned byte from p[8]
needs only one instruction:
ldrb r3, [r0, 8]

But loading a signed byte from p[8] needs two instructions:

mov   r3, 8
ldrsb r3, [r0, r3]

So in this case (base + constant offset), loading a signed byte is more
expensive than loading an unsigned byte in thumb mode.
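
The same cost argument as a tiny model. It assumes the thumb1 encodings
described above (ldrb with a 5-bit immediate of 0 to 31, ldrsb with register
offset only) and assumes the offset is small enough to be loaded with a single
mov. The function name is hypothetical and is not a GCC cost hook.

```c
#include <stdbool.h>

/* Number of thumb1 instructions needed to load a byte from
   [base + offset], for a constant offset in 0..255. */
static int thumb1_byte_load_insns(bool is_signed, int offset)
{
    /* ldrb Rd, [Rn, #imm5] covers small offsets directly. */
    if (!is_signed && offset >= 0 && offset <= 31)
        return 1;

    /* Otherwise: mov Rm, #offset ; ldrb/ldrsb Rd, [Rn, Rm] */
    return 2;
}
```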


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40603



[Bug target/40615] New: unnecessary CSE

2009-07-02 Thread carrot at google dot com
Compile the attached source code with options -march=armv5te -mthumb -Os
-fno-exceptions, gcc generates:

push {r4, lr}
sub sp, sp, #8
add r4, sp, #4  // redundant
mov r0, r4      // add  r0, sp, 4
bl  _ZN1XC1Ev
mov r0, r4      // add  r0, sp, 4
bl  _Z3barP1X
mov r0, r4      // add  r0, sp, 4
bl  _ZN1XD1Ev
add sp, sp, #8
@ sp needed for prologue
pop {r4, pc}

As mentioned in the comments, the cse is redundant. We can recompute the value
of (sp + 4) each time we want it. With this method we can save one instruction.


-- 
   Summary: unnecessary CSE
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40615



[Bug target/40615] unnecessary CSE

2009-07-02 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-02 07:39 ---
Created an attachment (id=18120)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18120&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40615



[Bug target/40657] New: allocate local variables with fewer instructions

2009-07-06 Thread carrot at google dot com
Compile following code with options -Os -mthumb -march=armv5te

extern void bar(int*);
int foo()
{
  int x;
  bar(&x);
  return x;
}

Gcc generates:

push {lr}
sub sp, sp, #12
add r0, sp, #4
bl  bar
ldr r0, [sp, #4]
add sp, sp, #12
@ sp needed for prologue
pop {pc}

A better code sequence could be:

push {r1-r3, lr}
add r0, sp, #4
bl  bar
ldr r0, [sp, #4]
@ sp needed for prologue
pop {r1-r3, pc}

The local variable allocation and deallocation can be merged into the push/pop
instructions, so we can avoid the extra sub/add instructions and save two
instructions.
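
A sketch of the decision this needs: how many extra registers must be added to
the push/pop lists to cover a given frame size. The function name and the
max_extra cap (how many registers may safely be pushed as padding and popped
back) are assumptions for illustration; a real implementation would also have
to pick which registers are safe to clobber in the pop.

```c
/* Returns the number of extra registers to add to push/pop so that they
   allocate frame_bytes of stack, or -1 if the frame cannot be covered
   this way. */
static int extra_push_regs(int frame_bytes, int max_extra)
{
    /* Each pushed register allocates exactly one 4-byte word. */
    if (frame_bytes <= 0 || frame_bytes % 4 != 0)
        return -1;

    int regs = frame_bytes / 4;
    return regs <= max_extra ? regs : -1;
}
```

For the 12-byte frame above this gives 3, which matches pushing r1-r3 in
addition to lr.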


-- 
   Summary: allocate local variables with fewer instructions
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657



[Bug target/40657] allocate local variables with fewer instructions

2009-07-06 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-06 08:16 ---
Created an attachment (id=18140)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18140&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657



[Bug target/40657] allocate local variables with fewer instructions

2009-07-06 Thread carrot at google dot com


--- Comment #5 from carrot at google dot com  2009-07-07 06:44 ---
Could we do the optimization in function thumb1_expand_prologue? If we detect
this opportunity there, we can remove the sp adjustments from the prologue and
epilogue and instead add extra registers to the push/pop operands.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657



[Bug target/40670] New: Load floating point constant 0 directly

2009-07-07 Thread carrot at google dot com
Compile the following function with options -Os -mthumb -march=armv5te

float return_zero()
{
  return 0;
}

Gcc generates:

ldr r0, .L2
bx  lr
.L3:
.align  2
.L2:
.word   0

Floating point 0 has the same bit pattern as integer 0, so the function body
can be simplified to

   mov r0, #0
   bx  lr

This removes the memory load and the constant pool, so the resulting code is
smaller and faster.

The memory load and constant pool are expanded in pass machine_reorg.
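
The claim about the bit pattern can be checked with a few lines of C. This is
a minimal sketch that assumes float is a 32-bit IEEE 754 single precision
value; the helper name float_bits is made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit representation of a float.  memcpy is used
   instead of a pointer cast to stay within the aliasing rules.  */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

float_bits(0.0f) is 0, which is why a plain integer move is enough. Note that
this holds only for positive zero: float_bits(-0.0f) is 0x80000000.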


-- 
   Summary: Load floating point constant 0 directly
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40670



[Bug target/40670] Load floating point constant 0 directly

2009-07-07 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-07 09:38 ---
Created an attachment (id=18149)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18149&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40670



[Bug target/40680] New: extra register move

2009-07-08 Thread carrot at google dot com
Compile the attached source code with options -Os -mthumb -march=armv5te, gcc
generates:

push {r3, r4, r5, lr}
.LCFI0:
mov r4, r0
ldr r0, [r0]
bl  _Z3foof
ldr r1, [r4, #4]
@ sp needed for prologue
add r5, r0, #0
bl  _Z3barfi
mov r0, r5   // *
bl  _Z3fffi  // *
mov r4, r5   // *
mov r5, r0   // *
mov r0, r4   // *
bl  _Z3fffi  // *
mov r1, r0   // *
mov r0, r5   // *
bl  _Z3setii
pop {r3, r4, r5, pc}

There is an obvious extra register move (mov r4, r5) in the marked section, a
better code sequence of the marked section could be:

mov r0, r5
bl  _Z3fffi
mov r4, r0
mov r0, r5
bl  _Z3fffi
mov r1, r0
mov r0, r4

The marked code sequence before the scheduler is:

mov r4, r5
mov r0, r5
bl  _Z3fffi
mov r5, r0
mov r0, r4
bl  _Z3fffi
mov r1, r0
mov r0, r5

The instruction (mov r4, r5) is generated by the register allocator. I don't
know why RA generates this instruction.


-- 
   Summary: extra register move
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40680



[Bug target/40680] extra register move

2009-07-08 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-08 09:36 ---
Created an attachment (id=18155)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18155&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40680



[Bug target/40697] New: inefficient code to extract least bits from an integer value

2009-07-09 Thread carrot at google dot com
Compile the following function with options -Os -mthumb -march=armv5te

unsigned get_least_bits(unsigned value)
{
  return value << 9 >> 9;
}

Gcc generates:

ldr r3, .L2
@ sp needed for prologue
and r0, r0, r3
bx  lr
.L3:
.align  2
.L2:
.word   8388607

A better code sequence should be:

   lsl   r0, r0, #9
   lsr   r0, r0, #9
   bx    lr

It is smaller (no constant pool) and faster.

This transformation is done very early; it is already visible in the first tree
dump, shift.c.003t.original. Gcc assumes that an and with a constant is cheaper
than two shifts, which is not true for this case in the Thumb ISA. On the other
hand, if the constant used in the and is small, such as 7, the and is
definitely cheaper than two shifts. So which method is better depends heavily
on both the constant and the target ISA, and it is difficult to make the right
decision at the tree level. Maybe we should define a peephole rule to do it.
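
The two forms compute the same value, which is what allows either direction of
the rewrite. A small C sketch of the equivalence follows; the function names
are invented for the illustration and a 32-bit unsigned type is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Clear the top n bits with two shifts, as in the source above.  */
static uint32_t low_bits_by_shift(uint32_t value, unsigned n)
{
    assert(n < 32);
    return (value << n) >> n;
}

/* Clear the top n bits with a mask, the form gcc folds the shifts into.  */
static uint32_t low_bits_by_mask(uint32_t value, unsigned n)
{
    assert(n < 32);
    return value & (UINT32_MAX >> n);
}
```

With n = 9 the mask is 8388607, the constant that ends up in the pool above.
With n = 29 the mask is 7, the small constant case mentioned in the text.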


-- 
   Summary: inefficient code to extract least bits from an integer
value
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40697



[Bug target/40697] inefficient code to extract least bits from an integer value

2009-07-09 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-09 09:24 ---
Created an attachment (id=18166)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18166&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40697



[Bug target/40730] New: redundant memory load

2009-07-13 Thread carrot at google dot com
Compile the attached source code with options -Os -mthumb -march=armv5te
-fno-strict-aliasing, Gcc generates:

iterate:
push {lr}
ldr r3, [r1]       // C
b   .L5
.L4:
ldr r3, [r3, #8]   // D
.L5:
str r3, [r0]       // A
ldr r3, [r0]       // B
cmp r3, #0
beq .L3
ldr r2, [r3, #4]
cmp r2, #0
beq .L4
.L3:
str r3, [r0, #12]
@ sp needed for prologue
pop {pc}

Look at the instructions marked A and B. Instruction A stores r3 to [r0], but
instruction B immediately loads it back into r3.

Store A originally existed as two copies, one after instruction C and one after
instruction D. After register allocation both copies used the same registers
and looked exactly the same. In pass csa, cleanup_cfg was called; it found the
identical instructions and moved them in front of instruction B. Now
instruction B is obviously redundant.

Is it OK to remove this kind of redundant code in pass dce?


-- 
   Summary: redundant memory load
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730



[Bug target/40730] redundant memory load

2009-07-13 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-13 08:58 ---
Created an attachment (id=18183)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18183&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730



[Bug target/40741] New: code size explosion for integer comparison

2009-07-14 Thread carrot at google dot com
Compile the following function with options -Os -mthumb -march=armv5te:

int returnbool(int a, int b)
{
if (a < b)
return 1;
return 0;
}

Gcc 4.5 generates:

lsr r3, r1, #31
asr r2, r0, #31
cmp r0, r1
adc r2, r2, r3
mov r0, r2
mov r3, #1
eor r0, r0, r3
@ sp needed for prologue
bx  lr

while gcc 4.3.1 generates:

push {lr}
mov r3, #1
cmp r0, r1
blt .L2
mov r3, #0
.L2:
mov r0, r3
@ sp needed for prologue
pop {pc}

Counting only the instructions that do the comparison, it is 7 vs 4. I don't
know whether it is faster to replace one branch instruction with 4 ALU
instructions, but it is definitely a regression for code size.

The long code sequence is generated by the expand pass.
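
To see what the GCC 4.5 sequence computes, it can be modeled step by step in
C. This is a sketch for reading the assembly, not code from GCC; the function
name is made up, the local variables are named after the registers, and the
carry flag of cmp is modeled as an unsigned comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the branchless GCC 4.5 sequence for (a < b).  */
static int returnbool_branchless(int32_t a, int32_t b)
{
    uint32_t r3 = (uint32_t)b >> 31;               /* lsr r3, r1, #31 */
    uint32_t r2 = 0u - ((uint32_t)a >> 31);        /* asr r2, r0, #31: 0 or all ones */
    uint32_t carry = (uint32_t)a >= (uint32_t)b;   /* cmp r0, r1: C is set when there is no borrow */

    r2 = r2 + r3 + carry;                          /* adc r2, r2, r3: now r2 == (a >= b) */
    assert(r2 <= 1u);

    return (int)(r2 ^ 1u);                         /* eor with 1: (a < b) */
}
```

The sign terms correct the unsigned comparison done by cmp, so the adc result
is the signed (a >= b), and the final eor inverts it to (a < b).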


-- 
   Summary: code size explosion for integer comparison
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40741



[Bug target/40741] code size explosion for integer comparison

2009-07-14 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-14 08:41 ---
Created an attachment (id=18191)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18191&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40741



[Bug target/40730] redundant memory load

2009-07-14 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-07-14 09:14 ---
At the tree level the two stores are different statements. Only after register
allocation do the two stores use the same register, which makes the load
redundant.

try_crossjump_bb tries to find the same instruction sequence in all
predecessors of a basic block bb and moves that sequence to the head of bb.
This function triggers the transformation here, and the store is moved to just
before the load.

I tried -fgcse-las, but it did not remove the load.

(In reply to comment #2)
> -fgcse-las should do the trick.  Note that PRE would do this kind of
> optimization on the tree-level, but it is disabled with -Os (so is gcse).
> 
> :
>   D.1614_2 = p2_1(D)->front;
>   p1_3(D)->head = D.1614_2;
>   goto ;
> 
> :
>   D.1616_8 = D.1615_4->next;
>   p1_3(D)->head = D.1616_8;
> 
> :
>   D.1615_4 = p1_3(D)->head;
> 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730



[Bug target/40730] redundant memory load

2009-07-15 Thread carrot at google dot com


--- Comment #7 from carrot at google dot com  2009-07-15 08:07 ---
(In reply to comment #6)
> Carrot, can you please try this test case with my patch
> "crossjump_abstract.diff" from Bug 20070 applied?
> 

I tried your patch. It did remove the redundant memory load. The following is
the output:

push {lr}
ldr r3, [r1]
.L6:
str r3, [r0]
mov r2, r3  // M
cmp r3, #0
bne .L5
b   .L3
.L4:
ldr r3, [r3, #8]
b   .L6
.L5:
ldr r1, [r3, #4]
cmp r1, #0
beq .L4
.L3:
str r2, [r0, #12]
@ sp needed for prologue
pop {pc}

Pass ifcvt noticed that the two stores differ only in their pseudo register
numbers and that the two pseudo registers do not conflict, so it renamed one of
them to match the other and performed the basic block cross jump on them
earlier. Then pass iterate.c.161r.cse2 detected the redundant load and removed
it.

But the patch introduced another redundant move instruction, marked M. Where r2
is used, r3 still contains the same value as r2, so we could use r3 there
instead. I think this is a separate problem.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730



[Bug target/40783] New: inefficient code to accumulate function return values

2009-07-16 Thread carrot at google dot com
Compile the following code with options -Os -mthumb -march=armv5te

union FloatIntUnion {
float  fFloat;
int fSignBitInt;
};

static inline float fast_inc(float x) {
  union FloatIntUnion data;
  data.fFloat = x;
  data.fSignBitInt += 1;
  return data.fFloat;
}

extern int MyConvert(float);
extern float dumm();
int time_math() {
  int i;
  int sum = 0;
  const int repeat = 100;
  float f;

  f = dumm();
  for (i = repeat - 1; i >= 0; --i) {
  sum += (int)f; f = fast_inc(f);
  sum += (int)f; f = fast_inc(f);
  sum += (int)f; f = fast_inc(f);
  sum += (int)f; f = fast_inc(f);
  }

  f = dumm();
  for (i = repeat - 1; i >= 0; --i) {
sum += MyConvert(f); f = fast_inc(f);
sum += MyConvert(f); f = fast_inc(f);
sum += MyConvert(f); f = fast_inc(f);
  }
  return sum;
}

Gcc generates:

push {r4, r5, r6, r7, lr}
sub sp, sp, #12
bl  dumm
mov r4, #0
mov r6, #99
add r5, r0, #0
.L2:
add r0, r5, #0
bl  __aeabi_f2iz
add r5, r5, #1
add r4, r0, r4
add r0, r5, #0
bl  __aeabi_f2iz
add r5, r5, #1
add r4, r4, r0
add r0, r5, #0
bl  __aeabi_f2iz
add r5, r5, #1
add r4, r4, r0
add r0, r5, #0
bl  __aeabi_f2iz
add r5, r5, #1
add r4, r4, r0
sub r6, r6, #1
bcs .L2
bl  dumm
mov r6, #99
add r5, r0, #0
.L3:
add r0, r5, #0
bl  MyConvert
add r5, r5, #1
str r0, [sp, #4]
add r0, r5, #0
bl  MyConvert
add r5, r5, #1
mov r7, r0
add r0, r5, #0
bl  MyConvert
ldr r3, [sp, #4]
add r5, r5, #1
add r7, r7, r3
add r7, r7, r0
add r4, r4, r7
sub r6, r6, #1
bcs .L3
add sp, sp, #12
mov r0, r4
@ sp needed for prologue
pop {r4, r5, r6, r7, pc}

The source code contains two similar loops, but the generated code is quite
different. The code for the first loop is as expected: after each function
call, the returned value is accumulated immediately. The code for the second
loop is much worse: after each function call, the returned value is saved to a
different place, and only after all calls in the same loop iteration are the
saved results added together. The code for the second loop is larger and
slower, and it even causes a register spill.

The intermediate representations of the two loops start to diverge in pass
float2int.c.078t.reassoc1. I don't know why gcc performs different transforms
on the two loops in this pass.


-- 
   Summary: inefficient code to accumulate function return values
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40783



[Bug target/40783] inefficient code to accumulate function return values

2009-07-16 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-17 06:56 ---
Created an attachment (id=18212)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18212&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40783



[Bug target/40815] New: redundant neg instruction caused by loop-invariant

2009-07-21 Thread carrot at google dot com
Compile the following function with options -Os -mthumb -march=armv5te

void bar(char*, char*, int);
void foo(char* left, char* rite, int element)
{
  while (left <= rite)
  {
rite -= element;
bar(left, rite, element);
left += element;
  }
}

Gcc generates:

push {r3, r4, r5, r6, r7, lr}
mov r5, r0
mov r6, r1
mov r7, r2
neg r4, r2        // A
b   .L2
.L3:
add r6, r6, r4    // B
mov r0, r5
mov r1, r6
mov r2, r7
bl  bar
add r5, r5, r7
.L2:
cmp r5, r6
bls .L3
@ sp needed for prologue
pop {r3, r4, r5, r6, r7, pc}

Note that instruction A computes (r4 = -r2) and r4 is used only by instruction
B (r6 = r6 + r4). This can be simplified to (r6 = r6 - r7, where r7 contains
the original r2), which saves one instruction.

The expression rite -= element was transformed into the following by the
gimplify pass:

element.0 = (unsigned int) element;
D.2003 = -element.0;
rite = rite + D.2003;

This form was kept until pass neg.c.156r.loop2_invariant, where the expression
-element was identified as loop invariant and hoisted out of the loop, which
produced the current result.

Is this gimplify transform intended? Is there any chance to merge the
expressions above back into (rite = rite - element) before the loop invariant
pass?
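
That the two forms are interchangeable follows from unsigned wraparound
arithmetic. The sketch below models 32-bit addresses as uint32_t, which is an
assumption made for the illustration (it matches a 32-bit target); the
function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/* rite = rite + (-(unsigned) element), the form produced by gimplify.  */
static uint32_t step_neg_add(uint32_t rite, int32_t element)
{
    uint32_t neg = 0u - (uint32_t)element;   /* D.2003 = -element.0 */
    return rite + neg;
}

/* rite = rite - element, the form that needs no separate negation.  */
static uint32_t step_sub(uint32_t rite, int32_t element)
{
    return rite - (uint32_t)element;
}

/* Check that both forms agree for one input.  */
static void check_step(uint32_t rite, int32_t element)
{
    assert(step_neg_add(rite, element) == step_sub(rite, element));
}
```

Both functions return the same value for every input, so merging the negation
and the addition back into one subtraction does not change the result. It only
changes whether the negated value needs a register of its own.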


-- 
   Summary: redundant neg instruction caused by loop-invariant
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40815



[Bug target/40815] redundant neg instruction caused by loop-invariant

2009-07-21 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-21 07:15 ---
Created an attachment (id=18234)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18234&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40815



[Bug target/40815] redundant neg instruction caused by loop-invariant

2009-07-21 Thread carrot at google dot com


--- Comment #3 from carrot at google dot com  2009-07-21 07:35 ---
Created an attachment (id=18235)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18235&action=view)
dump of -fdump-rtl-expand-details


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40815



[Bug target/40835] New: redundant comparison instruction

2009-07-23 Thread carrot at google dot com
Compile the following code with options -Os -mthumb -march=armv5te

int bar();
void goo(int, int);
void foo()
{
  int v = bar();
  if (v == 0)
return;
  goo(1, v);
}

Gcc generates:

push {r3, lr}
bl  bar
mov r1, r0
cmp r0, #0    // *
beq .L1
mov r0, #1
bl  goo
.L1:
@ sp needed for prologue
pop {r3, pc}

The compare instruction is redundant since the previous move instruction has
already set the condition code according to the value of r0.


-- 
   Summary: redundant comparison instruction
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835



[Bug target/40835] redundant comparison instruction

2009-07-23 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-23 08:38 ---
Created an attachment (id=18241)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18241&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835



[Bug target/40835] redundant comparison instruction

2009-07-23 Thread carrot at google dot com


--- Comment #2 from carrot at google dot com  2009-07-24 02:11 ---
It seems HAVE_cc0 is disabled for arm. What's the reason behind it?

A simple method is to add a peephole rule to handle it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835



[Bug target/40835] redundant comparison instruction

2009-07-24 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-07-24 07:37 ---
As I had figured out, HAVE_cc0 is disabled, and cse_condition_code_reg does
nothing for the thumb target.

I also found that a conditional branch instruction is always in the same insn
pattern as the preceding compare instruction. So I wonder whether there is any
way to express the optimized sequence (movs followed by bcc) at all.

Are there any other places I should look at?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835



[Bug target/40900] New: redundant sign extend of short function returned value

2009-07-29 Thread carrot at google dot com
Compile the following code with options -Os -mthumb -march=armv5te

extern short shortv2();
short shortv1()
{
  return shortv2();
}

Gcc generates

push {r3, lr}
bl  shortv2
lsl r0, r0, #16    // A
asr r0, r0, #16    // B
pop {r3, pc}

The returned value in register r0 is already a sign extended short value, but
instructions A and B sign extend it again. So these two instructions are
redundant.
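
The effect of instructions A and B can be modeled in C to show why repeating
them changes nothing. This is a sketch with a made-up function name; it is
written without shifting negative values so that it does not depend on
implementation-defined behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "lsl r0, r0, #16; asr r0, r0, #16": keep the low 16 bits and
   copy bit 15 into the upper half of the register.  */
static int32_t sign_extend16(int32_t r0)
{
    uint32_t low = (uint32_t)r0 & 0xFFFFu;

    /* If bit 15 is set, the 16-bit value is negative.  */
    int32_t result = (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;

    assert(result >= INT16_MIN && result <= INT16_MAX);
    return result;
}
```

Applying sign_extend16 to a value that is already in the range of short
returns it unchanged. So if shortv2 returns a properly extended value, as the
report assumes, the two shifts in shortv1 have no effect.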


-- 
   Summary: redundant sign extend of short function returned value
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40900



[Bug target/40900] redundant sign extend of short function returned value

2009-07-29 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-07-29 08:57 ---
Created an attachment (id=18266)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18266&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40900



[Bug target/40956] New: GCSE opportunity in if statement

2009-08-03 Thread carrot at google dot com
Compile the following function with options -Os -mthumb -march=armv5te
-frename-registers

int foo(int p, int* q)
{
  if (p!=9)
*q = 0;
  else
*(q+1) = 0;
  return 3;
}

GCC generates:

push {lr}
cmp r0, #9 // D
beq .L2
mov r3, #0 // A
str r3, [r1]
b   .L3
.L2:
mov r0, #0 // B
str r0, [r1, #4]   // C
.L3:
mov r0, #3
pop {pc}

If we replace r0 with r3 in instructions B and C, then A and B become
identical. We can then move that single instruction before instruction D and
save one instruction.

Is this a gcse opportunity?


-- 
   Summary: GCSE opportunity in if statement
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40956



[Bug target/40956] GCSE opportunity in if statement

2009-08-03 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-08-03 22:55 ---
Created an attachment (id=18294)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18294&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40956



[Bug target/41004] New: missed merge of basic blocks

2009-08-07 Thread carrot at google dot com
Compile the attached source code with options -Os -march=armv5te -mthumb.
Gcc generates the following code snippet:

  ...
   cmp r0, r2
   bne .L5
   b   .L15     <--- A
.L9:
   ldr r3, [r1]
   cmp r3, #0
   beq .L7
   str r0, [r1, #8]
   b   .L8
.L7:
   str r3, [r1, #8]
.L8:
   ldr r1, [r1, #4]
   b   .L12     <--- C
.L15:
   mov r0, #1   <--- B
.L12:
   cmp r1, r2   <--- D
   bne .L9
   ...

Instruction A jumps to B, which then falls through to D.
Instruction C jumps to D.

No other instruction jumps to instruction B, so we can put instruction B just
before A; then A jumps to D, and C can be removed.

Two functions could potentially do this optimization: merge_blocks_move and
try_forward_edges.

Function try_forward_edges can only redirect a series of forwarder blocks. It
can't move the target blocks before the forwarder blocks.

Function merge_blocks_move merges blocks b and c only when neither of them is a
forwarder block. In this case block A is a forwarder block, so they are not
merged.


-- 
   Summary: missed merge of basic blocks
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004



[Bug target/41004] missed merge of basic blocks

2009-08-07 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-08-08 00:10 ---
Created an attachment (id=18326)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18326&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004



[Bug c/39989] New: [optimization]

2009-04-30 Thread carrot at google dot com
Compiling this code snippet with gcc for arm, 

typedef struct node node_t;
typedef struct node *node_p;
struct node
{
 int orientation;
 node_p pred;
 long depth;
};

node_t *primal_iminus(long *delta, node_t *iplus, node_t*jplus)
{
   node_t *iminus = 0;
   if( iplus->depth < jplus->depth )
   {
   if( iplus->orientation )
   iminus = iplus;
   iplus = iplus->pred;
   }
   return iminus;
}

I got:

   .save   {lr}
   push {lr}
.LCFI0:
.LVL0:
   .loc 1 13 0
   ldr r0, [r1, #8]
.LVL1:
   ldr r3, [r2, #8]
   cmp r0, r3
   bge .L2
   .loc 1 15 0
   ldr r2, [r1]
.LVL2:
   cmp r2, #0
   beq .L2
   mov r0, r1
.LVL3:
   b   .L3
.LVL4:
.L2:
   mov r0, #0
.LVL5:
.L3:
.LVL6:
   .loc 1 20 0
   @ sp needed for prologue
   pop {pc}

Here lr is still live at the exit of the function, so we can simply use BX lr
to return and avoid the prologue instruction push {lr}.

The options I used are:
-fno-exceptions -Wno-multichar -march=armv5te -mtune=xscale -msoft-float -fpic
-mthumb-interwork -ffunction-sections -funwind-tables -fstack-protector
-fno-short-enums -D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__
-D__ARM_ARCH_5TE__ -fmessage-length=0 -W -Wall -Wno-unused -DSK_RELEASE
-DNDEBUG -g -Wstrict-aliasing=2 -fgcse-after-reload -frerun-cse-after-loop
-frename-registers -DNDEBUG -UDEBUG -MD -O2 -Os -mthumb -fomit-frame-pointer
-fno-strict-aliasing -finline-limit=64 -finline-functions
-fno-inline-functions-called-once


-- 
   Summary: [optimization]
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39989



[Bug target/39989] No need to save LR in some cases

2009-04-30 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-05-01 06:12 ---
Created an attachment (id=17787)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17787&action=view)
sample code showing the optimization opportunity


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39989



[Bug target/39989] No need to save LR in some cases

2009-04-30 Thread carrot at google dot com


--- Comment #2 from carrot at google dot com  2009-05-01 06:21 ---
Actually gcc already implements this optimization, but it doesn't work in this
case.

The reload pass has to determine the stack frame, so it needs to check whether
the push/pop lr optimization applies. One of the criteria is whether there is
any far jump inside the function. Unfortunately, at this point gcc cannot yet
determine each instruction's length or the basic block layout, so it cannot
know the offset of a jump. To be conservative it assumes every jump is a far
jump, so any jump in a function prevents the push/pop lr optimization.


-- 

carrot at google dot com changed:

   What|Removed |Added

 CC|            |carrot at google dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39989



[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog

2009-05-03 Thread carrot at google dot com


--- Comment #6 from carrot at google dot com  2009-05-04 02:21 ---
We can compute the maximum possible function length first. If that length is
small enough that a far jump can never be necessary, we can do this
optimization safely.
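
A rough sketch of that check in C is shown here. It is hypothetical and not
taken from GCC: the function name, the array of per-instruction length bounds,
and the threshold are all assumptions. The threshold is chosen a little below
the roughly 2 KB reach of a 16-bit Thumb unconditional branch and would need
to be verified against the architecture manual.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed conservative limit in bytes; an illustrative value, not a
   constant taken from GCC.  */
enum { ASSUMED_NEAR_REACH = 2040 };

/* Hypothetical check: insn_max_len[i] is an upper bound on the encoded
   size of instruction i.  If the sum of the bounds stays below the
   limit, no jump inside the function can be a far jump.  */
static bool far_jump_possible(const int *insn_max_len, int n_insns)
{
    int total = 0;

    for (int i = 0; i < n_insns; i++) {
        assert(insn_max_len[i] > 0);
        total += insn_max_len[i];
        if (total >= ASSUMED_NEAR_REACH)
            return true;   /* conservatively assume a far jump may be needed */
    }
    return false;
}
```

Because the bound is computed from maximum lengths, the answer can only err on
the safe side: a small function is never reported as needing a far jump, and a
large one falls back to the current conservative behavior.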


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570



[Bug target/56993] New: power gcc built 416.gamess generates wrong result

2013-04-17 Thread carrot at google dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56993



 Bug #: 56993
   Summary: power gcc built 416.gamess generates wrong result
Classification: Unclassified
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
  Host: powerpc-linux-gnu
Target: powerpc-linux-gnu
 Build: powerpc-linux-gnu





When I use the trunk gcc to run spec2006 416.gamess, I got the following error

$ runspec --config=test.cfg --tune=base --size=test --nofeedback --noreportable
game
runspec v6152 - Copyright 1999-2008 Standard Performance Evaluation Corporation
Using 'linux-ydl23-ppc' tools
Reading MANIFEST... 18357 files
Loading runspec modules
Locating benchmarks...found 31 benchmarks in 6 benchsets.
Reading config file '/usr/local/google/carrot/spec2006/config/test.cfg'
Benchmarks selected: 416.gamess
Compiling Binaries
  Building 416.gamess base Linux64 default: (build_base_Linux64.)

Build successes: 416.gamess(base)

Setting Up Run Directories
  Setting up 416.gamess test base Linux64 default: created
(run_base_test_Linux64.)
Running Benchmarks
  Running (#1) 416.gamess test base Linux64 default

Contents of exam29.err

STOP IN ABRT

*** Miscompare of exam29.out; for details see
/usr/local/google/carrot/spec2006/benchspec/CPU2006/416.gamess/run/run_base_test_Linux64./exam29.out.mis
Invalid run; unable to continue.
If you wish to ignore errors please use '-I' or ignore_errors

The log for this run is in
/usr/local/google/carrot/spec2006/result/CPU2006.111.log
The debug log for this run is in
/usr/local/google/carrot/spec2006/result/CPU2006.111.log.debug

*
* Temporary files were NOT deleted; keeping temporaries such as
* /usr/local/google/carrot/spec2006/result/CPU2006.111.log.debug and
* /usr/local/google/carrot/spec2006/tmp/CPU2006.111
* (These may be large!)
*
runspec finished at Wed Apr 17 16:37:27 2013; 93 total seconds elapsed







My gcc is configured as

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc-linux-gnu/4.6/lto-wrapper
Target: powerpc-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.6.2-12'
--with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.6 --enable-shared --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --enable-plugin --enable-objc-gc --enable-secureplt
--disable-softfloat --enable-targets=powerpc-linux,powerpc64-linux
--with-cpu=default32 --with-long-double-128 --enable-checking=release
--build=powerpc-linux-gnu --host=powerpc-linux-gnu --target=powerpc-linux-gnu
Thread model: posix
gcc version 4.6.2 (Debian 4.6.2-12)

GCC4.8 has the same error, but gcc4.7 is good.


[Bug other/54398] Incorrect ARM assembly when building with -fno-omit-frame-pointer -O2

2012-09-06 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398

Carrot  changed:

   What|Removed |Added

 CC||carrot at google dot com

--- Comment #4 from Carrot  2012-09-07 01:19:31 UTC 
---
The code before the position Ahmad pointed out is already wrong.

The faulty instruction sequence is:

   asrs r5, r5, #1
   asr ip, ip, #1         ; A, tmp1.x
   asrs r0, r0, #1        ; B, tmp1.y
   asrs r6, r6, #1
   mov r4, r1
   add r8, ip, r6         ; C, tmp3.x
   add r9, r0, r5         ; D, tmp3.y
   add r7, sp, #0
   asr r1, r8, #1
   add ip, r4, #8         ; E
   asr r9, r9, #1
   str r1, [r7, #16]
   str r9, [r7, #20]
   ldmia   r3, {r0, r1}   ; F
   stmia   r4, {r0, r1}

Instruction A computes tmp1.x, instruction C uses it to compute tmp3.x, and
instruction E overwrites the value of tmp1.x. But in the source code tmp1.x is
still needed to execute "dst1->p2 = tmp1;", so in the end dst1->p2.x gets
garbage.

Similarly, instruction B computes tmp1.y, instruction D uses it to compute
tmp3.y, and instruction F overwrites it. After executing "dst1->p2 = tmp1;",
dst1->p2.y gets another garbage value.


For comparison, following is the correct version

   asrs r7, r7, #1        ; A, tmp1.x
   asrs r0, r0, #1        ; B, tmp1.y
   asrs r6, r6, #1
   asrs r5, r5, #1
   sub sp, sp, #28
   mov r4, r1
   add r8, r7, r6         ; C, tmp3.x
   add ip, r0, r5         ; D, tmp3.y
   str r7, [sp, #0]       ; X, save tmp1.x
   str r0, [sp, #4]       ; Y, save tmp1.y
   asr r1, ip, #1
   add r7, r4, #8         ; E
   asr r8, r8, #1
   str r1, [sp, #20]
   str r8, [sp, #16]
   ldmia   r3, {r0, r1}   ; F
   stmia   r4, {r0, r1}

The obvious difference is the extra instructions X and Y, which save the value
of tmp1 to the stack before the registers are reused.

The simplified preprocessed source code is:


struct A
{
  int x;
  int y;

  void f(const A &a, const A &b)
  {
  x = (a.x + b.x)>>1;
  y = (a.y + b.y)>>1;
  }
};

class C {
public:
A p1;
A p2;
A p3;

bool b;
void g(C *, C *) const;
};

void C::g(C *dst1, C *dst2) const
{
 A tmp1, tmp2, tmp3;

 tmp1.f(p2,p1);
 tmp2.f(p2,p3);
 tmp3.f(tmp1, tmp2);

 dst1->p1 = p1;
 dst1->p2 = tmp1;
 dst1->p3 =
 dst2->p1 = tmp3;
 dst2->p2 = tmp2;
 dst2->p3 = p3;
}

The simplified command line is:

./cc1plus -fpreprocessed t.ii -quiet -dumpbase t.cpp -mthumb "-march=armv7-a"
"-mtune=cortex-a15" -auxbase t -O2 -fno-omit-frame-pointer -o t.s

It looks like the dse2 pass performed a wrong transformation.

GCC 4.7 and trunk generate correct code.


[Bug other/54398] Incorrect ARM assembly when building with -fno-omit-frame-pointer -O2

2012-09-10 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398

--- Comment #5 from Carrot  2012-09-11 00:10:45 UTC 
---
The bug is in the local DSE sub-step in dse.c.

(insn/f 70 69 71 2 (set (reg/f:SI 7 r7)
        (plus:SI (reg/f:SI 13 sp)
            (const_int 0 [0]))) t.ii:24 -1
     (nil))
This insn sets up the hard frame pointer (hfp), r7.

(insn 12 30 17 2 (set (mem/s/c:SI (reg/f:SI 7 r7) [4 tmp1.x+0 S4 A64])
        (reg:SI 12 ip [orig:137 D.1799 ] [137])) t.ii:8 694 {*thumb2_movsi_insn}
     (nil))
This is the store instruction. Its memory base address is r7, the hfp register.
DSE treats hfp as constant inside the function, so it assigns the store to a
store group.

(insn 37 36 34 2 (set (reg/f:SI 8 r8 [170])
        (reg/f:SI 7 r7)) t.ii:32 694 {*thumb2_movsi_insn}
     (expr_list:REG_EQUIV (plus:SI (reg/f:SI 7 r7)
            (const_int 0 [0]))
        (nil)))
This insn copies r7 to r8, so r8 also equals the value of sp.

(insn 38 35 39 2 (parallel [
            (set (reg:SI 0 r0)
                (mem/s/c:SI (reg/f:SI 8 r8 [170]) [3 tmp1+0 S4 A64]))
            (set (reg:SI 1 r1)
                (mem/s/c:SI (plus:SI (reg/f:SI 8 r8 [170])
                        (const_int 4 [0x4])) [3 tmp1+4 S4 A32]))
        ]) t.ii:32 369 {*ldm2_ia}
     (nil))
This is the load instruction. Its memory base address is r8, and
const_or_frame_p returns false for r8. After applying cselib_expand_value_rtx
to r8 we get the base address sp, for which const_or_frame_p still returns
false. So the corresponding group id is -1 (no corresponding store group), and
the load cannot match the store in insn 12. DSE therefore considers the memory
stored by insn 12 to be never read, i.e. a dead store that can be eliminated.

The problem is that an hfp-based address is treated as a constant base address,
while sp and addresses derived from it are treated as varying base addresses, so
the two kinds are never matched when detecting interfering memory accesses. But
in many cases sp and hfp hold the same value. Even worse, addresses copied or
derived from hfp can be recognized as derived from sp, as in this case, which
causes the memory accesses to be mismatched.
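
The mismatch described above can be illustrated with a toy model. This is only
a sketch of the behaviour as I described it, not the real data structures in
dse.c: all names below (toy_base, toy_access, toy_group_id and so on) are
hypothetical. The store corresponds to insn 12 (4 bytes at hfp+0) and the load
to insn 38 (8 bytes at sp+0).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not GCC's dse.c data structures. */
enum toy_base { TOY_BASE_SP, TOY_BASE_HFP };

struct toy_access {
    enum toy_base base; /* register the address was resolved to */
    int offset;         /* byte offset from that base */
    int width;          /* access size in bytes */
};

/* As described above: hfp-based addresses get a store group, sp-based ones get -1. */
int toy_group_id(enum toy_base base)
{
    return base == TOY_BASE_HFP ? 0 : -1;
}

/* True when the two byte ranges overlap, assuming both bases hold the same value. */
bool toy_ranges_overlap(struct toy_access a, struct toy_access b)
{
    assert(a.width > 0 && b.width > 0);
    return a.offset < b.offset + b.width && b.offset < a.offset + a.width;
}

/* Group-based matching: accesses in different groups are assumed not to interfere. */
bool toy_interferes_by_group(struct toy_access store, struct toy_access load)
{
    if (toy_group_id(store.base) != toy_group_id(load.base))
        return false;
    return toy_ranges_overlap(store, load);
}
```

With sp equal to hfp, the two ranges really do overlap, but the group-based
check reports no interference, which is how the store ends up looking dead.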


[Bug other/54398] Incorrect ARM assembly when building with -fno-omit-frame-pointer -O2

2012-09-12 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398

--- Comment #8 from Carrot  2012-09-12 20:57:33 UTC 
---
(In reply to comment #7)
> 
> This rings a bell.
> 
> Maybe the patch mentioned below needs backporting given Carrot is 
> reporting this against the 4.6 branch. What's not clear if this is 
> reproducible on anything later though.
> 
> http://old.nabble.com/-PATCH--Prevent-cselib-substitution-of-FP,-SP,-SFP-td33080657.html
> 
The patch can fix this bug.


[Bug c++/54574] New: G++ accepts parameters with wrong types in parent constructor

2012-09-13 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54574

 Bug #: 54574
   Summary: G++ accepts parameters with wrong types in parent
constructor
Classification: Unclassified
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com


When compiling the following source code,

class C
{
public:
  C (int* Items[]);
};

template class A : public C
{
public:
  A (int Items[])
: C (Items) {// C is called with wrong parameter type, expects int**
  };
};

int i[5];
A yyy(i);

Trunk g++ silently accepts it, while clang produces the following error message:


cursesm.ii:11:7: error: no matching constructor for initialization of 'C'
: C (Items) {
  ^  ~
cursesm.ii:4:3: note: candidate constructor not viable: no known conversion
from 'int *' to 'int **' for 1st argument; take the address of the argument
with &
  C (int* Items[]);
  ^
cursesm.ii:1:7: note: candidate constructor (the implicit copy constructor) not
viable: no known conversion from 'int *' to 'const C' for 1st argument;
class C
  ^
cursesm.ii:10:3: error: constructor for 'A' must explicitly initialize the
base class 'C' which does not have a default constructor
  A (int Items[])
  ^
cursesm.ii:16:8: note: in instantiation of member function 'A::A'
requested here
A yyy(i);
   ^
cursesm.ii:1:7: note: 'C' declared here
class C
  ^

It also impacts branches 4.6 and 4.7.


[Bug middle-end/41004] missed merge of basic blocks

2009-08-19 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-08-19 21:55 ---
(In reply to comment #2)
> Why does the basic block reordering pass also not handle this?
> 

Basic block reordering is disabled with option -Os.

The basic block reordering algorithm is for performance only; it usually
increases code size, so it is not run when optimizing for size. But in this
specific case, the extra branch is removed when I compile with -O2.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004



[Bug c++/3187] gcc lays down two copies of constructors

2009-08-26 Thread carrot at google dot com


--- Comment #34 from carrot at google dot com  2009-08-27 01:40 ---
There is one optimization that we can do without affecting the ABI and linker
compatibility. The deleting destructor (D0) always contains the content of the
complete destructor (D1) followed by a function call to delete. So instead of
cloning the abstract destructor function body into the deleting destructor (D0),
we can generate a function call to the complete destructor (D1) followed by a
function call to delete.
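
The idea can be modelled in plain C as a rough sketch. The struct and function
names below are hypothetical stand-ins: widget_complete_dtor plays the role of
D1, widget_deleting_dtor plays the role of D0, and free stands in for the call
to delete. It is not what the C++ front end emits, only the shape of the
proposed call structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical C model of a class whose destructor frees a member. */
struct widget {
    int *buffer;
};

int widget_dtor_calls = 0; /* instrumentation for this example only */

struct widget *widget_new(size_t n)
{
    struct widget *w = malloc(sizeof *w);
    if (w == NULL)
        return NULL;
    w->buffer = calloc(n, sizeof *w->buffer);
    return w;
}

/* Plays the role of the complete destructor (D1): destroys the object in place. */
void widget_complete_dtor(struct widget *w)
{
    assert(w != NULL);
    free(w->buffer);
    w->buffer = NULL;
    widget_dtor_calls++;
}

/* Plays the role of the deleting destructor (D0): call D1, then release storage. */
void widget_deleting_dtor(struct widget *w)
{
    widget_complete_dtor(w);
    free(w); /* stands in for the call to delete */
}
```

The destructor body exists only once; D0 costs two calls instead of a second
copy of the body.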


-- 

carrot at google dot com changed:

   What|Removed |Added

 CC||carrot at google dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3187



[Bug middle-end/41396] New: missed space optimization related to basic block reorder

2009-09-18 Thread carrot at google dot com
Compiling the attached source code with options -march=armv5te -mthumb -Os, I got:

push {r4, lr}
ldr r4, [r0, #8]
ldr r3, [r0, #4]
b   .L2
.L7:
ldr r2, [r3, #8]
ldr r1, [r2]
ldr r2, [r3]
add r2, r1, r2
ldr r1, [r3, #4]
ldr r1, [r1]
sub r2, r2, r1
ldr r1, [r3, #12]
cmp r1, #1
beq .L4
cmp r1, #2
bne .L3
b   .L12   // C
.L4:   // BEGIN BLOCK B
ldr r1, [r0]
neg r1, r1
cmp r2, r1
bge .L3
b   .L9   // END BLOCK B
.L12:  // BEGIN BLOCK A
ldr r1, [r0]
cmp r2, r1
bgt .L9
.L3:
add r3, r3, #16
.L2:
cmp r3, r4
bcc .L7
mov r0, #0
b   .L6   // END BLOCK A
.L9:
mov r0, #1
.L6:
@ sp needed for prologue
pop {r4, pc}


If we swap block A and block B, we can remove two branch instructions:
instruction C and the branch at the end of block B.

Do we need a new basic block reordering algorithm for code size optimization?


-- 
   Summary: missed space optimization related to basic block reorder
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396



[Bug middle-end/41396] missed space optimization related to basic block reorder

2009-09-18 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-09-18 07:57 ---
Created an attachment (id=18602)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18602&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396



[Bug tree-optimization/41442] New: missed optimization for boolean expression

2009-09-22 Thread carrot at google dot com
The boolean expression ((p1->next && !p2->next) || p2->next) can be simplified
to (p1->next || p2->next), but gcc fails to detect this.
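
A quick way to convince yourself of the equivalence is to check all four input
combinations. In this small sketch a stands for "p1->next is non-null" and b
for "p2->next is non-null"; the function names are mine, for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* a stands for p1->next != NULL, b for p2->next != NULL. */
bool original_form(bool a, bool b)
{
    return (a && !b) || b;
}

bool simplified_form(bool a, bool b)
{
    return a || b;
}

/* Checks all four input combinations; aborts via assert if the forms differ. */
void check_forms_agree(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            assert(original_form(a, b) == simplified_form(a, b));
}
```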


The attached source code is an example. Compiling it with options -Os
-march=armv5te -mthumb, I got:

push {lr}
ldr r3, [r0]
cmp r3, #0  
beq .L2 
ldr r3, [r1]// redundant load and comparison
mov r0, #0  
cmp r3, #0  //
beq .L3 // can branch to L3 directly
.L2:
ldr r0, [r1]
neg r3, r0
adc r0, r0, r3
.L3:
@ sp needed for prologue
pop {pc}


-- 
   Summary: missed optimization for boolean expression
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41442



[Bug tree-optimization/41442] missed optimization for boolean expression

2009-09-22 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-09-23 06:49 ---
Created an attachment (id=18634)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18634&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41442



[Bug target/41481] New: missed optimization in cse

2009-09-27 Thread carrot at google dot com
Compile the following code with options -Os -march=armv5te -mthumb:

class A
{
 public:
  int ah;
  unsigned field : 2;
};

void foo(A* p)
{
  p->ah = 1;
  p->field = 1;
}

We get:

mov r3, #1 // A
str r3, [r0]
ldrb r3, [r0, #4]
mov r2, #3
bic r3, r3, r2
mov r2, #1 // B
orr r3, r3, r2
strb r3, [r0, #4]
@ sp needed for prologue
bx  lr

Both instructions A and B load the constant 1 into a register. We could load 1
into r1 in instruction A and use r1 wherever the constant 1 is required, so
instruction B can be removed.

The cse pass misses this opportunity because it requires all expressions to be
of the same mode. At the rtl level the first 1 is in SI mode and the second 1 is
in QI mode. ARM doesn't have any physical registers of QI mode, so both are put
into 32-bit physical registers, which causes the redundant load of constant 1.


-- 
   Summary: missed optimization in cse
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41481



[Bug target/41481] missed optimization in cse

2009-09-27 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-09-27 09:13 ---
Created an attachment (id=18662)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18662&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41481



[Bug target/41514] New: redundant compare instruction of consecutive conditional branches

2009-09-30 Thread carrot at google dot com
When the attached source code is compiled with options -Os -march=armv5te
-mthumb, gcc generates:

push {lr}
cmp r0, #63 // A
beq .L3
cmp r0, #63 // B
bhi .L4
cmp r0, #45
beq .L3
cmp r0, #47
bne .L5
b   .L3
.L4:
cmp r0, #99
bne .L5
.L3:
mov r0, #1
b   .L2
.L5:
mov r0, #0
.L2:
@ sp needed for prologue
pop {pc}

Instruction B is the same as instruction A, and no instruction between them
clobbers the condition codes. So we can remove instruction B.


-- 
   Summary: redundant compare instruction of consecutive conditional
branches
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41514



[Bug target/41514] redundant compare instruction of consecutive conditional branches

2009-09-30 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-09-30 08:25 ---
Created an attachment (id=18671)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18671&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41514



[Bug target/41514] redundant compare instruction of consecutive conditional branches

2009-10-01 Thread carrot at google dot com


--- Comment #3 from carrot at google dot com  2009-10-01 07:37 ---
(In reply to comment #2)
> Where does it come from? (Remember: option -dAP, then look at .s file)
> 

The first several instructions and corresponding rtl patterns are:

cmp r0, #63
beq .L3
cmp r0, #63
bhi .L4

(jump_insn 8 3 35 src/./test5.c:3 (set (pc)
(if_then_else (eq (reg/v:SI 0 r0 [orig:135 ch ] [135])
(const_int 63 [0x3f]))
(label_ref 18)
(pc))) 201 {*cbranchsi4_insn} (expr_list:REG_BR_PROB (const_int
2900 [0xb54])
(nil))
 -> 18)

(note 35 8 9 [bb 3] NOTE_INSN_BASIC_BLOCK)

(jump_insn 9 35 36 src/./test5.c:3 (set (pc)
(if_then_else (gtu (reg/v:SI 0 r0 [orig:135 ch ] [135])
(const_int 63 [0x3f]))
(label_ref 14)
(pc))) 201 {*cbranchsi4_insn} (expr_list:REG_BR_PROB (const_int
5000 [0x1388])
(nil))
 -> 14)

In Thumb's instruction patterns, the compare and branch instructions can't be
expressed separately, so we can't easily remove the second compare instruction
in the middle end.

I just noticed that the second conditional branch (higher than 63) is totally
unnecessary if we compare for equality with 63, 45, 47 and 99 one by one. This
is another missed optimization exposed by this test case.
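
To illustrate why the "higher than 63" branch is redundant, here is a small
sketch. The attached test case is not reproduced in this thread, so
is_special_chain is only my assumption of what the source looks like;
is_special_split mirrors the branch structure of the generated code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the attached test case, which is not shown here. */
bool is_special_chain(unsigned ch)
{
    return ch == 63 || ch == 45 || ch == 47 || ch == 99;
}

/* Mirrors the branch structure of the generated code, including the split at 63. */
bool is_special_split(unsigned ch)
{
    if (ch == 63)
        return true;
    if (ch > 63)            /* the bhi branch */
        return ch == 99;
    return ch == 45 || ch == 47;
}

/* Aborts via assert if the two forms ever disagree on an 8-bit input. */
void check_split_is_redundant(void)
{
    for (unsigned ch = 0; ch < 256; ch++)
        assert(is_special_chain(ch) == is_special_split(ch));
}
```

Both forms give the same answer for every input, so with only four values to
test the range split saves no comparisons.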


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41514



[Bug target/41653] New: not optimal result for multiplication with constant when -Os is specified

2009-10-10 Thread carrot at google dot com
Compile the following code with options -Os -mthumb -march=armv5te

int mul12(int x)
{
  return x*12;
}

GCC generates:

lsl r3, r0, #1
add r0, r3, r0
lsl r0, r0, #2
@ sp needed for prologue
bx  lr

This code sequence may be good for speed, but when we optimize for size we can
get a shorter code sequence:

mov  r3, #12
mul  r0, r3, r0
bx   lr

This code is generated by the expand pass. We may consider generating
different instructions when optimizing for size.

This kind of multiplication is usually found in computing the address of an
array element.
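
For reference, the two instruction sequences for x*12 correspond to the
following C forms. This is just a sketch to show that they compute the same
value; the function names are mine, and unsigned arithmetic is used so that the
shifts are well defined for negative inputs.

```c
#include <assert.h>

/* Shift-and-add form, matching the lsl/add/lsl sequence: ((x << 1) + x) << 2. */
int mul12_shift_add(int x)
{
    unsigned ux = (unsigned)x; /* unsigned arithmetic avoids undefined signed shifts */
    return (int)(((ux << 1) + ux) << 2);
}

/* Single-multiply form, matching the mov/mul sequence. */
int mul12_multiply(int x)
{
    return (int)((unsigned)x * 12u);
}

/* Aborts via assert if the two forms differ on a small range of inputs. */
void check_mul12(void)
{
    for (int x = -1000; x <= 1000; x++)
        assert(mul12_shift_add(x) == mul12_multiply(x));
}
```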


-- 
   Summary: not optimal result for multiplication with constant when
-Os is specified
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41653



[Bug target/41705] New: missed if conversion optimization

2009-10-14 Thread carrot at google dot com
When the attached source code is compiled with options -Os -march=armv5te
-mthumb, gcc generates:

push {lr}
ldr r3, [r0]
cmp r3, #0   // B
bne .L3
ldr r3, [r0, #4]
b   .L2
.L3:
mov r3, #0   // A
.L2:
ldr r2, [r0, #8]
@ sp needed for prologue
ldr r0, [r2]
add r0, r3, r0
pop {pc}

Instruction A can be moved before instruction B, which should be handled by
ifcvt.c:find_if_case_2. Notice the following code in find_if_header:

  if (dom_info_state (CDI_POST_DOMINATORS) >= DOM_NO_FAST_QUERY
      && (! HAVE_conditional_execution || reload_completed))
    {
      if (find_if_case_1 (test_bb, then_edge, else_edge))
        goto success;
      if (find_if_case_2 (test_bb, then_edge, else_edge))
        goto success;
    }

After reload_completed, the target of the conditional assignment happens to be
allocated to the same physical register as the condition variable. This prevents
it from being moved in front of the compare and branch instructions.

Before reload_completed, HAVE_conditional_execution prevents find_if_case_2 from
being called. So we miss this optimization opportunity.

The ARM target has conditional execution capability, but Thumb actually can't do
conditional execution. Do we have any way to let the compiler know this?
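
At the source level, the transformation I expect looks roughly like the sketch
below. The attached test case is not shown here, so struct node and both
function names are hypothetical; the field offsets are chosen to match the
loads at offsets 0, 4 and 8 in the generated code on a 32-bit target. Note
that t is a separate variable from the tested flag, which is exactly what is
no longer true after reload has put both in r3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout matching the offsets 0, 4 and 8 used in the generated code. */
struct node {
    int flag;
    int value;
    const int *extra;
};

/* Shape of the generated code: the constant is assigned on its own branch arm. */
int sum_two_arms(const struct node *p)
{
    assert(p != NULL && p->extra != NULL);
    int t;
    if (p->flag != 0)
        t = 0;              /* instruction A */
    else
        t = p->value;
    return t + *p->extra;
}

/* Shape after the transformation: A runs unconditionally before the test. */
int sum_hoisted(const struct node *p)
{
    assert(p != NULL && p->extra != NULL);
    int t = 0;              /* instruction A, moved ahead of compare B */
    if (p->flag == 0)
        t = p->value;
    return t + *p->extra;
}
```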


-- 
   Summary: missed if conversion optimization
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705



[Bug target/41705] missed if conversion optimization

2009-10-14 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-10-14 09:29 ---
Created an attachment (id=18798)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18798&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705



[Bug target/41653] not optimal result for multiplication with constant when -Os is specified

2009-10-15 Thread carrot at google dot com


--- Comment #2 from carrot at google dot com  2009-10-15 08:18 ---
arm_size_rtx_costs calls thumb1_rtx_costs for TARGET_THUMB1.

thumb1_rtx_costs is also called by several other functions. Looking at its
implementation briefly, it is actually tuned for speed only. The following are
some obvious examples:

case UDIV:
case UMOD:
case DIV:
case MOD:
  return 100;

case TRUNCATE:
  return 99;

So a new function thumb1_size_rtx_costs is required to model Thumb-1 code
size, right?


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41653



[Bug target/41705] missed if conversion optimization

2009-10-15 Thread carrot at google dot com


--- Comment #3 from carrot at google dot com  2009-10-15 08:25 ---
> 
> > 
> > Target ARM has conditional execution capability, but thumb actually can't do
> > conditional execution. Do we have any method to let the compiler know this?
> 
> Note that this is relevant only for Thumb1 and not for Thumb2. Thumb2 has
> conditional code generation and GCC does make an effort to generate 
> conditional
> code for it.
> 
> Can we work around this by undef'ing HAVE_conditional_execution in the backend
> headers and defining this to TARGET_THUMB1 ? 
> 

I will try this method, thank you.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705



[Bug tree-optimization/41778] New: missed dead store elimination

2009-10-21 Thread carrot at google dot com
When the attached source code is compiled with options -Os -march=armv5te
-mthumb, gcc generates:

push {lr}
ldr r3, [r1, #4]  // redundant
ldrb r3, [r3]  // redundant
@ sp needed for prologue
pop {pc}

There are two redundant instructions.
When compiled with options -O2 -march=armv5te -mthumb, gcc generates the
following expected result:

foo:
@ sp needed for prologue
bx  lr

The optimization done at -O2 comes from this patch:
http://gcc.gnu.org/viewcvs?view=revision&revision=145172. But this piece of
code is in the pre pass, which is disabled when -Os is specified, so the
unoptimized code is passed on to the rtl passes. In the rtl passes the dead
store is caught and removed along with some related code, but not all of it is
removed, so we still see the two redundant instructions.

We should also add this optimization to the dead store elimination pass to
benefit -Os cases.


-- 
   Summary: missed dead store elimination
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41778



[Bug tree-optimization/41778] missed dead store elimination

2009-10-21 Thread carrot at google dot com


--- Comment #1 from carrot at google dot com  2009-10-21 08:50 ---
Created an attachment (id=18850)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18850&action=view)
test case


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41778



[Bug middle-end/41762] internal compiler error when compiling xorg-server

2009-10-23 Thread carrot at google dot com


--- Comment #9 from carrot at google dot com  2009-10-23 09:15 ---
(In reply to comment #5)
> This is fixed on trunk by revision 149082:
> 
> http://gcc.gnu.org/ml/gcc-cvs/2009-06/msg01067.html
> 

Patch 149082 contains two parts: 1. a fix for a wrong optimization in
tree-ssa-sink.c, which affects performance only; 2. a fix for an i386 back end
bug in i386.c.

I've tried the bug fix in i386.c on its own; unfortunately it doesn't work. So
it looks more like the better optimization in the patch hides an unknown bug.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41762



[Bug target/41705] missed if conversion optimization

2009-10-27 Thread carrot at google dot com


--- Comment #4 from carrot at google dot com  2009-10-27 09:15 ---
A patch http://gcc.gnu.org/viewcvs?view=revision&revision=153584 has been
checked in.


-- 

carrot at google dot com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||FIXED


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705



[Bug target/47133] New: code size opportunity for boolean expression evaluation

2010-12-31 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47133

   Summary: code size opportunity for boolean expression
evaluation
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
Target: arm-eabi


Compile the following code with options -march=armv7-a -mthumb -Os

struct S
{
  int f1, f2;
};

int t04(int x, struct S* p)
{
  return p->f1 == 9 && p->f2 == 0;
}

GCC 4.6 generates:

t04:
ldr r3, [r1, #0]
cmp r3, #9 // A
bne .L3
ldr r0, [r1, #4]
rsbs r0, r0, #1
it cc
movcc r0, #0
bx lr // C
.L3:
movs r0, #0 // B
bx lr


Instruction B can be moved before instruction A, and instruction C can be
removed:

t04:
ldr r3, [r1, #0]
movs r0, #0
cmp r3, #9
bne .L3
ldr r0, [r1, #4]
rsbs r0, r0, #1
it cc
movcc r0, #0
.L3:
bx lr

When compiled to ARM instructions, it has the same problem.

It should be enabled for code size optimization only, because it may execute
one more instruction at run time.

Looks like an if-conversion opportunity.
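
In C terms the two code shapes correspond roughly to the sketch below. The
function names t04_two_exits and t04_one_exit are mine; struct S and the
parameters are the ones from the test case above.

```c
#include <assert.h>
#include <stddef.h>

struct S {
    int f1, f2;
};

/* Shape of the current output: separate exit for the "not 9" case. */
int t04_two_exits(int x, const struct S *p)
{
    (void)x;
    assert(p != NULL);
    if (p->f1 != 9)
        return 0;           /* instruction B followed by its own bx lr */
    return p->f2 == 0;      /* followed by bx lr, instruction C */
}

/* Shape of the proposed output: result preset to 0, single exit. */
int t04_one_exit(int x, const struct S *p)
{
    (void)x;
    assert(p != NULL);
    int result = 0;         /* instruction B hoisted above compare A */
    if (p->f1 == 9)
        result = (p->f2 == 0);
    return result;
}
```

The second form always executes the assignment of 0, which is the one extra
instruction at run time mentioned above.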


[Bug rtl-optimization/47373] New: avoid goto table to reduce code size when optimized for size

2011-01-20 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47373

   Summary: avoid goto table to reduce code size when optimized
for size
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
  Host: linux
Target: arm-linux-androideabi


Created attachment 23040
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23040
modified testcase

When I compiled infback.c from zlib 1.2.5 with options -march=armv7-a
-mthumb -Os, gcc 4.6 generated the following code for a large switch statement:

subs r3, r3, #11
cmp r3, #18
bhi .L16
tbh [pc, r3, lsl #1]
.L23:
.2byte (.L17-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L18-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L154-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L20-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L16-.L23)/2
.2byte (.L21-.L23)/2
.2byte (.L121-.L23)/2
.L17:

GCC generates a goto table for 19 cases. The table and the instructions that
manipulate it occupy 19*2 + 10 = 48 bytes.

Actually most of the targets in the table are the same. There are only 6 targets
other than .L16. So if we generate a sequence of cmp & br instructions, we need
only 6 cmp&br pairs and one br to the default, which is only 4*6+2=26 bytes.
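
The size comparison can be written down as a small cost model. This is only a
sketch using the byte counts quoted above; the constant and function names are
mine, and it assumes every compare and branch in the chain gets a 16-bit
encoding.

```c
#include <assert.h>

/* Byte counts taken from the figures in the report above (Thumb-2, -Os). */
enum {
    TABLE_ENTRY_BYTES = 2,   /* one .2byte entry */
    DISPATCH_BYTES = 10,     /* instructions that manipulate the table */
    CMP_BRANCH_BYTES = 4,    /* one cmp plus one conditional branch */
    DEFAULT_BRANCH_BYTES = 2 /* final branch to the default label */
};

int jump_table_bytes(int entries)
{
    assert(entries >= 0);
    return entries * TABLE_ENTRY_BYTES + DISPATCH_BYTES;
}

int compare_chain_bytes(int distinct_targets)
{
    assert(distinct_targets >= 0);
    return distinct_targets * CMP_BRANCH_BYTES + DEFAULT_BRANCH_BYTES;
}
```

For this switch, jump_table_bytes(19) gives 48 and compare_chain_bytes(6)
gives 26, matching the numbers above.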

When I randomly modified the source code, gcc sometimes generated absolute
addresses in the goto table, which doubles the table size and makes the result
worse. The modified source code is attached.


[Bug rtl-optimization/47454] New: registers are not allocated according to its preferred order

2011-01-25 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47454

   Summary: registers are not allocated according to its preferred
order
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
Target: arm-linux-androideabi


Created attachment 23115
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23115
testcase

The attached test case is extracted from zlib. When I compiled it with gcc 4.6
and options -march=armv7-a -mthumb -Os, I got the following code:

push {r4, r5, r6, r7, r8, lr}
mov r6, r0
mov r4, r1
cmp r0, #0
beq .L2
ldr r5, [r0, #0]
cmp r5, #0
beq .L2
cmp r1, #0
bge .L3
negs r4, r1
movs r7, #0
b .L4
.L3:
asrs r7, r1, #4
adds r7, r7, #1
cmp r1, #47
it le
andle r4, r1, #15
.L4:
adds r3, r4, #0
sub r8, r4, #8
it ne
movne r3, #1
cmp r8, #7
ite ls
movls r8, #0
andhi r8, r3, #1
cmp r8, #0
bne .L2
ldr r1, [r5, #8]
cbz r1, .L5
ldr r3, [r5, #4]
cmp r3, r4
beq .L5
ldr r3, [r6, #4]
ldr r0, [r6, #8]
blx r3
str r8, [r5, #8]
.L5:
str r7, [r5, #0]
mov r0, r6
str r4, [r5, #4]
pop {r4, r5, r6, r7, r8, lr}
b inflateReset
.L2:
mvn r0, #1
pop {r4, r5, r6, r7, r8, pc}

Note that register r8 is used many times, but register r2 is never used. In
Thumb-2, r8 is a high register, and using it forces 32-bit instruction
encodings. If we replace r8 with r2, the code size is reduced considerably in
this case.

In arm.h REG_ALLOC_ORDER is defined as
3,  2,  1,  0, 12, 14,  4,  5, 6,  7,  8, 10,  9, 11, 13, 15 ...

We can see that r2 should be used before r8, but that is not what happens.
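
What I expected is, in effect, a "first free register in preference order"
choice. The sketch below is a toy model of that expectation only, not how
GCC's register allocator actually works; pick_register and the in_use array
are hypothetical names, and only the order table comes from arm.h as quoted
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Leading entries of REG_ALLOC_ORDER as quoted above. */
static const int alloc_order[] = { 3, 2, 1, 0, 12, 14, 4, 5, 6, 7, 8, 10, 9, 11, 13, 15 };

/* Toy picker, not GCC's allocator: returns the first free register in
   preference order, or -1 if every listed register is in use. */
int pick_register(const bool in_use[16])
{
    assert(in_use != NULL);
    for (size_t i = 0; i < sizeof alloc_order / sizeof alloc_order[0]; i++) {
        int reg = alloc_order[i];
        if (!in_use[reg])
            return reg;
    }
    return -1;
}
```

Under this model r8 would be chosen only after r0-r7, r12 and r14 are all
occupied, which is not the situation in the code above, where r2 stays free.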


[Bug rtl-optimization/47454] registers are not allocated according to its preferred order

2011-01-31 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47454

--- Comment #3 from Carrot  2011-01-31 08:48:40 UTC 
---
(In reply to comment #2)
> -frename-registers should help for this issue on the ARM.

All uses of r8 can be renamed to r2, but in this case only two of them have
been renamed.


[Bug target/47764] New: The constant load instruction should be hoisted out of loop

2011-02-15 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

   Summary: The constant load instruction should be hoisted out of
loop
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
Target: arm-linux-androideabi


Created attachment 23359
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23359
testcase

The attached test case is extracted from zlib. When it is compiled with options
-march=armv7-a -mthumb -Os, gcc 4.6 generates:

init_block:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
movs r3, #0
.L2:
adds r2, r0, r3
adds r3, r3, #4
movs r1, #0   // A
cmp r3, #1144
strh r1, [r2, #60] @ movhi  // B
bne .L2
movs r3, #0
.L3:
adds r2, r0, r3
adds r3, r3, #4
movs r1, #0   // C
cmp r3, #120
strh r1, [r2, #2352] @ movhi
bne .L3
movs r2, #0
.L4:
adds r1, r0, r2
adds r2, r2, #4
movs r3, #0   // D
cmp r2, #76
strh r3, [r1, #2596] @ movhi
bne .L4
movs r2, #1
str r3, [r0, #2760]
strh r2, [r0, #1084] @ movhi
str r3, [r0, #2756]
str r3, [r0, #2764]
str r3, [r0, #2752]
bx lr

Note that instruction A in loop L2 loads the constant 0 into register r1, then
instruction B stores r1 into memory. There is no other use of r1 in the loop,
so it's better to move instruction A out of the loop.

Similarly, instruction C can be moved out of loop L3. Actually it can be
removed, since after instruction A register r1 already contains 0 and no
instruction modifies it later.

Similarly, instruction D can be moved out of loop L4. It can also be removed if
we exchange the register usage of r1 and r3 in loop L4.
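
Written at the source level, the desired loop shape is roughly the sketch
below. The struct layout and names are hypothetical (one 16-bit field per
4-byte entry, as the strh with a stride of 4 suggests), not the actual zlib
declarations. A compiler is free to keep or undo this at the source level; the
point is only to show where the constant should be materialized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: one 16-bit field per 4-byte entry. */
struct entry {
    unsigned short freq;
    unsigned short other;
};

void clear_freqs(struct entry *table, size_t count)
{
    assert(table != NULL || count == 0);
    const unsigned short zero = 0;  /* instruction A, done once before the loop */
    for (size_t i = 0; i < count; i++)
        table[i].freq = zero;       /* only the store, instruction B, stays inside */
}
```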


[Bug target/47777] New: use __aeabi_idivmod to compute quotient and remainder at the same time

2011-02-16 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=4

   Summary: use __aeabi_idivmod to compute quotient and remainder
at the same time
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
Target: arm-eabi


Compile the following source code with options -march=armv7-a -O2

int t06(int x, int y)
{
  int a = x / y;
  int b = x % y;
  return a+b;
}

GCC 4.6 generates:

t06:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, lr}
mov r6, r0
mov r5, r1
bl __aeabi_idiv
mov r1, r5
mov r4, r0
mov r0, r6
bl __aeabi_idivmod
add r0, r4, r1
ldmfd sp!, {r4, r5, r6, pc}

It calls __aeabi_idiv to compute the quotient and __aeabi_idivmod to compute
the remainder. Actually __aeabi_idivmod computes the quotient and remainder at
the same time. By taking advantage of this we can simplify the code to:

push {r4, lr}
bl __aeabi_idivmod
add r0, r0, r1
pop {r4, pc}
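
A source-level analogue of "one call that returns both results" is the
standard C div function, shown in the sketch below. The function name
t06_single_division is mine, and I am not claiming that any particular
compiler lowers div to __aeabi_idivmod; the sketch only shows that one
combined operation is enough for t06.

```c
#include <assert.h>
#include <stdlib.h>

/* Same result as x / y + x % y, computed from a single div() call. */
int t06_single_division(int x, int y)
{
    assert(y != 0);
    div_t qr = div(x, y);   /* quotient and remainder together */
    return qr.quot + qr.rem;
}
```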


[Bug target/47764] The constant load instruction should be hoisted out of loop

2011-02-20 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

--- Comment #3 from Carrot  2011-02-21 03:15:45 UTC 
---
> Any ideas of how this improvement could be implemented, Carrot?

The root cause of this problem is that ARM/Thumb store instructions can't
directly store an immediate value to memory, but gcc doesn't realize this early
enough. For most of the rtl phase, the following form is kept:

  (insn 41 38 42 3 (set (mem:HI (plus:SI (reg/f:SI 169)
  (const_int 60 [0x3c])) [2 MEM[(struct deflate_state *)D.2085 
  _3 + 60B]+0 S2 A16])
  (const_int 0 [0])) src/trees.c:45 696 {*thumb2_movhi_insn}
   (expr_list:REG_DEAD (reg/f:SI 169)
  (nil)))

Only at register allocation does gcc discover the restriction of the store
instruction and split it into two instructions: load 0 into a register, and
store the register to memory. But by then it's too late for loop optimization.

One possible method is to split this insn earlier than loop optimization (maybe
directly in the expand pass) and let the loop and cse optimizations do the
rest. It may increase register pressure in parts of the program; we should
rematerialize the constant in such cases.


[Bug target/47831] New: avoid if-convertion if the conditional instructions and following conditional branch has the same condition

2011-02-21 Thread carrot at google dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47831

   Summary: avoid if-convertion if the conditional instructions
and following conditional branch has the same
condition
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: car...@google.com
Target: arm-linux-androideabi


Created attachment 23423
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23423
testcase

When the attached source code is compiled with options -march=armv7-a -mthumb
-Os, GCC 4.6 generates:

ras_validate:
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
push {r0, r1, r4, r5, r6, lr}
add r4, sp, #4
movs r2, #4
mov r1, r4
mov r5, r0
bl foo
cmp r0, #0
it ge   // A
movge r6, r0   // B
bge .L3   // C
b .L7   // D
.L4:
adds r3, r6, r4
mov r0, r5
subs r6, r6, #1
ldrb r1, [r3, #-1] @ zero_extendqisi2
bl bar
adds r3, r0, #1
beq .L2
.L3:
cmp r6, #0
bne .L4
mov r0, r6
b .L2
.L7:
mov r0, #-1
.L2:
pop {r2, r3, r4, r5, r6, pc}

Instruction sequence ABCD can be replaced with

   blt .L7
   mov r6, r0
   b   .L3

In both cases (lt or ge), the number of executed instructions is no greater
than in the original code, so the result is shorter and faster.

