[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
--- Comment #8 from carrot at google dot com 2009-05-04 10:08 ---
Sorry for my ignorance of gcc. What types of instructions will reload add?
Spilling and loading registers, and more?

By reading the implementation of thumb_far_jump_used_p() I came to the
conclusion that if reload thinks there is a far jump, later passes won't
change this decision. But if reload thinks there is no far jump, later passes
still need to re-check the far jump's existence and may change this decision.
So if reload occasionally makes a wrong decision, a later pass should correct
it, is that right?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570
[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
--- Comment #10 from carrot at google dot com 2009-05-05 15:32 ---
(In reply to comment #9)
> (In reply to comment #8)
> > Sorry for my ignorance of gcc. What types of instructions will reload add?
> > Spilling and loading registers, and more?
>
> That's pretty much it, but...

Before register spilling, it must have used up all physical registers,
including callee-saved registers. Any saving of a callee-saved register should
already have disabled this optimization.

> > By reading the implementation of thumb_far_jump_used_p() I came to the
> > conclusion that if reload thinks there is a far jump, later passes won't
> > change this decision. But if reload thinks there is no far jump, later
> > passes still need to re-check the far jump's existence and may change this
> > decision. So if reload occasionally makes a wrong decision, a later pass
> > should correct it, is that right?
>
> Once reload has completed we can't change the decision as to whether or not
> LR gets saved onto the stack or not. Unfortunately, that doesn't play well
> with constant pools. We sometimes need to inline these, and that might cause
> branches to be pushed out of range. Since we don't inline the pools until
> after reload has completed, that's a major stumbling block. The current code
> just isn't aware of these issues.

It looks like a bug in the current code, and my patch tries to exploit it. We
should fix it by checking for a far jump (or thumb_force_lr_save) in the
reload pass only and simply reusing that computed value in later passes.

It looks like computing the exact limit is very difficult, if not impossible.
Could we simply use a predefined constant which is much smaller than the far
jump threshold as the limit? For example, use the constant 256, which is only
1/8 of the far jump threshold. I don't expect a larger function to have any
chance of satisfying the other conditions: a leaf function that doesn't use
any callee-saved registers.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570
[Bug rtl-optimization/40314] New: inefficient address calculation of fields of large struct
Given a structure and 3 field accesses

typedef struct network
{
  char inputfile[400];
  int* nodes;
  int* arcs;
  int* stop_arcs;
} network_t;

int *arc;
int *node = net->nodes;                               <--- A
void *stop = (void *)net->stop_arcs;                  <--- B
for( arc = net->arcs; arc != (int *)stop; arc++ )     <--- C

GCC generates the following instruction sequence in thumb mode with options
-O2 -Os; it needs 9 instructions to load the 3 fields:

        mov     r2, #200        <--- A1
        lsl     r1, r2, #1      <--- A2
        .loc 1 14 0
        mov     r4, #204        <--- B1
        lsl     r3, r4, #1      <--- B2
        .loc 1 13 0
        ldr     r2, [r0, r1]    <--- A3
        .loc 1 15 0
        mov     r1, #202        <--- C1
        .loc 1 14 0
        ldr     r4, [r0, r3]    <--- B3
        .loc 1 15 0
        lsl     r3, r1, #1      <--- C2
        ldr     r3, [r0, r3]    <--- C3

A better method is to adjust the base address first, to a point nearer to all
3 fields we will access. Then we can use ldr dest, [base, offset] to load each
field with only 1 instruction. Although this opportunity was found on the ARM
target, it should also be applicable to other architectures with a
(base + offset) addressing mode where the offset has a limited value range.

--
Summary: inefficient address calculation of fields of large struct
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314
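The rebasing the report asks for can be illustrated at the C source level. This is a hand-applied sketch, not GCC code; the function names are mine, and the `p[1]`/`p[2]` walk assumes the three pointer members are laid out consecutively (true on typical ABIs):

```c
/* Struct copied from the report above. */
typedef struct network {
    char inputfile[400];
    int *nodes;
    int *arcs;
    int *stop_arcs;
} network_t;

/* As written: each field sits at base + ~400, an offset Thumb-1 cannot
   encode in one ldr, so each access costs several instructions. */
void load_fields(const network_t *net, int **n, int **a, int **s) {
    *n = net->nodes;
    *a = net->arcs;
    *s = net->stop_arcs;
}

/* Rebased: one add forms a base right at the pointer fields, then every
   field is a small offset from it (one ldr each on the target). */
void load_fields_rebased(const network_t *net, int **n, int **a, int **s) {
    int *const *p = &net->nodes;   /* new base: net + 400 */
    *n = p[0];
    *a = p[1];   /* assumes arcs directly follows nodes */
    *s = p[2];   /* assumes stop_arcs directly follows arcs */
}
```

Both functions return the same values; the second form is what the proposed pass would synthesize in RTL.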
[Bug rtl-optimization/40314] inefficient address calculation of fields of large struct
--- Comment #1 from carrot at google dot com 2009-05-31 02:42 --- Created an attachment (id=17940) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17940&action=view) test case to show the opportunity -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314
[Bug rtl-optimization/40314] inefficient address calculation of fields of large struct
--- Comment #2 from carrot at google dot com 2009-05-31 02:51 ---
There are a lot of such opportunities in mcf from SPEC CPU 2006.

One possible implementation is to add a pass before cse. In the new pass it
should detect insn patterns like:

(set r200 400)                      # 400 is offset of field1
(set r201 (mem (plus r100 r200)))   # r100 contains struct base
...
(set r300 404)                      # 404 is offset of field2
(set r301 (mem (plus r100 r300)))   # r100 contains struct base

And rewrite them as:

(set r200 400)                      # keep the original insn
(set r250 (plus r100 400))          # r250 is new base
(set r201 (mem (plus r250 0)))
...
(set r300 404)
(set r251 (plus r100 400))          # r251 contains same value as r250
(set r301 (mem (plus r251 4)))

We can let dce and cse remove the redundant code; the final result should
look like:

(set r101 (plus r100 400))
(set r201 (mem (plus r101 0)))
...
(set r301 (mem (plus r101 4)))

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314
[Bug rtl-optimization/40314] inefficient address calculation of fields of large struct
--- Comment #4 from carrot at google dot com 2009-05-31 08:05 ---
(In reply to comment #3)
> I think we have enough passes already and should try to stuff this in cse.c
> and fwprop.c. See PR middle-end/33699 for related issues.

It looks like that patch solved some similar issues. But there are still
several differences:

1. PR/33699 can only handle constant addresses, while in my case the
addresses are not constants. And I believe the non-constant case (memory
accesses through a pointer) occurs more frequently than constant addresses
(embedded systems only?).

2. That patch is only applicable to a known base address. In my case the base
address of the memory accesses is the pointer to the struct; there is no
known nearby base address, so we need to create a new nearby base address.

3. That patch works on superblocks, but it looks better to optimize the
memory accesses over the whole function body; it is quite common to access
memory through the same pointer in different basic blocks, as shown in mcf.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40314
[Bug rtl-optimization/40327] New: Use less instructions to add some constants to register
Compiling this simple function in thumb mode:

int add_const(int x)
{
  return x + 400;
}

I got:

        mov     r1, #200
        lsl     r3, r1, #1
        add     r0, r0, r3

A better code sequence would be:

        add     r0, r0, #200
        add     r0, r0, #200

In order to apply this optimization, the constant should be less than 2 times
the largest immediate value in the target ISA. So this optimization should
also be useful for other architectures with a limited immediate operand
range. It can also be applied to sub instructions.

--
Summary: Use less instructions to add some constants to register
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40327
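The splitting rule described above can be sketched in C. This is an illustration of the idea, not GCC code; MAX_IMM models the 8-bit immediate of a Thumb-1 add, and the report's bound (constant at most twice the largest immediate) is taken as given:

```c
#include <assert.h>

enum { MAX_IMM = 255 };   /* largest Thumb-1 add/sub immediate (assumption) */

/* Add constant c using two immediate adds instead of a constant-pool load,
   mirroring the two "add r0, r0, #200" instructions suggested above. */
unsigned add_split(unsigned x, unsigned c) {
    assert(c <= 2u * MAX_IMM);             /* rule only applies up to 2*MAX_IMM */
    unsigned first = c > MAX_IMM ? MAX_IMM : c;
    x += first;       /* add x, x, #first */
    x += c - first;   /* add x, x, #(c - first); zero when c <= MAX_IMM */
    return x;
}
```

For c = 400 this emits the 255 + 145 split (any split into two valid immediates, such as 200 + 200, works equally well).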
[Bug target/40375] New: redundant register move with -mthumb
Compile the following code with -mthumb -O2 -Os:

extern void foo(int*, const char*, int);

void test(const char name[], int style)
{
  foo(0, name, style);
}

I got:

        push    {r4, lr}
        mov     r3, r0      // A
        mov     r2, r1      // B
        mov     r0, #0      // C
        mov     r1, r3      // D
        bl      foo
        pop     {r4, pc}

Instructions A and D together move register r0 to r1; they can be replaced
with the single instruction

        mov     r1, r0

placed between B and C.

--
Summary: redundant register move with -mthumb
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375
[Bug target/40375] redundant register move with -mthumb
--- Comment #1 from carrot at google dot com 2009-06-08 03:23 --- Created an attachment (id=17962) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17962&action=view) test case shows the redundant register move This problem occurs quite frequently if both caller and callee have multiple parameters. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375
[Bug target/40375] redundant register move with -mthumb
--- Comment #4 from carrot at google dot com 2009-06-09 03:46 ---
Thank you, Steven.

(In reply to comment #3)
> "might be" is such a useless statement.
>
> Carrot, you are aware of the -fdump-rtl-all and -dAP options, I assume? Then
> you should have no trouble finding out:
> 1) Where the move comes from

This is the rtl dump before RA, everything is in a normal state:

cat obj/reg.c.173r.asmcons

;; Function test (test)

(note 1 0 5 NOTE_INSN_DELETED)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 133 [ name ])
        (reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn}
    (expr_list:REG_DEAD (reg:SI 0 r0 [ name ]) (nil)))
(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 134 [ style ])
        (reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn}
    (expr_list:REG_DEAD (reg:SI 1 r1 [ style ]) (nil)))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
        (const_int 0 [0x0])) 168 {*thumb1_movsi_insn} (nil))
(insn 8 7 9 2 src/./reg.c:4 (set (reg:SI 1 r1)
        (reg/v/f:SI 133 [ name ])) 168 {*thumb1_movsi_insn}
    (expr_list:REG_DEAD (reg/v/f:SI 133 [ name ]) (nil)))
(insn 9 8 10 2 src/./reg.c:4 (set (reg:SI 2 r2)
        (reg/v:SI 134 [ style ])) 168 {*thumb1_movsi_insn}
    (expr_list:REG_DEAD (reg/v:SI 134 [ style ]) (nil)))
(call_insn 10 9 0 2 src/./reg.c:4 (parallel [
            (call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] ) [0 S4 A32])
                (const_int 0 [0x0]))
            (use (const_int 0 [0x0]))
            (clobber (reg:SI 14 lr))
        ]) 256 {*call_insn}
    (expr_list:REG_DEAD (reg:SI 2 r2)
        (expr_list:REG_DEAD (reg:SI 1 r1)
            (expr_list:REG_DEAD (reg:SI 0 r0)
                (nil))))
    (expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
        (expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
            (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
                (nil)))))

Here is the rtl dump after RA, quite straightforward but inefficient:

(note 1 0 5 NOTE_INSN_DELETED)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 3 r3 [orig:133 name ] [133])
        (reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn} (nil))
(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 2 r2 [orig:134 style ] [134])
        (reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn} (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
        (const_int 0 [0x0])) 168 {*thumb1_movsi_insn} (nil))
(insn 8 7 10 2 src/./reg.c:4 (set (reg:SI 1 r1)
        (reg/v/f:SI 3 r3 [orig:133 name ] [133])) 168 {*thumb1_movsi_insn} (nil))
(call_insn 10 8 18 2 src/./reg.c:4 (parallel [
            (call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] ) [0 S4 A32])
                (const_int 0 [0x0]))
            (use (const_int 0 [0x0]))
            (clobber (reg:SI 14 lr))
        ]) 256 {*call_insn}
    (nil)
    (expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
        (expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
            (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
                (nil)))))
(note 18 10 0 NOTE_INSN_DELETED)

> 2) Why postreload (the post-reload CSE pass) does not eliminate the
> redundant move

It seems the post-reload CSE pass can't handle this case, because at
instruction C r0 is killed, so instruction D can't simply reuse r0 there. In
order to make it work for this case we must move instruction D before C
first. As Andrew said, we need to improve scheduling before RA to handle
this.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375
[Bug c++/40382] New: Useless instructions in destructor
Compile the following simple class with -O2 -Os -mthumb -fpic:

class base
{
    virtual ~base();
};

base::~base()
{
}

The destructor of this class should do nothing; just returning is enough. But
gcc generates the following code for the D1 version of the destructor:

        ldr     r3, .L3
        ldr     r1, .L3+4
        add     r3, pc
        ldr     r2, [r3, r1]
        add     r2, r2, #8
        str     r2, [r0]
        bx      lr
.L3:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
        .word   _ZTV4base(GOT)

--
Summary: Useless instructions in destructor
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40382
[Bug c++/40382] Useless instructions in destructor
--- Comment #1 from carrot at google dot com 2009-06-09 07:35 ---
Created an attachment (id=17969)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17969&action=view)
simple class with empty virtual destructor

Some tree dump results:

1. The tree dump of an early stage (partly mangled in this report):

cat test_class.cpp.003t.original

;; Function virtual base::~base() (null)
;; enabled by -tree-original
{
  <_vptr.base = &_ZTV4base + 8) >>> >>;
}
:;
if ((bool) (__in_chrg & 1))
  {
    <>> >>;
  }
return this;

2. The tree dump of a late stage; the reset of the vptr is redundant:

cat test_class.cpp.130t.final_cleanup

;; Function base::~base() (_ZN4baseD2Ev)
base::~base() (struct base * const this)
{
:
  this->_vptr.base = &_ZTV4base[2];
  return this;
}

;; Function virtual base::~base() (_ZN4baseD1Ev)
virtual base::~base() (struct base * const this)
{
:
  this->_vptr.base = &_ZTV4base[2];
  return this;
}

;; Function virtual base::~base() (_ZN4baseD0Ev)
virtual base::~base() (struct base * const this)
{
:
  this->_vptr.base = &_ZTV4base[2];
  operator delete (this);
  return this;
}

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40382
[Bug target/40375] redundant register move with -mthumb
--- Comment #6 from carrot at google dot com 2009-06-09 13:52 ---
(In reply to comment #5)
> Hmm, I was under the impression that postreload-cse could move instructions
> too, but that was just wishful thinking.

I will look into postreload-cse.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40375
[Bug target/40416] New: unnecessary register spill
Compile the attached source code with options -O2 -Os -mthumb -fpic; we get
an unnecessary register spill.

--
Summary: unnecessary register spill
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416
[Bug target/40416] unnecessary register spill
--- Comment #1 from carrot at google dot com 2009-06-11 14:34 ---
Created an attachment (id=17983)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17983&action=view)
test case

The spilling occurs around the first loop:

        push    {r4, r5, r6, r7, lr}
        sub     sp, sp, #12
        .loc 1 5 0
        str     r2, [sp, #4]    // A
        .loc 1 6 0
        add     r6, r1, r2
        mov     r4, r0
        .loc 1 8 0
        b       .L2
.L5:
        .loc 1 10 0
        mov     r7, #0
        ldrsh   r5, [r4, r7]
        .loc 1 12 0
        cmp     r2, r5
        bge     .L3
        .loc 1 14 0
        ldrb    r7, [r1]
        strb    r7, [r1, r2]
        .loc 1 15 0
        strh    r2, [r4]
        .loc 1 16 0
        lsl     r1, r2, #1
        sub     r2, r5, r2
        strh    r2, [r1, r4]
.L6:
        .loc 1 5 0
        ldr     r5, [sp, #4]    // B
        lsl     r4, r5, #1
        add     r0, r0, r4
        b       .L4
.L3:
        .loc 1 19 0
        lsl     r7, r5, #1
        mov     ip, r7
        add     r4, r4, ip
        .loc 1 20 0
        add     r1, r1, r5
        .loc 1 21 0
        sub     r2, r2, r5
.L2:
        .loc 1 8 0
        cmp     r2, #0
        bgt     .L5
        b       .L6
.L4:
        .loc 1 30 0
        mov     r1, #0

The spill occurs at instruction A and the reload at instruction B. The
spilled value is x. The source code computes next_runs and next_alpha before
the while loop and preserves them through the loop body. But the generated
code preserves next_alpha, the original runs, and the original x through the
loop body, and computes next_runs after the loop. This causes an extra
register to be live and results in a register spill.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416
[Bug target/40416] unnecessary register spill
--- Comment #3 from carrot at google dot com 2009-06-15 02:26 ---
Created an attachment (id=17998)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17998&action=view)
preprocessed test case

A possible code sequence without spilling is:

        push    {r4, r5, r6, r7, lr}
        add     r6, r1, r2
        mov     r4, r0
        lsl     r7, r2, #1      // New
        add     r0, r0, r7      // New
        .loc 1 8 0
        b       .L2
.L5:
        .loc 1 10 0
        mov     r7, #0
        ldrsh   r5, [r4, r7]
        .loc 1 12 0
        cmp     r2, r5
        bge     .L3
        .loc 1 14 0
        ldrb    r7, [r1]
        strb    r7, [r1, r2]
        .loc 1 15 0
        strh    r2, [r4]
        .loc 1 16 0
        lsl     r1, r2, #1
        sub     r2, r5, r2
        strh    r2, [r1, r4]
.L6:
        .loc 1 5 0
        b       .L4
.L3:
        .loc 1 19 0
        lsl     r7, r5, #1
        mov     ip, r7
        add     r4, r4, ip
        .loc 1 20 0
        add     r1, r1, r5
        .loc 1 21 0
        sub     r2, r2, r5
.L2:
        .loc 1 8 0
        cmp     r2, #0
        bgt     .L5
        b       .L6
.L4:
        .loc 1 30 0
        mov     r1, #0

--
carrot at google dot com changed:

           What                  |Removed |Added
-------------------------------------------------
Attachment #17983 is obsolete    |0       |1

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416
[Bug target/40416] unnecessary register spill
--- Comment #4 from carrot at google dot com 2009-06-15 02:32 ---
In the source code, only two extra variables, next_runs and next_alpha, need
to be preserved through the while loop. But in the gcc generated code, three
variables are kept live through the first loop: next_alpha, the original
runs, and the original x. The expression (next_runs = runs + x) is moved
after the loop. This causes an extra live variable through the loop and
results in a register spill.

The expression move happens in the tree-ssa-sink pass. Daniel Berlin has
confirmed it is a bug in this pass. From Daniel:

*** This looks like a bug, i think i know what causes it. When I wrote this
pass, i forgot to make this check:

  /* It doesn't make sense to move to a dominator that post-dominates
     frombb, because it means we've just moved it into a path that always
     executes if frombb executes, instead of reducing the number of
     executions. */
  if (dominated_by_p (CDI_POST_DOMINATORS, frombb, commondom))

happen regardless of whether it is a single use statement or not. So it will
sink single use statements even if it's just moving them to places that
aren't executed less frequently. Add that check (changing commondom to
sinkbb) and it should stop moving it. *** End From Daniel

I will send the patch later.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416
[Bug target/40457] New: use stm and ldm to access consecutive memory words
Current gcc can't make use of stm and ldm to reduce code size.

--
Summary: use stm and ldm to access consecutive memory words
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40457
[Bug target/40457] use stm and ldm to access consecutive memory words
--- Comment #1 from carrot at google dot com 2009-06-16 09:11 ---
Created an attachment (id=18005)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18005&action=view)
test case

For this function

void foo(int* p)
{
  p[0] = 1;
  p[1] = 2;
}

gcc generates:

        mov     r1, #1
        mov     r3, #2
        str     r1, [r0]
        str     r3, [r0, #4]
        bx      lr

We could use one stm instruction to replace the two str instructions.

For the second case:

int bar(int* p)
{
  int x = p[0] + p[1];
  return x;
}

gcc generates:

        ldr     r2, [r0, #4]
        ldr     r3, [r0]
        add     r0, r2, r3
        bx      lr

In this case we can use one ldm to replace the two ldr instructions.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40457
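The semantics the report relies on can be modeled in C. This is an illustrative sketch of what an ARM "stmia rN!, {regs}" store-multiple does, not GCC code; `stmia_model` and `foo_stm` are names I made up:

```c
#include <stdint.h>

/* Model of "stmia base!, {r_list}": store the register values in ascending
   register-number order to consecutive words, then write back the base. */
uint32_t *stmia_model(uint32_t *base, const uint32_t *regs, int n) {
    for (int i = 0; i < n; i++)
        *base++ = regs[i];   /* one word per register, ascending addresses */
    return base;             /* updated base (the "!" writeback) */
}

/* The report's foo(): with 1 and 2 in ascending-numbered registers, the two
   str instructions collapse into a single stm of both values. */
void foo_stm(int *p) {
    uint32_t vals[2] = { 1, 2 };
    stmia_model((uint32_t *)p, vals, 2);   /* assumes 32-bit int on target */
}
```

This also shows why ascending register numbers matter: the instruction always stores the lowest-numbered register to the lowest address, which is the renaming constraint raised in comment #7 below could address.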
[Bug target/40457] use stm and ldm to access consecutive memory words
--- Comment #7 from carrot at google dot com 2009-06-17 09:30 ---
My command line options are -O2 -Os -mthumb.

The compiler didn't run into load_multiple_sequence and
store_multiple_sequence. The peephole rules specify that they apply to
TARGET_ARM only. Is there any special reason we didn't enable them in thumb
mode?

As for the ascending register numbers, do we have any code to rename a set of
registers to make them ascending? In the generated code for the second
function, the register numbers are in the opposite order to the memory
offsets:

        ldr     r2, [r0, #4]
        ldr     r3, [r0]

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40457
[Bug target/40482] New: shift a small constant to get larger one
One example is 0xff000000; we can get it by

        mov     r1, #255
        lsl     r1, r1, #24

Gcc generates the following code instead:

        ldr     r1, .L2
        ...
.L2:
        .word   -16777216

--
Summary: shift a small constant to get larger one
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40482
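The class of constants this applies to is easy to characterize: anything that is an 8-bit value shifted left. A sketch (my own, not GCC's) of the test a backend could use before falling back to a literal-pool load:

```c
#include <stdint.h>

/* A constant is loadable with "mov rX, #imm8; lsl rX, rX, #s" exactly when
   it is an 8-bit immediate shifted left. Returns nonzero and fills in
   imm8/shift on success; 0 means a literal-pool ldr is still needed. */
int shiftable_imm(uint32_t c, uint32_t *imm8, uint32_t *shift) {
    for (uint32_t s = 0; s < 32; s++) {
        if ((c >> s) <= 0xFF && ((c >> s) << s) == c) {
            *imm8 = c >> s;   /* fits the mov immediate */
            *shift = s;       /* the lsl amount (0 means plain mov) */
            return 1;
        }
    }
    return 0;
}
```

For the report's example, 0xff000000 decomposes as 0xFF << 24, matching the suggested mov/lsl pair.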
[Bug target/40482] shift a small constant to get larger one
--- Comment #1 from carrot at google dot com 2009-06-18 07:34 --- Created an attachment (id=18018) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18018&action=view) test case command line option is -O2 -Os -mthumb -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40482
[Bug target/40499] New: [missed optimization] branch to return
If the function epilogue consists of only one return instruction, then a
branch to the return can be replaced by the return instruction directly.

--
Summary: [missed optimization] branch to return
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40499
[Bug target/40499] [missed optimization] branch to return
--- Comment #1 from carrot at google dot com 2009-06-20 03:56 ---
Created an attachment (id=18027)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18027&action=view)
test case

The command line options are: -march=armv5te -mthumb -Os

At the end of the function we can see:

        b       .L3     // This one can be replaced by pop {pc}
.L5:
        mov     r0, #1
.L3:
        @ sp needed for prologue
        pop     {pc}

With option -O2 we get a similar result.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40499
[Bug target/40499] [missed optimization] branch to return not threaded on thumb
--- Comment #4 from carrot at google dot com 2009-06-22 08:00 ---
Sorry, I didn't make it clear. It is a performance issue, not a code size
issue. If the epilogue is a simple return instruction, the branch to the
return can be replaced by the return instruction itself. So we execute one
less instruction at run time without any code size penalty.

It looks like the code at function.c:5078 can't be applied to thumb.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40499
[Bug target/40525] New: missed optimization in conditional expression
For a simple conditional expression like (flag == 1 ? 2 : 0), gcc generates
suboptimal code in terms of both code size and performance.

--
Summary: missed optimization in conditional expression
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40525
[Bug target/40525] missed optimization in conditional expression
--- Comment #2 from carrot at google dot com 2009-06-23 09:09 ---
Created an attachment (id=18053)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18053&action=view)
test case

Compile the attached code with options -mthumb -march=armv5te -Os; gcc
generates:

        push    {lr}
        cmp     r1, #1
        bne     .L3
        mov     r3, #2
        b       .L2
.L3:
        mov     r3, #0
.L2:
        add     r0, r3, r0
        pop     {pc}

A better code sequence would be:

        push    {lr}
        mov     r3, #0
        cmp     r1, #1
        bne     .L3
        mov     r3, #2
.L3:
        add     r0, r3, r0
        pop     {pc}

With this optimization we save one instruction. For both the equal and
not-equal cases, the number of executed instructions is the same as before,
but in the equal case one branch instruction is replaced by a move
instruction, so it is also a win for performance. In which pass should this
optimization be done? The jump pass or the bb-reorder pass?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40525
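The rewrite suggested above has a direct source-level analogue: initialize the result unconditionally so the not-taken arm needs no code of its own. A sketch with made-up function names, both forms computing the expression from the report:

```c
/* As the source is written: a two-way branch, costing an extra "b .L2"
   to jump over the else arm. */
int cond_orig(int x, int flag) {
    int t;
    if (flag == 1) t = 2; else t = 0;
    return t + x;
}

/* Hoisted form: the "mov r3, #0" runs unconditionally before the compare,
   so only the flag == 1 path needs a branch target. */
int cond_hoisted(int x, int flag) {
    int t = 0;
    if (flag == 1)
        t = 2;
    return t + x;
}
```

The two are trivially equivalent for every input; the question in the comment is which RTL pass should perform this restructuring.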
[Bug target/40416] unnecessary register spill
--- Comment #6 from carrot at google dot com 2009-06-30 07:42 ---
http://gcc.gnu.org/ml/gcc-cvs/2009-06/msg01067.html

--
carrot at google dot com changed:

           What      |Removed |Added
-------------------------------------
           Status    |NEW     |RESOLVED
           Resolution|        |FIXED

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40416
[Bug target/40603] New: unnecessary conversion from unsigned byte load to signed byte load
Compile the following function with options -Os -mthumb -march=armv5te:

int ldrb(unsigned char* p)
{
  if (p[8] <= 0x7F)
    return 2;
  else
    return 5;
}

Gcc generates the following code:

        push    {lr}
        mov     r3, #8
        ldrsb   r3, [r0, r3]
        mov     r0, #2
        cmp     r3, #0
        bge     .L2
        mov     r0, #5
.L2:
        @ sp needed for prologue
        pop     {pc}

The source code "if (p[8] <= 0x7F)" is translated to:

        mov     r3, #8
        ldrsb   r3, [r0, r3]
        cmp     r3, #0

A better code sequence would be:

        ldrb    r3, [r0, #8]
        cmp     r3, #0x7F

This saves one instruction. The tree dump shows that in a very early pass
(ldrb.c.003t.original) the comparison was transformed to

  if ((signed char) *(p + 8) >= 0)

I guess gcc thinks comparing with 0 is much cheaper than comparing with other
numbers. Am I right? Unfortunately, in thumb mode loading a signed byte costs
more than loading an unsigned byte, and comparing with 0 has the same cost as
comparing with 0x7F.

--
Summary: unnecessary conversion from unsigned byte load to signed byte load
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40603
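The front end's rewrite is at least semantically sound, which can be checked exhaustively. A small sketch (helper names are mine) of the two predicates the transformation treats as equal, assuming the usual 8-bit two's-complement signed char:

```c
/* The source as written: unsigned byte compared against 0x7F. */
int cmp_unsigned(unsigned char b) { return b <= 0x7F; }

/* The form after the 003t.original transformation: reinterpret the byte
   as signed and compare against zero (sign-bit test). */
int cmp_signed(unsigned char b) { return (signed char) b >= 0; }
```

Both agree for all 256 byte values; the complaint in the report is purely about instruction cost on Thumb-1, not correctness.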
[Bug target/40603] unnecessary conversion from unsigned byte load to signed byte load
--- Comment #1 from carrot at google dot com 2009-07-01 06:56 --- Created an attachment (id=18105) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18105&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40603
[Bug target/40603] unnecessary conversion from unsigned byte load to signed byte load
--- Comment #3 from carrot at google dot com 2009-07-01 10:24 ---
(In reply to comment #2)
> Subject: Re: New: unnecessary conversion from unsigned byte load to signed
> byte load
>
> > Unfortunately in thumb mode, loading a signed byte costs more than
> > loading an unsigned byte and comparing with 0 has same cost as comparing
> > with 0x7F.
>
> I don't know of any core where loading a signed byte is more expensive
> than unsigned byte in thumb mode. What did you have in mind?
>
> I suspect what you mean is that the sign extension here is not required
> and we could get away with ldrb.

In thumb1, the ldrb instruction has an addressing mode of Rn + imm5, but
ldrsb has only the Rn + Rm addressing mode. So loading an unsigned byte from
p[8] needs only one instruction:

        ldrb    r3, [r0, #8]

But loading a signed byte from p[8] needs two instructions:

        mov     r3, #8
        ldrsb   r3, [r0, r3]

So in this case (base + constant offset), loading a signed byte is more
expensive than an unsigned byte in thumb mode.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40603
[Bug target/40615] New: unnecessary CSE
Compile the attached source code with options -march=armv5te -mthumb -Os
-fno-exceptions; gcc generates:

        push    {r4, lr}
        sub     sp, sp, #8
        add     r4, sp, #4      // redundant
        mov     r0, r4          // add r0, sp, #4
        bl      _ZN1XC1Ev
        mov     r0, r4          // add r0, sp, #4
        bl      _Z3barP1X
        mov     r0, r4          // add r0, sp, #4
        bl      _ZN1XD1Ev
        add     sp, sp, #8
        @ sp needed for prologue
        pop     {r4, pc}

As noted in the comments, the cse of (sp + 4) into r4 is redundant here: we
can recompute the value of (sp + 4) each time we want it. With this method we
save one instruction.

--
Summary: unnecessary CSE
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40615
[Bug target/40615] unnecessary CSE
--- Comment #1 from carrot at google dot com 2009-07-02 07:39 --- Created an attachment (id=18120) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18120&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40615
[Bug target/40657] New: allocate local variables with fewer instructions
Compile the following code with options -Os -mthumb -march=armv5te:

extern void bar(int*);

int foo()
{
  int x;
  bar(&x);
  return x;
}

Gcc generates:

        push    {lr}
        sub     sp, sp, #12
        add     r0, sp, #4
        bl      bar
        ldr     r0, [sp, #4]
        add     sp, sp, #12
        @ sp needed for prologue
        pop     {pc}

A better code sequence could be:

        push    {r1-r3, lr}
        add     r0, sp, #4
        bl      bar
        ldr     r0, [sp, #4]
        @ sp needed for prologue
        pop     {r1-r3, pc}

The local variable allocation and deallocation can be merged into the
push/pop instructions, so we avoid the extra sub/add instructions and save
two instructions.

--
Summary: allocate local variables with fewer instructions
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657
[Bug target/40657] allocate local variables with fewer instructions
--- Comment #1 from carrot at google dot com 2009-07-06 08:16 --- Created an attachment (id=18140) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18140&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657
[Bug target/40657] allocate local variables with fewer instructions
--- Comment #5 from carrot at google dot com 2009-07-07 06:44 ---
Could we do the optimization in function thumb1_expand_prologue? If we find
this opportunity in thumb1_expand_prologue, we can remove the sp
manipulations from the prologue and epilogue. We should also add the extra
registers to the push/pop operands.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40657
[Bug target/40670] New: Load floating point constant 0 directly
Compile the following function with options -Os -mthumb -march=armv5te:

float return_zero()
{
  return 0;
}

Gcc generates:

        ldr     r0, .L2
        bx      lr
.L3:
        .align  2
.L2:
        .word   0

The bit pattern of floating-point 0 is the same as integer 0, so the function
body can be simplified to:

        mov     r0, #0
        bx      lr

Now we can remove the memory load and the constant pool. The resulting code
is smaller and faster. The memory load and constant pool are expanded in the
machine_reorg pass.

--
Summary: Load floating point constant 0 directly
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40670
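The key fact the report relies on — that IEEE-754 single-precision +0.0f is the all-zero bit pattern — can be verified directly. A small sketch (`float_bits` is my own helper), assuming 32-bit float and a conforming IEEE-754 representation:

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a float. memcpy is the well-defined way
   to inspect the bits without violating aliasing rules. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Since `float_bits(0.0f)` is 0, a soft-float ABI returning the value in r0 can use `mov r0, #0` with no constant pool, exactly as the report suggests.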
[Bug target/40670] Load floating point constant 0 directly
--- Comment #1 from carrot at google dot com 2009-07-07 09:38 --- Created an attachment (id=18149) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18149&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40670
[Bug target/40680] New: extra register move
Compile the attached source code with options -Os -mthumb -march=armv5te; gcc
generates:

        push    {r3, r4, r5, lr}
.LCFI0:
        mov     r4, r0
        ldr     r0, [r0]
        bl      _Z3foof
        ldr     r1, [r4, #4]
        @ sp needed for prologue
        add     r5, r0, #0
        bl      _Z3barfi
        mov     r0, r5          // *
        bl      _Z3fffi         // *
        mov     r4, r5          // *
        mov     r5, r0          // *
        mov     r0, r4          // *
        bl      _Z3fffi         // *
        mov     r1, r0          // *
        mov     r0, r5          // *
        bl      _Z3setii
        pop     {r3, r4, r5, pc}

There is an obvious extra register move (mov r4, r5) in the marked section. A
better code sequence for the marked section could be:

        mov     r0, r5
        bl      _Z3fffi
        mov     r4, r0
        mov     r0, r5
        bl      _Z3fffi
        mov     r1, r0
        mov     r0, r4

The marked code sequence before the scheduler is:

        mov     r4, r5
        mov     r0, r5
        bl      _Z3fffi
        mov     r5, r0
        mov     r0, r4
        bl      _Z3fffi
        mov     r1, r0
        mov     r0, r5

The instruction (mov r4, r5) is generated by the register allocator. I don't
know why RA generates this instruction.

--
Summary: extra register move
Product: gcc
Version: 4.5.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: carrot at google dot com
GCC build triplet: i686-linux
GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40680
[Bug target/40680] extra register move
--- Comment #1 from carrot at google dot com 2009-07-08 09:36 --- Created an attachment (id=18155) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18155&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40680
[Bug target/40697] New: inefficient code to extract least bits from an integer value
Compile the following function with options -Os -mthumb -march=armv5te:

unsigned get_least_bits(unsigned value) { return value << 9 >> 9; }

GCC generates:

        ldr     r3, .L2
        @ sp needed for prologue
        and     r0, r0, r3
        bx      lr
.L3:
        .align  2
.L2:
        .word   8388607

A better code sequence would be:

        lsl     r0, 9
        lsr     r0, 9
        bx      lr

It is smaller (no constant pool) and faster. This transformation is done very early; we can see it in the first tree dump, shift.c.003t.original. GCC thinks an AND with a constant is cheaper than two shifts. That is not true for this case in the Thumb ISA. On the other hand, if the constant being ANDed is small, such as 7, the AND is definitely cheaper than two shifts. So which method is better depends highly on both the constant and the target ISA. It is difficult to make a correct decision at the tree level. Maybe we should define a peephole rule to do it. -- Summary: inefficient code to extract least bits from an integer value Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40697
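For reference, the two forms really are equivalent on 32-bit unsigned values; a small sketch with our own helper names:

```c
#include <assert.h>
#include <stdint.h>

/* (value << 9) >> 9 on a 32-bit unsigned clears the top 9 bits,
   exactly like masking with 8388607 (0x7FFFFF). */
static uint32_t low_bits_via_shifts(uint32_t value) { return (value << 9) >> 9; }
static uint32_t low_bits_via_mask(uint32_t value)   { return value & 8388607u; }
```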
[Bug target/40697] inefficient code to extract least bits from an integer value
--- Comment #1 from carrot at google dot com 2009-07-09 09:24 --- Created an attachment (id=18166) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18166&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40697
[Bug target/40730] New: redundant memory load
Compile the attached source code with options -Os -mthumb -march=armv5te -fno-strict-aliasing; GCC generates:

iterate:
        push    {lr}
        ldr     r3, [r1]       // C
        b       .L5
.L4:
        ldr     r3, [r3, #8]   // D
.L5:
        str     r3, [r0]       // A
        ldr     r3, [r0]       // B
        cmp     r3, #0
        beq     .L3
        ldr     r2, [r3, #4]
        cmp     r2, #0
        beq     .L4
.L3:
        str     r3, [r0, #12]
        @ sp needed for prologue
        pop     {pc}

Pay attention to the instructions marked A and B. Instruction A stores r3 to [r0], but insn B loads it right back into r3. There were originally two copies of the store, one after instruction C and one after instruction D. After register allocation they were allocated the same registers and looked exactly the same. In the csa pass, cleanup_cfg was called; it found the identical instructions and moved them before instruction B. Now instruction B is obviously redundant. Is it OK to remove this kind of redundant code in the dce pass? -- Summary: redundant memory load Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730
[Bug target/40730] redundant memory load
--- Comment #1 from carrot at google dot com 2009-07-13 08:58 --- Created an attachment (id=18183) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18183&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730
[Bug target/40741] New: code size explosion for integer comparison
Compile the following function with options -Os -mthumb -march=armv5te:

int returnbool(int a, int b) { if (a < b) return 1; return 0; }

GCC 4.5 generates:

        lsr     r3, r1, #31
        asr     r2, r0, #31
        cmp     r0, r1
        adc     r2, r2, r3
        mov     r0, r2
        mov     r3, #1
        eor     r0, r0, r3
        @ sp needed for prologue
        bx      lr

while GCC 4.3.1 generates:

        push    {lr}
        mov     r3, #1
        cmp     r0, r1
        blt     .L2
        mov     r3, #0
.L2:
        mov     r0, r3
        @ sp needed for prologue
        pop     {pc}

Counting only the instructions that do the comparison, it is 7 vs 4. I don't know whether replacing one branch instruction with 4 ALU instructions is faster, but it is definitely a regression for code size. The long code sequence is generated by the expand pass. -- Summary: code size explosion for integer comparison Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40741
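The GCC 4.5 sequence is a branchless computation of a < b: it adds the sign bits of a and b to the carry flag produced by cmp (which is set when (unsigned)a >= (unsigned)b) and then inverts the low bit. A C sketch of the same arithmetic, assuming arithmetic right shift of signed ints as GCC provides on ARM (the function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Branchless a < b, mirroring the lsr/asr/cmp/adc/eor sequence. */
static int branchless_lt(int32_t a, int32_t b) {
    int32_t sign_a = a >> 31;                      /* 0 or -1 (asr)        */
    int32_t sign_b = (int32_t)((uint32_t)b >> 31); /* 0 or 1  (lsr)        */
    int32_t carry  = (uint32_t)a >= (uint32_t)b;   /* carry flag after cmp */
    return (sign_a + sign_b + carry) ^ 1;
}
```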
[Bug target/40741] code size explosion for integer comparison
--- Comment #1 from carrot at google dot com 2009-07-14 08:41 --- Created an attachment (id=18191) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18191&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40741
[Bug target/40730] redundant memory load
--- Comment #4 from carrot at google dot com 2009-07-14 09:14 --- At the tree level, the two stores are different statements. Only after register allocation do the two stores get the same register, making the load redundant. try_crossjump_bb tries to find an identical instruction sequence in all predecessors of a basic block bb and moves that code sequence to the head of bb. It is triggered by this function, and the store is moved just before the load. I tried -fgcse-las but it couldn't do the work. (In reply to comment #2) > -fgcse-las should do the trick. Note that PRE would do this kind of > optimization on the tree-level, but it is disabled with -Os (so is gcse). > > : > D.1614_2 = p2_1(D)->front; > p1_3(D)->head = D.1614_2; > goto ; > > : > D.1616_8 = D.1615_4->next; > p1_3(D)->head = D.1616_8; > > : > D.1615_4 = p1_3(D)->head; > -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730
[Bug target/40730] redundant memory load
--- Comment #7 from carrot at google dot com 2009-07-15 08:07 --- (In reply to comment #6) > Carrot, can you please try this test case with my patch > "crossjump_abstract.diff" from Bug 20070 applied? > I tried your patch. It did remove the redundant memory load. Following is the output:

        push    {lr}
        ldr     r3, [r1]
.L6:
        str     r3, [r0]
        mov     r2, r3        // M
        cmp     r3, #0
        bne     .L5
        b       .L3
.L4:
        ldr     r3, [r3, #8]
        b       .L6
.L5:
        ldr     r1, [r3, #4]
        cmp     r1, #0
        beq     .L4
.L3:
        str     r2, [r0, #12]
        @ sp needed for prologue
        pop     {pc}

In the ifcvt pass it noticed that the only difference between the two stores was the pseudo register number, and since there is no conflict between the two pseudo registers, it renamed one to match the other and did the basic block cross jump on them earlier. Then pass iterate.c.161r.cse2 detected the redundant load and removed it. But it introduced another redundant move instruction, marked M. At the place where r2 is used, r3 still contains the same value as r2, so we could also use r3 there. I think this is another problem. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40730
[Bug target/40783] New: inefficient code to accumulate function return values
Compile the following code with options -Os -mthumb -march=armv5te:

union FloatIntUnion {
    float fFloat;
    int fSignBitInt;
};

static inline float fast_inc(float x) {
    union FloatIntUnion data;
    data.fFloat = x;
    data.fSignBitInt += 1;
    return data.fFloat;
}

extern int MyConvert(float);
extern float dumm();

int time_math() {
    int i;
    int sum = 0;
    const int repeat = 100;
    float f;
    f = dumm();
    for (i = repeat - 1; i >= 0; --i) {
        sum += (int)f; f = fast_inc(f);
        sum += (int)f; f = fast_inc(f);
        sum += (int)f; f = fast_inc(f);
        sum += (int)f; f = fast_inc(f);
    }
    f = dumm();
    for (i = repeat - 1; i >= 0; --i) {
        sum += MyConvert(f); f = fast_inc(f);
        sum += MyConvert(f); f = fast_inc(f);
        sum += MyConvert(f); f = fast_inc(f);
    }
    return sum;
}

GCC generates:

        push    {r4, r5, r6, r7, lr}
        sub     sp, sp, #12
        bl      dumm
        mov     r4, #0
        mov     r6, #99
        add     r5, r0, #0
.L2:
        add     r0, r5, #0
        bl      __aeabi_f2iz
        add     r5, r5, #1
        add     r4, r0, r4
        add     r0, r5, #0
        bl      __aeabi_f2iz
        add     r5, r5, #1
        add     r4, r4, r0
        add     r0, r5, #0
        bl      __aeabi_f2iz
        add     r5, r5, #1
        add     r4, r4, r0
        add     r0, r5, #0
        bl      __aeabi_f2iz
        add     r5, r5, #1
        add     r4, r4, r0
        sub     r6, r6, #1
        bcs     .L2
        bl      dumm
        mov     r6, #99
        add     r5, r0, #0
.L3:
        add     r0, r5, #0
        bl      MyConvert
        add     r5, r5, #1
        str     r0, [sp, #4]
        add     r0, r5, #0
        bl      MyConvert
        add     r5, r5, #1
        mov     r7, r0
        add     r0, r5, #0
        bl      MyConvert
        ldr     r3, [sp, #4]
        add     r5, r5, #1
        add     r7, r7, r3
        add     r7, r7, r0
        add     r4, r4, r7
        sub     r6, r6, #1
        bcs     .L3
        add     sp, sp, #12
        mov     r0, r4
        @ sp needed for prologue
        pop     {r4, r5, r6, r7, pc}

The source code contains two similar loops, but the generated code is quite different. The code for the first loop is as expected: after each call, the returned value is accumulated immediately. The code for the second loop is much worse: after each call, it saves the returned value somewhere else, and only after all the calls in one loop iteration does it accumulate the saved results together. The code for the second loop is larger and slower, and even causes a register spill.
The intermediate representation patterns for the two loops started to diverge from pass float2int.c.078t.reassoc1. I don't know why gcc performs different transforms on the two loops in this pass. -- Summary: inefficient code to accumulate function return values Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40783
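The fast_inc trick in the test case relies on the fact that, for positive finite floats, incrementing the integer view of the bits steps to the next representable float. A standalone sketch of the same idiom, copied from the report:

```c
#include <assert.h>

/* Bump the bit pattern of a float by one; for positive finite x this
   yields the next representable value. */
union FloatIntUnion { float fFloat; int fSignBitInt; };

static float fast_inc(float x) {
    union FloatIntUnion data;
    data.fFloat = x;
    data.fSignBitInt += 1;
    return data.fFloat;
}
```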
[Bug target/40783] inefficient code to accumulate function return values
--- Comment #1 from carrot at google dot com 2009-07-17 06:56 --- Created an attachment (id=18212) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18212&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40783
[Bug target/40815] New: redundant neg instruction caused by loop-invariant
Compile the following function with options -Os -mthumb -march=armv5te:

void bar(char*, char*, int);
void foo(char* left, char* rite, int element)
{
    while (left <= rite) {
        rite -= element;
        bar(left, rite, element);
        left += element;
    }
}

GCC generates:

        push    {r3, r4, r5, r6, r7, lr}
        mov     r5, r0
        mov     r6, r1
        mov     r7, r2
        neg     r4, r2        // A
        b       .L2
.L3:
        add     r6, r6, r4    // B
        mov     r0, r5
        mov     r1, r6
        mov     r2, r7
        bl      bar
        add     r5, r5, r7
.L2:
        cmp     r5, r6
        bls     .L3
        @ sp needed for prologue
        pop     {r3, r4, r5, r6, r7, pc}

Note that instruction A computes (r4 = -r2), and r4 is only used by instruction B (r6 = r6 + r4). This can be simplified to (r6 = r6 - r7; r7 contains the original r2), saving one instruction. The expression rite -= element was transformed by the gimplify pass into:

    element.0 = (unsigned int) element;
    D.2003 = -element.0;
    rite = rite + D.2003;

This form is kept until pass neg.c.156r.loop2_invariant, where the expression -element is identified as a loop invariant and hoisted out of the loop, causing the current result. Is this gimplify transform intended? Do we have any chance to merge the expressions back into (rite = rite - element) before the loop invariant pass? -- Summary: redundant neg instruction caused by loop-invariant Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40815
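The gimplified form is arithmetically equivalent to the original subtraction: in 32-bit modular arithmetic, x - e equals x + (unsigned)(-e). A small sketch with our own function names:

```c
#include <assert.h>
#include <stdint.h>

/* "rite -= element" and the gimplified "rite + (unsigned)(-element)"
   produce the same 32-bit result for every element value. */
static uint32_t sub_direct(uint32_t x, int32_t e)  { return x - (uint32_t)e; }
static uint32_t sub_via_neg(uint32_t x, int32_t e) { return x + (0u - (uint32_t)e); }
```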
[Bug target/40815] redundant neg instruction caused by loop-invariant
--- Comment #1 from carrot at google dot com 2009-07-21 07:15 --- Created an attachment (id=18234) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18234&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40815
[Bug target/40815] redundant neg instruction caused by loop-invariant
--- Comment #3 from carrot at google dot com 2009-07-21 07:35 --- Created an attachment (id=18235) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18235&action=view) dump of -fdump-rtl-expand-details -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40815
[Bug target/40835] New: redundant comparison instruction
Compile the following code with options -Os -mthumb -march=armv5te:

int bar();
void goo(int, int);
void foo()
{
    int v = bar();
    if (v == 0)
        return;
    goo(1, v);
}

GCC generates:

        push    {r3, lr}
        bl      bar
        mov     r1, r0
        cmp     r0, #0    // *
        beq     .L1
        mov     r0, #1
        bl      goo
.L1:
        @ sp needed for prologue
        pop     {r3, pc}

The compare instruction is redundant: the previous move instruction has already set the condition codes according to the value of r0. -- Summary: redundant comparison instruction Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835
[Bug target/40835] redundant comparison instruction
--- Comment #1 from carrot at google dot com 2009-07-23 08:38 --- Created an attachment (id=18241) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18241&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835
[Bug target/40835] redundant comparison instruction
--- Comment #2 from carrot at google dot com 2009-07-24 02:11 --- It seems HAVE_cc0 is disabled for arm. What's the reason behind that? A simple method would be to add a peephole rule to handle it. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835
[Bug target/40835] redundant comparison instruction
--- Comment #4 from carrot at google dot com 2009-07-24 07:37 --- As I've found, HAVE_cc0 is disabled, and cse_condition_code_reg does nothing for the thumb target. I also found that a conditional branch instruction is always in the same insn pattern as the preceding compare instruction, so I wonder whether there is any way to express the optimized sequence (movs followed by bcc) at all. Are there any other places I should take a look at? -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40835
[Bug target/40900] New: redundant sign extend of short function returned value
Compile the following code with options -Os -mthumb -march=armv5te:

extern short shortv2();
short shortv1()
{
    return shortv2();
}

GCC generates:

        push    {r3, lr}
        bl      shortv2
        lsl     r0, r0, #16    // A
        asr     r0, r0, #16    // B
        pop     {r3, pc}

The returned value in register r0 is already a sign-extended short value, but instructions A and B sign extend it again, so these two instructions are redundant. -- Summary: redundant sign extend of short function returned value Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40900
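The lsl #16 / asr #16 pair implements sign extension from 16 to 32 bits, which is a no-op on a value that is already a sign-extended short. A sketch of the operation (assuming arithmetic right shift of signed ints, as GCC provides on ARM; the function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x, mirroring lsl #16 / asr #16. */
static int32_t sext16(int32_t x) {
    return (int32_t)((uint32_t)x << 16) >> 16;
}
```

Applying it to a value already in short range returns the value unchanged, which is why the two instructions after the call are redundant.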
[Bug target/40900] redundant sign extend of short function returned value
--- Comment #1 from carrot at google dot com 2009-07-29 08:57 --- Created an attachment (id=18266) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18266&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40900
[Bug target/40956] New: GCSE opportunity in if statement
Compile the following function with options -Os -mthumb -march=armv5te -frename-registers:

int foo(int p, int* q)
{
    if (p != 9)
        *q = 0;
    else
        *(q+1) = 0;
    return 3;
}

GCC generates:

        push    {lr}
        cmp     r0, #9        // D
        beq     .L2
        mov     r3, #0        // A
        str     r3, [r1]
        b       .L3
.L2:
        mov     r0, #0        // B
        str     r0, [r1, #4]  // C
.L3:
        mov     r0, #3
        pop     {pc}

If we replace r0 with r3 in instructions B and C, then A and B become identical, so the common instruction can be moved before instruction D, saving one instruction. Is this a gcse opportunity? -- Summary: GCSE opportunity in if statement Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40956
[Bug target/40956] GCSE opportunity in if statement
--- Comment #1 from carrot at google dot com 2009-08-03 22:55 --- Created an attachment (id=18294) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18294&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40956
[Bug target/41004] New: missed merge of basic blocks
Compile the attached source code with options -Os -march=armv5te -mthumb. GCC generates the following code snippet:

        ...
        cmp     r0, r2
        bne     .L5
        b       .L15        <--- A
.L9:
        ldr     r3, [r1]
        cmp     r3, #0
        beq     .L7
        str     r0, [r1, #8]
        b       .L8
.L7:
        str     r3, [r1, #8]
.L8:
        ldr     r1, [r1, #4]
        b       .L12        <--- C
.L15:
        mov     r0, #1      <--- B
.L12:
        cmp     r1, r2      <--- D
        bne     .L9
        ...

Instruction A jumps to B, which then falls through to D; instruction C jumps to D. There are no other instructions jumping to instruction B, so we can move inst B just before A; then A jumps directly to D, and C can be removed. Two functions could potentially do this optimization: merge_blocks_move and try_forward_edges. try_forward_edges can only redirect a series of forwarder blocks; it can't move the target block before the forwarder blocks. merge_blocks_move merges blocks b and c only when neither is a forwarder block; in this case block A is a forwarder block, so they are not merged. -- Summary: missed merge of basic blocks Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004
[Bug target/41004] missed merge of basic blocks
--- Comment #1 from carrot at google dot com 2009-08-08 00:10 --- Created an attachment (id=18326) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18326&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004
[Bug c/39989] New: [optimization]
Compiling this code snippet with gcc for arm,

typedef struct node node_t;
typedef struct node *node_p;

struct node {
    int orientation;
    node_p pred;
    long depth;
};

node_t *primal_iminus(long *delta, node_t *iplus, node_t *jplus)
{
    node_t *iminus = 0;
    if( iplus->depth < jplus->depth ) {
        if( iplus->orientation )
            iminus = iplus;
        iplus = iplus->pred;
    }
    return iminus;
}

I got:

        .save   {lr}
        push    {lr}
.LCFI0:
.LVL0:
        .loc 1 13 0
        ldr     r0, [r1, #8]
.LVL1:
        ldr     r3, [r2, #8]
        cmp     r0, r3
        bge     .L2
        .loc 1 15 0
        ldr     r2, [r1]
.LVL2:
        cmp     r2, #0
        beq     .L2
        mov     r0, r1
.LVL3:
        b       .L3
.LVL4:
.L2:
        mov     r0, #0
.LVL5:
.L3:
.LVL6:
        .loc 1 20 0
        @ sp needed for prologue
        pop     {pc}

Here lr is still live at the exit of the function, so we could simply return with bx lr and avoid the prologue instruction push {lr}. The options I used are: -fno-exceptions -Wno-multichar -march=armv5te -mtune=xscale -msoft-float -fpic -mthumb-interwork -ffunction-sections -funwind-tables -fstack-protector -fno-short-enums -D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ -fmessage-length=0 -W -Wall -Wno-unused -DSK_RELEASE -DNDEBUG -g -Wstrict-aliasing=2 -fgcse-after-reload -frerun-cse-after-loop -frename-registers -DNDEBUG -UDEBUG -MD -O2 -Os -mthumb -fomit-frame-pointer -fno-strict-aliasing -finline-limit=64 -finline-functions -fno-inline-functions-called-once -- Summary: [optimization] Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39989
[Bug target/39989] No need to save LR in some cases
--- Comment #1 from carrot at google dot com 2009-05-01 06:12 --- Created an attachment (id=17787) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=17787&action=view) sample code showing the optimization opportunity -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39989
[Bug target/39989] No need to save LR in some cases
--- Comment #2 from carrot at google dot com 2009-05-01 06:21 --- Actually gcc has already implemented this optimization, but it doesn't work for this case. Reload pass tries to determine the stack frame, so it needs to check the push/pop lr optimization opportunity. One of the criteria is if there is any far jump inside the function. Unfortunately at this time gcc can't decide each instruction's length and basic block layout, so it can't know the offset of a jump. To be conservative it assumes every jump is a far jump. So any jump in a function will prevent this push/pop lr optimization. -- carrot at google dot com changed: What|Removed |Added CC| |carrot at google dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39989
[Bug target/38570] [arm] -mthumb generates sub-optimal prolog/epilog
--- Comment #6 from carrot at google dot com 2009-05-04 02:21 --- We can compute the maximum possible function length first. If that length is not large enough to require a far jump, we can do this optimization safely. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38570
[Bug target/56993] New: power gcc built 416.gamess generates wrong result
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56993 Bug #: 56993 Summary: power gcc built 416.gamess generates wrong result Classification: Unclassified Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target AssignedTo: unassig...@gcc.gnu.org ReportedBy: car...@google.com Host: powerpc-linux-gnu Target: powerpc-linux-gnu Build: powerpc-linux-gnu When I use the trunk gcc to run spec2006 416.gamess, I got the following error $ runspec --config=test.cfg --tune=base --size=test --nofeedback --noreportable game runspec v6152 - Copyright 1999-2008 Standard Performance Evaluation Corporation Using 'linux-ydl23-ppc' tools Reading MANIFEST... 18357 files Loading runspec modules Locating benchmarks...found 31 benchmarks in 6 benchsets. Reading config file '/usr/local/google/carrot/spec2006/config/test.cfg' Benchmarks selected: 416.gamess Compiling Binaries Building 416.gamess base Linux64 default: (build_base_Linux64.) Build successes: 416.gamess(base) Setting Up Run Directories Setting up 416.gamess test base Linux64 default: created (run_base_test_Linux64.) Running Benchmarks Running (#1) 416.gamess test base Linux64 default Contents of exam29.err STOP IN ABRT *** Miscompare of exam29.out; for details see /usr/local/google/carrot/spec2006/benchspec/CPU2006/416.gamess/run/run_base_test_Linux64./exam29.out.mis Invalid run; unable to continue. If you wish to ignore errors please use '-I' or ignore_errors The log for this run is in /usr/local/google/carrot/spec2006/result/CPU2006.111.log The debug log for this run is in /usr/local/google/carrot/spec2006/result/CPU2006.111.log.debug * * Temporary files were NOT deleted; keeping temporaries such as * /usr/local/google/carrot/spec2006/result/CPU2006.111.log.debug and * /usr/local/google/carrot/spec2006/tmp/CPU2006.111 * (These may be large!) * runspec finished at Wed Apr 17 16:37:27 2013; 93 total seconds elapsed My gcc is configured as $ gcc -v Using built-in specs. 
COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/powerpc-linux-gnu/4.6/lto-wrapper Target: powerpc-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Debian 4.6.2-12' --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin --enable-objc-gc --enable-secureplt --disable-softfloat --enable-targets=powerpc-linux,powerpc64-linux --with-cpu=default32 --with-long-double-128 --enable-checking=release --build=powerpc-linux-gnu --host=powerpc-linux-gnu --target=powerpc-linux-gnu Thread model: posix gcc version 4.6.2 (Debian 4.6.2-12) GCC4.8 has the same error, but gcc4.7 is good.
[Bug other/54398] Incorrect ARM assembly when building with -fno-omit-frame-pointer -O2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398 Carrot changed: What|Removed |Added CC||carrot at google dot com --- Comment #4 from Carrot 2012-09-07 01:19:31 UTC --- The code before the position Ahmad pointed out is already wrong. The faulty instruction sequence is:

        asrs    r5, r5, #1
        asr     ip, ip, #1    ; A, tmp1.x
        asrs    r0, r0, #1    ; B, tmp1.y
        asrs    r6, r6, #1
        mov     r4, r1
        add     r8, ip, r6    ; C, tmp3.x
        add     r9, r0, r5    ; D, tmp3.y
        add     r7, sp, #0
        asr     r1, r8, #1
        add     ip, r4, #8    ; E
        asr     r9, r9, #1
        str     r1, [r7, #16]
        str     r9, [r7, #20]
        ldmia   r3, {r0, r1}  ; F
        stmia   r4, {r0, r1}

Instruction A computes tmp1.x, instruction C uses it to compute tmp3.x, and instruction E overwrites the value of tmp1.x. But in the source code, tmp1.x is still needed to execute "dst1->p2 = tmp1;", so in the end dst1->p2.x gets garbage. Similarly, instruction B computes tmp1.y, instruction D uses it to compute tmp3.y, and instruction F overwrites it. After executing "dst1->p2 = tmp1;", dst1->p2.y gets another garbage value. For comparison, the following is the correct version:

        asrs    r7, r7, #1    ; A, tmp1.x
        asrs    r0, r0, #1    ; B, tmp1.y
        asrs    r6, r6, #1
        asrs    r5, r5, #1
        sub     sp, sp, #28
        mov     r4, r1
        add     r8, r7, r6    ; C, tmp3.x
        add     ip, r0, r5    ; D, tmp3.y
        str     r7, [sp, #0]  ; X, save tmp1.x
        str     r0, [sp, #4]  ; Y, save tmp1.y
        asr     r1, ip, #1
        add     r7, r4, #8    ; E
        asr     r8, r8, #1
        str     r1, [sp, #20]
        str     r8, [sp, #16]
        ldmia   r3, {r0, r1}  ; F
        stmia   r4, {r0, r1}

The obvious difference is the extra instructions X and Y, which save the value of tmp1 to the stack before its registers are reused.
The simplified preprocessed source code is:

struct A {
    int x;
    int y;
    void f(const A &a, const A &b) {
        x = (a.x + b.x) >> 1;
        y = (a.y + b.y) >> 1;
    }
};

class C {
public:
    A p1;
    A p2;
    A p3;
    bool b;
    void g(C *, C *) const;
};

void C::g(C *dst1, C *dst2) const
{
    A tmp1, tmp2, tmp3;
    tmp1.f(p2, p1);
    tmp2.f(p2, p3);
    tmp3.f(tmp1, tmp2);
    dst1->p1 = p1;
    dst1->p2 = tmp1;
    dst1->p3 = dst2->p1 = tmp3;
    dst2->p2 = tmp2;
    dst2->p3 = p3;
}

The simplified command line is:

./cc1plus -fpreprocessed t.ii -quiet -dumpbase t.cpp -mthumb "-march=armv7-a" "-mtune=cortex-a15" -auxbase t -O2 -fno-omit-frame-pointer -o t.s

It looks like the dse2 pass did the wrong transformation. GCC 4.7 and trunk generate correct code.
[Bug other/54398] Incorrect ARM assembly when building with -fno-omit-frame-pointer -O2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398 --- Comment #5 from Carrot 2012-09-11 00:10:45 UTC --- It's a bug in the local dse sub-step in dse.c.

(insn/f 70 69 71 2 (set (reg/f:SI 7 r7)
        (plus:SI (reg/f:SI 13 sp)
            (const_int 0 [0]))) t.ii:24 -1
     (nil))

This insn sets up the hfp, r7.

(insn 12 30 17 2 (set (mem/s/c:SI (reg/f:SI 7 r7) [4 tmp1.x+0 S4 A64])
        (reg:SI 12 ip [orig:137 D.1799 ] [137])) t.ii:8 694 {*thumb2_movsi_insn}
     (nil))

This is the store instruction. The memory base address is r7, the hfp register; dse thinks the hfp is constant inside the function, so it gives the access a store group.

(insn 37 36 34 2 (set (reg/f:SI 8 r8 [170])
        (reg/f:SI 7 r7)) t.ii:32 694 {*thumb2_movsi_insn}
     (expr_list:REG_EQUIV (plus:SI (reg/f:SI 7 r7)
            (const_int 0 [0]))
        (nil)))

This insn moves r7 to r8; the value also equals sp.

(insn 38 35 39 2 (parallel [
            (set (reg:SI 0 r0)
                (mem/s/c:SI (reg/f:SI 8 r8 [170]) [3 tmp1+0 S4 A64]))
            (set (reg:SI 1 r1)
                (mem/s/c:SI (plus:SI (reg/f:SI 8 r8 [170])
                        (const_int 4 [0x4])) [3 tmp1+4 S4 A32]))
        ]) t.ii:32 369 {*ldm2_ia}
     (nil))

This is the load instruction. The memory base address is r8, and const_or_frame_p returns false for r8. After applying cselib_expand_value_rtx to r8 we get a base address of sp, for which const_or_frame_p still returns false. So the corresponding group id is -1 (no corresponding store group), and it can't match store insn 12. dse therefore considers the memory stored by insn 12 never used, hence a dead store, and eliminates it. The problem is that hfp-based addresses are considered constant base addresses, while sp and addresses derived from it are considered varied base addresses, so the two kinds are never matched when detecting interfering memory accesses. But in many cases sp and hfp are the same. Even worse, addresses copied or derived from the hfp can be recognized as derived from sp, as in this case, causing the memory accesses to be mismatched.
[Bug other/54398] Incorrect ARM assembly when building with -fno-omit-frame-pointer -O2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398 --- Comment #8 from Carrot 2012-09-12 20:57:33 UTC --- (In reply to comment #7) > > This rings a bell. > > Maybe the patch mentioned below needs backporting given Carrot is > reporting this against the 4.6 branch. What's not clear if this is > reproducible on anything later though. > > http://old.nabble.com/-PATCH--Prevent-cselib-substitution-of-FP,-SP,-SFP-td33080657.html > The patch can fix this bug.
[Bug c++/54574] New: G++ accepts parameters with wrong types in parent constructor
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54574 Bug #: 54574 Summary: G++ accepts parameters with wrong types in parent constructor Classification: Unclassified Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ AssignedTo: unassig...@gcc.gnu.org ReportedBy: car...@google.com When compiling the following source code, class C { public: C (int* Items[]); }; template class A : public C { public: A (int Items[]) : C (Items) {// C is called with wrong parameter type, expects int** }; }; int i[5]; A yyy(i); Trunk g++ silently accepts it. While clang produces following error message: cursesm.ii:11:7: error: no matching constructor for initialization of 'C' : C (Items) { ^ ~ cursesm.ii:4:3: note: candidate constructor not viable: no known conversion from 'int *' to 'int **' for 1st argument; take the address of the argument with & C (int* Items[]); ^ cursesm.ii:1:7: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'int *' to 'const C' for 1st argument; class C ^ cursesm.ii:10:3: error: constructor for 'A' must explicitly initialize the base class 'C' which does not have a default constructor A (int Items[]) ^ cursesm.ii:16:8: note: in instantiation of member function 'A::A' requested here A yyy(i); ^ cursesm.ii:1:7: note: 'C' declared here class C ^ It also impacts branches 4.6 and 4.7.
[Bug middle-end/41004] missed merge of basic blocks
--- Comment #4 from carrot at google dot com 2009-08-19 21:55 --- (In reply to comment #2) > Why does the basic block reordering pass also not handle this? > Basic block reordering is disabled with -Os. The basic block reordering algorithm targets performance only and usually increases code size, so it is not run when optimizing for size. But for this specific case, the extra branch is removed when I compile with -O2. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004
[Bug c++/3187] gcc lays down two copies of constructors
--- Comment #34 from carrot at google dot com 2009-08-27 01:40 --- There is one optimization we can do without affecting the ABI and linker compatibility. The deleting destructor (D0) always contains the content of the complete destructor (D1) followed by a function call to delete. So instead of cloning the abstract destructor's body into the deleting destructor (D0), we can generate a function call to the complete destructor (D1) followed by a function call to delete. -- carrot at google dot com changed: What|Removed |Added CC||carrot at google dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=3187
[Bug middle-end/41396] New: missed space optimization related to basic block reorder
Compile the attached source code with options -march=armv5te -mthumb -Os. I got:

        push    {r4, lr}
        ldr     r4, [r0, #8]
        ldr     r3, [r0, #4]
        b       .L2
.L7:
        ldr     r2, [r3, #8]
        ldr     r1, [r2]
        ldr     r2, [r3]
        add     r2, r1, r2
        ldr     r1, [r3, #4]
        ldr     r1, [r1]
        sub     r2, r2, r1
        ldr     r1, [r3, #12]
        cmp     r1, #1
        beq     .L4
        cmp     r1, #2
        bne     .L3
        b       .L12          // C
.L4:                          // ---BEGIN BLOCK B---
        ldr     r1, [r0]
        neg     r1, r1
        cmp     r2, r1
        bge     .L3
        b       .L9           // ---END BLOCK B---
.L12:                         // ---BEGIN BLOCK A---
        ldr     r1, [r0]
        cmp     r2, r1
        bgt     .L9
.L3:
        add     r3, r3, #16
.L2:
        cmp     r3, r4
        bcc     .L7
        mov     r0, #0
        b       .L6           // ---END BLOCK A---
.L9:
        mov     r0, #1
.L6:
        @ sp needed for prologue
        pop     {r4, pc}

If we swap the order of blocks A and B, we can remove two branch instructions: inst C and the branch at the end of block B. Do we need a new basic block reordering algorithm for code size optimization? -- Summary: missed space optimization related to basic block reorder Product: gcc Version: 4.5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: carrot at google dot com GCC build triplet: i686-linux GCC host triplet: i686-linux GCC target triplet: arm-eabi http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396
[Bug middle-end/41396] missed space optimization related to basic block reorder
--- Comment #1 from carrot at google dot com 2009-09-18 07:57 --- Created an attachment (id=18602) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18602&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396
[Bug tree-optimization/41442] New: missed optimization for boolean expression
The boolean expression ((p1->next && !p2->next) || p2->next) can be simplified
to (p1->next || p2->next), but gcc fails to detect this. The attached source
code is an example; compiling it with options -Os -march=armv5te -mthumb, I got

        push    {lr}
        ldr     r3, [r0]
        cmp     r3, #0
        beq     .L2
        ldr     r3, [r1]        // redundant load and comparison
        mov     r0, #0
        cmp     r3, #0          //
        beq     .L3             // can branch to L3 directly
.L2:
        ldr     r0, [r1]
        neg     r3, r0
        adc     r0, r0, r3
.L3:
        @ sp needed for prologue
        pop     {pc}

--
           Summary: missed optimization for boolean expression
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41442
[Bug tree-optimization/41442] missed optimization for boolean expression
--- Comment #1 from carrot at google dot com 2009-09-23 06:49 --- Created an attachment (id=18634) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18634&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41442
[Bug target/41481] New: missed optimization in cse
Compile the following code with options -Os -march=armv5te -mthumb:

class A
{
public:
    int ah;
    unsigned field : 2;
};

void foo(A* p)
{
    p->ah = 1;
    p->field = 1;
}

We get:

        mov     r3, #1          // A
        str     r3, [r0]
        ldrb    r3, [r0, #4]
        mov     r2, #3
        bic     r3, r3, r2
        mov     r2, #1          // B
        orr     r3, r3, r2
        strb    r3, [r0, #4]
        @ sp needed for prologue
        bx      lr

Both instruction A and instruction B load the constant 1 into a register. We
could load 1 into a register at instruction A and reuse it wherever the
constant 1 is required, so instruction B could be removed. The cse pass doesn't
find this opportunity because it requires all expressions to have the same
mode, but at the rtl level the first 1 is in mode SI and the second 1 is in
mode QI. ARM doesn't have any physical registers of QI mode, so both values end
up in 32-bit physical registers, causing the redundant load of the constant 1.

--
           Summary: missed optimization in cse
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41481
[Bug target/41481] missed optimization in cse
--- Comment #1 from carrot at google dot com 2009-09-27 09:13 --- Created an attachment (id=18662) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18662&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41481
[Bug target/41514] New: redundant compare instruction of consecutive conditional branches
Compile the attached source code with options -Os -march=armv5te -mthumb, gcc
generates:

        push    {lr}
        cmp     r0, #63         // A
        beq     .L3
        cmp     r0, #63         // B
        bhi     .L4
        cmp     r0, #45
        beq     .L3
        cmp     r0, #47
        bne     .L5
        b       .L3
.L4:
        cmp     r0, #99
        bne     .L5
.L3:
        mov     r0, #1
        b       .L2
.L5:
        mov     r0, #0
.L2:
        @ sp needed for prologue
        pop     {pc}

Instruction B is the same as instruction A, and there are no other instructions
between them that clobber the condition codes. So we can remove instruction B.

--
           Summary: redundant compare instruction of consecutive conditional
                    branches
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41514
[Bug target/41514] redundant compare instruction of consecutive conditional branches
--- Comment #1 from carrot at google dot com 2009-09-30 08:25 --- Created an attachment (id=18671) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18671&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41514
[Bug target/41514] redundant compare instruction of consecutive conditional branches
--- Comment #3 from carrot at google dot com 2009-10-01 07:37 ---
(In reply to comment #2)
> Where does it come from? (Remember: option -dAP, then look at .s file)

The first several instructions and corresponding rtl patterns are:

        cmp     r0, #63
        beq     .L3
        cmp     r0, #63
        bhi     .L4

(jump_insn 8 3 35 src/./test5.c:3 (set (pc)
        (if_then_else (eq (reg/v:SI 0 r0 [orig:135 ch ] [135])
                (const_int 63 [0x3f]))
            (label_ref 18)
            (pc))) 201 {*cbranchsi4_insn}
    (expr_list:REG_BR_PROB (const_int 2900 [0xb54])
        (nil))
 -> 18)

(note 35 8 9 [bb 3] NOTE_INSN_BASIC_BLOCK)

(jump_insn 9 35 36 src/./test5.c:3 (set (pc)
        (if_then_else (gtu (reg/v:SI 0 r0 [orig:135 ch ] [135])
                (const_int 63 [0x3f]))
            (label_ref 14)
            (pc))) 201 {*cbranchsi4_insn}
    (expr_list:REG_BR_PROB (const_int 5000 [0x1388])
        (nil))
 -> 14)

In thumb's instruction patterns, compare and branch instructions can't be
expressed separately, so we can't easily remove the second compare instruction
in the middle end.

I just noticed that the second conditional branch (greater than 63) is totally
unnecessary if we compare for equality with 63, 45, 47 and 99 one by one. This
is another missed optimization exposed by this test case.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41514
[Bug target/41653] New: not optimal result for multiplication with constant when -Os is specified
Compile the following code with options -Os -mthumb -march=armv5te:

int mul12(int x)
{
    return x*12;
}

Gcc generates:

        lsl     r3, r0, #1
        add     r0, r3, r0
        lsl     r0, r0, #2
        @ sp needed for prologue
        bx      lr

This code sequence may be good for speed. But when we optimize for size, we can
get a shorter code sequence:

        mov     r3, #12
        mul     r0, r3, r0
        bx      lr

This code is generated by the expand pass. We may consider generating different
instructions when optimizing for size. This kind of multiplication is usually
found when computing the address of an array element.

--
           Summary: not optimal result for multiplication with constant when
                    -Os is specified
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41653
[Bug target/41705] New: missed if conversion optimization
Compile the attached source code with options -Os -march=armv5te -mthumb, gcc
generates:

        push    {lr}
        ldr     r3, [r0]
        cmp     r3, #0          // B
        bne     .L3
        ldr     r3, [r0, #4]
        b       .L2
.L3:
        mov     r3, #0          // A
.L2:
        ldr     r2, [r0, #8]
        @ sp needed for prologue
        ldr     r0, [r2]
        add     r0, r3, r0
        pop     {pc}

Instruction A can be moved before instruction B; this should be handled by
ifcvt.c:find_if_case_2. Notice the following code in find_if_header:

  if (dom_info_state (CDI_POST_DOMINATORS) >= DOM_NO_FAST_QUERY
      && (! HAVE_conditional_execution || reload_completed))
    {
      if (find_if_case_1 (test_bb, then_edge, else_edge))
        goto success;
      if (find_if_case_2 (test_bb, then_edge, else_edge))
        goto success;
    }

After reload_completed, the target of the conditional assignment happens to be
allocated to the same physical register as the condition variable, which
prevents it from being moved in front of the compare and branch instructions.
Before reload_completed, HAVE_conditional_execution prevents find_if_case_2
from being called. So we miss this optimization chance.

Target ARM has conditional execution capability, but thumb actually can't do
conditional execution. Do we have any method to let the compiler know this?

--
           Summary: missed if conversion optimization
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705
[Bug target/41705] missed if conversion optimization
--- Comment #1 from carrot at google dot com 2009-10-14 09:29 --- Created an attachment (id=18798) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18798&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705
[Bug target/41653] not optimal result for multiplication with constant when -Os is specified
--- Comment #2 from carrot at google dot com 2009-10-15 08:18 ---
arm_size_rtx_costs calls thumb1_rtx_costs for TARGET_THUMB1. thumb1_rtx_costs
is also called by several other functions. Looking at its implementation
briefly, it is actually tuned for speed only. Here are some obvious examples:

    case UDIV:
    case UMOD:
    case DIV:
    case MOD:
      return 100;

    case TRUNCATE:
      return 99;

So a new function thumb1_size_rtx_costs is required to model thumb1's size
characteristics, right?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41653
[Bug target/41705] missed if conversion optimization
--- Comment #3 from carrot at google dot com 2009-10-15 08:25 ---
> > Target ARM has conditional execution capability, but thumb actually can't
> > do conditional execution. Do we have any method to let the compiler know
> > this?
>
> Note that this is relevant only for Thumb1 and not for Thumb2. Thumb2 has
> conditional code generation and GCC does make an effort to generate
> conditional code for it.
>
> Can we work around this by undef'ing HAVE_conditional_execution in the
> backend headers and defining this to TARGET_THUMB1 ?

I will try this method, thank you.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705
[Bug tree-optimization/41778] New: missed dead store elimination
Compile the attached source code with options -Os -march=armv5te -mthumb, gcc
generates:

        push    {lr}
        ldr     r3, [r1, #4]    // redundant
        ldrb    r3, [r3]        // redundant
        @ sp needed for prologue
        pop     {pc}

There are two redundant instructions. Compiling it with options -O2
-march=armv5te -mthumb, gcc generates the following expected result:

foo:
        @ sp needed for prologue
        bx      lr

The optimization done at -O2 comes from this patch:
http://gcc.gnu.org/viewcvs?view=revision&revision=145172. But that code is in
the pre pass, which is disabled when -Os is specified, so the unoptimized code
is passed on to the rtl passes. In the rtl passes the dead store is caught and
removed along with some of the related code, but not all of it, so we can still
see the two redundant instructions. We should also add this optimization to the
dead store elimination pass to benefit the -Os case.

--
           Summary: missed dead store elimination
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: carrot at google dot com
 GCC build triplet: i686-linux
  GCC host triplet: i686-linux
GCC target triplet: arm-eabi

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41778
[Bug tree-optimization/41778] missed dead store elimination
--- Comment #1 from carrot at google dot com 2009-10-21 08:50 --- Created an attachment (id=18850) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=18850&action=view) test case -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41778
[Bug middle-end/41762] internal compiler error when compiling xorg-server
--- Comment #9 from carrot at google dot com 2009-10-23 09:15 ---
(In reply to comment #5)
> This is fixed on trunk by revision 149082:
>
> http://gcc.gnu.org/ml/gcc-cvs/2009-06/msg01067.html

The patch in revision 149082 contains two parts:
1. It fixes a wrong optimization in tree-ssa-sink.c; this affects performance
   only.
2. It fixes an i386 back-end bug in i386.c.

I've tried the bug-fixing code in i386.c; unfortunately it doesn't work. So it
looks more like the better optimization in the patch hides an unknown bug.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41762
[Bug target/41705] missed if conversion optimization
--- Comment #4 from carrot at google dot com 2009-10-27 09:15 --- A patch http://gcc.gnu.org/viewcvs?view=revision&revision=153584 has been checked in. -- carrot at google dot com changed: What|Removed |Added Status|NEW |RESOLVED Resolution||FIXED http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41705
[Bug target/47133] New: code size opportunity for boolean expression evaluation
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47133

           Summary: code size opportunity for boolean expression evaluation
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: car...@google.com
            Target: arm-eabi

Compile the following code with options -march=armv7-a -mthumb -Os:

struct S
{
    int f1, f2;
};

int t04(int x, struct S* p)
{
    return p->f1 == 9 && p->f2 == 0;
}

GCC 4.6 generates:

t04:
        ldr     r3, [r1, #0]
        cmp     r3, #9          // A
        bne     .L3
        ldr     r0, [r1, #4]
        rsbs    r0, r0, #1
        it      cc
        movcc   r0, #0
        bx      lr              // C
.L3:
        movs    r0, #0          // B
        bx      lr

Instruction B can be moved before instruction A, and instruction C can then be
removed:

t04:
        ldr     r3, [r1, #0]
        movs    r0, #0
        cmp     r3, #9
        bne     .L3
        ldr     r0, [r1, #4]
        rsbs    r0, r0, #1
        it      cc
        movcc   r0, #0
.L3:
        bx      lr

When compiled to arm instructions, it has the same problem. The transformation
should be enabled for code size optimization only, because it may execute one
more instruction at run time. Looks like an if-conversion opportunity.
[Bug rtl-optimization/47373] New: avoid goto table to reduce code size when optimized for size
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47373

           Summary: avoid goto table to reduce code size when optimized for
                    size
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: car...@google.com
              Host: linux
            Target: arm-linux-androideabi

Created attachment 23040
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23040
modified testcase

When I compiled infback.c from zlib 1.2.5 with options -march=armv7-a -mthumb
-Os, gcc 4.6 generated the following code for a large switch statement:

        subs    r3, r3, #11
        cmp     r3, #18
        bhi     .L16
        tbh     [pc, r3, lsl #1]
.L23:
        .2byte  (.L17-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L18-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L154-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L20-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L16-.L23)/2
        .2byte  (.L21-.L23)/2
        .2byte  (.L121-.L23)/2
.L17:

GCC generates a goto table for 19 cases. The table and the instructions that
manipulate it occupy 19*2 + 10 = 48 bytes. Actually most of the targets in the
table are the same: there are only 6 targets other than .L16. So if we
generated a sequence of cmp & br instructions instead, we would need only 6
cmp&br pairs and one br to the default, that's only 4*6 + 2 = 26 bytes.

When I randomly modified the source code, gcc sometimes generated absolute
addresses in the goto table, doubling the table size and making the result
worse. The modified source code is attached.
[Bug rtl-optimization/47454] New: registers are not allocated according to its preferred order
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47454

           Summary: registers are not allocated according to its preferred
                    order
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: car...@google.com
            Target: arm-linux-androideabi

Created attachment 23115
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23115
testcase

The attached test case is extracted from zlib. When compiled by gcc 4.6 with
options -march=armv7-a -mthumb -Os, I got the following code:

        push    {r4, r5, r6, r7, r8, lr}
        mov     r6, r0
        mov     r4, r1
        cmp     r0, #0
        beq     .L2
        ldr     r5, [r0, #0]
        cmp     r5, #0
        beq     .L2
        cmp     r1, #0
        bge     .L3
        negs    r4, r1
        movs    r7, #0
        b       .L4
.L3:
        asrs    r7, r1, #4
        adds    r7, r7, #1
        cmp     r1, #47
        it      le
        andle   r4, r1, #15
.L4:
        adds    r3, r4, #0
        sub     r8, r4, #8
        it      ne
        movne   r3, #1
        cmp     r8, #7
        ite     ls
        movls   r8, #0
        andhi   r8, r3, #1
        cmp     r8, #0
        bne     .L2
        ldr     r1, [r5, #8]
        cbz     r1, .L5
        ldr     r3, [r5, #4]
        cmp     r3, r4
        beq     .L5
        ldr     r3, [r6, #4]
        ldr     r0, [r6, #8]
        blx     r3
        str     r8, [r5, #8]
.L5:
        str     r7, [r5, #0]
        mov     r0, r6
        str     r4, [r5, #4]
        pop     {r4, r5, r6, r7, r8, lr}
        b       inflateReset
.L2:
        mvn     r0, #1
        pop     {r4, r5, r6, r7, r8, pc}

Note that register r8 is used many times, but register r2 is never used. In
Thumb-2, r8 is a high register, and using it forces 32-bit instructions. If we
replaced r8 with r2, a lot of code size would be saved in this case.

In arm.h, REG_ALLOC_ORDER is defined as

    3, 2, 1, 0, 12, 14, 4, 5,
    6, 7, 8, 10, 9, 11, 13, 15,
    ...

We can see that r2 should be used before r8, but that is not what happens.
[Bug rtl-optimization/47454] registers are not allocated according to its preferred order
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47454

--- Comment #3 from Carrot 2011-01-31 08:48:40 UTC ---
(In reply to comment #2)
> -frename-registers should help for this issue on the ARM.

All uses of r8 could be renamed to r2, but in this case only two of them were
actually renamed.
[Bug target/47764] New: The constant load instruction should be hoisted out of loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

           Summary: The constant load instruction should be hoisted out of
                    loop
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: car...@google.com
            Target: arm-linux-androideabi

Created attachment 23359
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23359
testcase

The attached test case is extracted from zlib. Compile it with options
-march=armv7-a -mthumb -Os, gcc 4.6 generates:

init_block:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movs    r3, #0
.L2:
        adds    r2, r0, r3
        adds    r3, r3, #4
        movs    r1, #0          // A
        cmp     r3, #1144
        strh    r1, [r2, #60]   @ movhi   // B
        bne     .L2
        movs    r3, #0
.L3:
        adds    r2, r0, r3
        adds    r3, r3, #4
        movs    r1, #0          // C
        cmp     r3, #120
        strh    r1, [r2, #2352] @ movhi
        bne     .L3
        movs    r2, #0
.L4:
        adds    r1, r0, r2
        adds    r2, r2, #4
        movs    r3, #0          // D
        cmp     r2, #76
        strh    r3, [r1, #2596] @ movhi
        bne     .L4
        movs    r2, #1
        str     r3, [r0, #2760]
        strh    r2, [r0, #1084] @ movhi
        str     r3, [r0, #2756]
        str     r3, [r0, #2764]
        str     r3, [r0, #2752]
        bx      lr

Note that instruction A in loop L2 loads the constant 0 into register r1, then
instruction B stores r1 into memory. There is no other use of r1 in the loop,
so it's better to move instruction A out of the loop. Similarly, instruction C
can be moved out of loop L3; actually it can be removed entirely, since after
instruction A the register r1 already contains 0 and no instruction modifies it
later. Similarly, instruction D can be moved out of loop L4; it too could be
removed if we exchanged the usage of r1 and r3 in loop L4.
[Bug target/47777] New: use __aeabi_idivmod to compute quotient and remainder at the same time
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47777

           Summary: use __aeabi_idivmod to compute quotient and remainder
                    at the same time
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: car...@google.com
            Target: arm-eabi

Compile the following source code with options -march=armv7-a -O2:

int t06(int x, int y)
{
    int a = x / y;
    int b = x % y;
    return a+b;
}

GCC 4.6 generates:

t06:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        stmfd   sp!, {r4, r5, r6, lr}
        mov     r6, r0
        mov     r5, r1
        bl      __aeabi_idiv
        mov     r1, r5
        mov     r4, r0
        mov     r0, r6
        bl      __aeabi_idivmod
        add     r0, r4, r1
        ldmfd   sp!, {r4, r5, r6, pc}

It calls the function __aeabi_idiv to compute the quotient and calls
__aeabi_idivmod to compute the remainder. Actually, __aeabi_idivmod computes
the quotient and remainder at the same time. By taking advantage of this we can
simplify the code to:

        push    {r4, lr}
        bl      __aeabi_idivmod
        add     r0, r0, r1
        pop     {r4, pc}
[Bug target/47764] The constant load instruction should be hoisted out of loop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47764

--- Comment #3 from Carrot 2011-02-21 03:15:45 UTC ---
> Any ideas of how this improvement could be implemented, Carrot?

The root cause of this problem is that an arm/thumb store instruction can't
store an immediate number directly to memory, but gcc doesn't realize this
early enough. Through most of the rtl phase, the following form is kept:

(insn 41 38 42 3 (set (mem:HI (plus:SI (reg/f:SI 169)
                (const_int 60 [0x3c])) [2 MEM[(struct deflate_state *)D.2085_3 + 60B]+0 S2 A16])
        (const_int 0 [0])) src/trees.c:45 696 {*thumb2_movhi_insn}
     (expr_list:REG_DEAD (reg/f:SI 169)
        (nil)))

Only at register allocation does gcc discover the restriction on the store
instruction and split it into two instructions: load 0 into a register, and
store the register to memory. But that is too late for loop optimization.

One possible method is to split this insn earlier than loop optimization (maybe
directly in the expand pass), and let the loop and cse optimizations do the
rest. It may increase register pressure in parts of the program; we should
rematerialize the constant in such cases.
[Bug target/47831] New: avoid if-conversion if the conditional instructions and the following conditional branch have the same condition
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47831

           Summary: avoid if-conversion if the conditional instructions and
                    the following conditional branch have the same condition
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: car...@google.com
            Target: arm-linux-androideabi

Created attachment 23423
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23423
testcase

Compile the attached source code with options -march=armv7-a -mthumb -Os, GCC
4.6 generates:

ras_validate:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r0, r1, r4, r5, r6, lr}
        add     r4, sp, #4
        movs    r2, #4
        mov     r1, r4
        mov     r5, r0
        bl      foo
        cmp     r0, #0
        it      ge              // A
        movge   r6, r0          // B
        bge     .L3             // C
        b       .L7             // D
.L4:
        adds    r3, r6, r4
        mov     r0, r5
        subs    r6, r6, #1
        ldrb    r1, [r3, #-1]   @ zero_extendqisi2
        bl      bar
        adds    r3, r0, #1
        beq     .L2
.L3:
        cmp     r6, #0
        bne     .L4
        mov     r0, r6
        b       .L2
.L7:
        mov     r0, #-1
.L2:
        pop     {r2, r3, r4, r5, r6, pc}

The instruction sequence ABCD can be replaced with

        blt     .L7
        mov     r6, r0
        b       .L3

In both cases (lt or ge) the executed instruction sequence is no longer than
the original code. So it's shorter and faster.