https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241
Bug ID: 61241
Summary: built-in memset makes the caller function slower than normal memset
Product: gcc
Version: 4.10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ma.jiang at zte dot com.cn

The following test case, compiled with -O2:

#include <string.h>

extern int off;

void *test(char *a1, char *a2)
{
  memset(a2, 123, 123);
  return a2 + off;
}

gives this result:

        mov     ip, r1
        mov     r1, #123
        stmfd   sp!, {r3, lr}
        mov     r0, ip
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        mov     ip, r0
        ldr     r0, [r3]
        add     r0, ip, r0
        ldmfd   sp!, {r3, pc}

After adding -fno-builtin, the assembly becomes shorter:

        stmfd   sp!, {r4, lr}
        mov     r4, r1
        mov     r1, #123
        mov     r0, r4
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        ldr     r0, [r3]
        add     r0, r4, r0
        ldmfd   sp!, {r4, pc}

One reason is that the ARM EABI requires the stack to be aligned to 8 bytes, so a meaningless r3 is pushed. But that is not the most important reason. When memset is recognized as a built-in, IRA can know that memset returns its first argument unchanged in r0 (see the REG_RETURNED note below). Choosing r0 instead of ip for a2 is then clearly more profitable, because that choice gets rid of the redundant "mov ip, r0" / "mov r0, ip" pair. For this RTL sequence:

(insn 7 8 9 2 (set (reg:SI 0 r0)
        (reg/v/f:SI 115 [ a2 ])) open_test.c:5 186 {*arm_movsi_insn}
     (nil))

(insn 9 7 10 2 (set (reg:SI 2 r2)
        (reg:SI 1 r1)) open_test.c:5 186 {*arm_movsi_insn}
     (expr_list:REG_EQUAL (const_int 123 [0x7b])
        (nil)))

(call_insn 10 9 24 2 (parallel [
            (set (reg:SI 0 r0)
                (call (mem:SI (symbol_ref:SI ("memset") [flags 0x41] <function_decl 0xb7d72500 memset>) [0 __builtin_memset S4 A32])
                    (const_int 0 [0])))
            (use (const_int 0 [0]))
            (clobber (reg:SI 14 lr))
        ]) open_test.c:5 251 {*call_value_symbol}
     (expr_list:REG_RETURNED (reg/v/f:SI 115 [ a2 ])
        (expr_list:REG_DEAD (reg:SI 2 r2)
            (expr_list:REG_DEAD (reg:SI 1 r1)
                (expr_list:REG_UNUSED (reg:SI 0 r0)
                    (expr_list:REG_EH_REGION (const_int 0 [0])
                        (nil))))))
    (expr_list:REG_CFA_WINDOW_SAVE (set (reg:SI 0 r0)
            (reg:SI 0 r0))
        (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 2 r2))
            (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 1 r1))
                (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 0 r0))
                    (nil))))))

assigning r0 to r115 is blocked by two pieces of code in process_bb_node_lives (in ira-lives.c).

Piece 1:

      call_p = CALL_P (insn);
      for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
        if (!call_p || !DF_REF_FLAGS_IS_SET (*def_rec, DF_REF_MAY_CLOBBER))
          mark_ref_live (*def_rec);

Piece 2:

      /* Mark each used value as live.  */
      for (use_rec = DF_INSN_USES (insn); *use_rec; use_rec++)
        mark_ref_live (*use_rec);

In piece 1, insn 7, "(set (reg:SI 0 r0) (reg/v/f:SI 115))", makes r0 conflict with r115 while r115 is live. This is unnecessary: if the two were assigned the same register, the copy would become "(set (reg:SI 0 r0) (reg:SI 0 r0))", which hurts no other instruction. Making r0 conflict with every live pseudo register loses the chance to optimize such copies away. I think that, at least for a simple single set, we should not make the source register conflict with the destination register when one of them is a hard register and the other is not.

In piece 2, after the call to memset, r0 becomes live and then conflicts with the live r115. This code neglects the fact that r115 is the register found by find_call_crossed_cheap_reg, so r115 in fact holds the same value as r0.

As discussed above, these two pieces of code prevent IRA from making the more profitable choice.
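The direction of the fix, as a minimal sketch (illustrative only, assuming the GCC 4.x ira-lives.c context; the flag name is hypothetical and not taken from the actual patch): recognize the simple single-set case up front, then let the two loops above consult it.

  /* Sketch only: detect a simple reg-to-reg single set where exactly
     one side is a hard register.  For such an insn the source and the
     destination may safely share a register, so piece 1 need not
     record a conflict between them, and piece 2 need not make the
     returned hard register conflict with the pseudo that
     find_call_crossed_cheap_reg proved equal to it.  */
  rtx set = single_set (insn);
  bool skip_src_dest_conflict_p   /* hypothetical name */
    = (set != NULL_RTX
       && REG_P (SET_DEST (set))
       && REG_P (SET_SRC (set))
       && (HARD_REGISTER_P (SET_DEST (set))
           != HARD_REGISTER_P (SET_SRC (set))));

When such a predicate holds, mark_ref_live could skip adding the conflict between the two operands, which is what would allow r115 to be assigned r0 here.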
I have built a patch to fix this problem. After the patch, the assembly with built-in memset becomes shorter than with normal memset:

        mov     r0, r1
        mov     r1, #123
        stmfd   sp!, {r3, lr}
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        ldr     r3, [r3]
        add     r0, r0, r3
        ldmfd   sp!, {r3, pc}

I have done a bootstrap and "make check" on x86; nothing changed after the patch. Is the patch OK for trunk?