https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61241

            Bug ID: 61241
           Summary: built-in memset makes the caller function slower than
                    normal memset
           Product: gcc
           Version: 4.10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ma.jiang at zte dot com.cn

Compiled with -O2,

#include <string.h>
extern int off;
void *test(char *a1, char* a2)
{
        memset(a2, 123, 123);
        return a2 + off;
}

gives the following result.

        mov     ip, r1
        mov     r1, #123
        stmfd   sp!, {r3, lr}
        mov     r0, ip
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        mov     ip, r0
        ldr     r0, [r3]
        add     r0, ip, r0
        ldmfd   sp!, {r3, pc}

After adding -fno-builtin, the assembly code becomes shorter.

        stmfd   sp!, {r4, lr}
        mov     r4, r1
        mov     r1, #123
        mov     r0, r4
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        ldr     r0, [r3]
        add     r0, r4, r0
        ldmfd   sp!, {r4, pc}

One reason is that the ARM EABI requires the stack to be aligned to 8 bytes, so the compiler pushes a meaningless r3. But that is not the most important reason.

When using built-in memset, IRA knows that memset does not change the value
of r0 (the call returns its first argument). Choosing r0 instead of ip is then
clearly more profitable, because that choice gets rid of the redundant
"mov ip, r0; mov r0, ip" pair.

For this rtl sequence:

(insn 7 8 9 2 (set (reg:SI 0 r0)
        (reg/v/f:SI 115 [ a2 ])) open_test.c:5 186 {*arm_movsi_insn}
     (nil))
(insn 9 7 10 2 (set (reg:SI 2 r2)
        (reg:SI 1 r1)) open_test.c:5 186 {*arm_movsi_insn}
     (expr_list:REG_EQUAL (const_int 123 [0x7b])
        (nil)))
(call_insn 10 9 24 2 (parallel [
            (set (reg:SI 0 r0)
                (call (mem:SI (symbol_ref:SI ("memset") [flags 0x41] 
<function_decl 0xb7d72500 memset>) [0 __builtin_memset S4 A32])
                    (const_int 0 [0])))
            (use (const_int 0 [0]))
            (clobber (reg:SI 14 lr))
        ]) open_test.c:5 251 {*call_value_symbol}
     (expr_list:REG_RETURNED (reg/v/f:SI 115 [ a2 ])
        (expr_list:REG_DEAD (reg:SI 2 r2)
            (expr_list:REG_DEAD (reg:SI 1 r1)
                (expr_list:REG_UNUSED (reg:SI 0 r0)
                    (expr_list:REG_EH_REGION (const_int 0 [0])
                        (nil))))))
    (expr_list:REG_CFA_WINDOW_SAVE (set (reg:SI 0 r0)
            (reg:SI 0 r0))
        (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 2 r2))
            (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 1 r1))
                (expr_list:REG_CFA_WINDOW_SAVE (use (reg:SI 0 r0))
                    (nil))))))

Assigning r0 to r115 is blocked by two pieces of code in
process_bb_node_lives (in ira-lives.c).

1:
      call_p = CALL_P (insn);
      for (def_rec = DF_INSN_DEFS (insn); *def_rec; def_rec++)
        if (!call_p || !DF_REF_FLAGS_IS_SET (*def_rec, DF_REF_MAY_CLOBBER))
          mark_ref_live (*def_rec);
2:
      /* Mark each used value as live.  */
      for (use_rec = DF_INSN_USES (insn); *use_rec; use_rec++)
        mark_ref_live (*use_rec);

In piece 1, "set (reg:SI 0) (reg/v/f:SI 115)" makes r0 conflict with
r115 while r115 is live. This is unnecessary, as "set (reg:SI 0) (reg:SI 0)"
would not hurt any other instruction. Making r0 conflict with all live pseudo
registers loses the chance to optimize a set instruction. I think that, at
least for a simple single set, we should not make the source register conflict
with the destination register when one of them is a hard register and the
other is not.

In piece 2, after the call to memset, r0 becomes live and then conflicts with
the live r115. This code neglects that r115 is the result of
find_call_crossed_cheap_reg, so in fact r115 holds the same value as r0.

As discussed above, these two pieces of code prevent IRA from making the more
profitable choice. I have built a patch to fix this problem. After the patch,
the assembly code with built-in memset becomes shorter than with normal memset.

        mov     r0, r1
        mov     r1, #123
        stmfd   sp!, {r3, lr}
        mov     r2, r1
        bl      memset
        movw    r3, #:lower16:off
        movt    r3, #:upper16:off
        ldr     r3, [r3]
        add     r0, r0, r3
        ldmfd   sp!, {r3, pc}

I have done a bootstrap and "make check" on x86; nothing changed after the
patch. Is the patch OK for trunk?
