Hello, I am tracking a performance and size regression from 4.5.4 present in trunk. Consider the following function: == extern short delayLength; typedef int Sample; extern Sample *temp_ptr; extern Sample x;
void foo (short blockSize) { short i; unsigned short loopCount; loopCount = (unsigned short) (blockSize + delayLength) % 8; for (i = 0; i < loopCount; i++) *temp_ptr++ = x ^ *temp_ptr++; } == For v850, before the commit commit e0ae2fe2a0bebe9de31e3d8eb4feace4909ef009 Author: vries <vries@138bc75d-0d04-0410-961f-82ee72b054a4> Date: Fri May 20 19:32:30 2011 +0000 2011-05-20 Tom de Vries <t...@codesourcery.com> PR target/45098 * tree-ssa-loop-ivopts.c: Include expmed.h. (get_shiftadd_cost): New function. (force_expr_to_var_cost): Declare forward. Use get_shiftadd_cost. git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@173976 138bc75d-0d04-0410-961f-82ee72b054a4 gcc generated for -O2: _foo: movhi hi(_delayLength),r0,r10 ld.h lo(_delayLength)[r10],r10 add r10,r6 andi 7,r6,r10 be .L1 movhi hi(_temp_ptr),r0,r16 ld.w lo(_temp_ptr)[r16],r15 mov r10,r17 shl 2,r17 mov r15,r14 movhi hi(_x),r0,r13 mov r15,r10 add r17,r14 movea lo(_x),r13,r13 .L3: ld.w 0[r10],r11 ld.w 0[r13],r12 xor r12,r11 st.w r11,0[r10] add 4,r10 cmp r14,r10 bne .L3 mov r15,r10 add r17,r10 st.w r10,lo(_temp_ptr)[r16] .L1: jmp [r31] After the commit it generates: _foo: movhi hi(_delayLength),r0,r10 ld.h lo(_delayLength)[r10],r16 add r16,r6 andi 7,r6,r16 be .L1 movhi hi(_temp_ptr),r0,r17 ld.w lo(_temp_ptr)[r17],r18 movhi hi(_x),r0,r14 mov r18,r11 mov r16,r15 mov 0,r10 movea lo(_x),r14,r14 .L3: ld.w 0[r11],r12 ld.w 0[r14],r13 add 1,r10 xor r13,r12 shl 16,r10 st.w r12,0[r11] sar 16,r10 add 4,r11 cmp r15,r10 bne .L3 shl 2,r16 add r18,r16 st.w r16,lo(_temp_ptr)[r17] .L1: jmp [r31] The problem is inside the loop: shl 16,r10 st.w r12,0[r11] sar 16,r10 add 4,r11 cmp r15,r10 shl followed by sar is used to sign extend r10 which was in previous gcc versions not being done and it is unnecessary. At the point of commit v850 didn't have e3v5 support or zero overhead loops but now it does and this blocks generation of zero overhead loops. (with trunk and -mv850e3v5, gcc generates a sxh instruction instead of the shift pattern but the point is the same). For mep the situation repeats. mep generates: foo: # frame: 8 8 regs lh $10, %sdaoff(delayLength)($gp) add $sp, -8 add3 $1, $1, $10 and3 $10, $1, 0x7 beqz $10, .L1 lw $2, %sdaoff(temp_ptr)($gp) mov $1, 0 add3 $11, $gp, %sdaoff(x) bra .L5 .L3: mov $2, $9 .L5: lw $0, ($11) lw $3, 4($2) add $1, 1 exth $1 xor $3, $0 slt3 $0, $1, $10 add3 $9, $2, 8 sw $3, ($2) bnez $0, .L3 sw $9, %sdaoff(temp_ptr)($gp) .L1: add $sp, 8 ret Again exth signextends $1 and blocks generation of zero overhead loop because suddenly loop is not simple anymore. Unfortunately I cannot test mep before the patch as at the time mep was not in mainline. Does anyone understand why the mentioned patch is forcing the generation of the sign extend inside the loop? Is this just a problem with cost calculation in the backends or some issue lurking in tree-ssa-loop-ivopts? Thanks, Paulo Matos