I've noticed while tinkering with 3.4 and 4.1 that some code sequences turn out much better in 4.1. However, other code sequences turn out substantially worse in 4.1.
The most frustrating is the reduction in use of postmodify addressing modes. It looks like tree-ssa-loop-ivopts converts a loop like:

    for (i = 0; i < MAX; i++) {
        sum += a[i];
    }

into something like:

    for (ivtmp = 0; ivtmp < MAX*4; ivtmp += 4) {
        sum += *(a+ivtmp);
    }

which is fine, except that by the time we get to RTL, the load in the first loop form is converted in GCC 3.4 into a load with postincrement, while the RTL optimizers in 4.1 turn the address in the second form into an add of ivtmp and 4, an add of ivtmp and a, and a load.

Similarly, I can do mulsidi3 fast but muldi3 not so fast. If I have a code sequence like:

    typedef long long int Word64;
    extern short sat64_16(Word64 x);

    #define MAC(C,A,B) ((C) + ((Word64)(A) * (B)))
    #define SEQ(X) { \
        c1 = *coef; coef++; \
        c2 = *coef; coef++; \
        vLo = *(vb1+(X)); \
        vHi = *(vb1+(23-(X))); \
        sum1L = MAC(sum1L,vLo,c1); \
        sum2L = MAC(sum2L,vLo,c2); \
        sum1L = MAC(sum1L,vHi,-c2); \
        sum2L = MAC(sum2L,vHi,c1); \
        vLo = *(vb1+32+(X)); \
        vHi = *(vb1+32+(23-(X))); \
        sum1R = MAC(sum1R,vLo,c1); \
        sum2R = MAC(sum2R,vLo,c2); \
        sum1R = MAC(sum1R,vHi,-c2); \
        sum2R = MAC(sum2R,vHi,c1); \
    }

    void foo(const int *coef, int *vb1, short *out)
    {
        int vLo, vHi, c1, c2;
        Word64 sum1L = 0, sum2L = 0;
        Word64 sum1R = 0, sum2R = 0;

        SEQ(0); SEQ(1); SEQ(2); SEQ(3);
        SEQ(4); SEQ(5); SEQ(6); SEQ(7);

        out[0] = sat64_16(sum1L+sum2L);
        out[1] = sat64_16(sum1R+sum2R);
    }

In GCC 3.4, the optimizer has no problem knowing that every multiply is a mulsidi3. In GCC 4.1, the tree optimizer decides that the sign extend to DI for c1, c2, vLo, and vHi should be done into a DImode temporary that is fed to the MAC patterns, and combine doesn't convert them. Indeed, if I don't have a define_insn_and_split for DImode it doesn't even have a chance, because the RTL expander has already converted the DImode multiply into various SImode instructions.

So... how do I coax GCC 4.1 into liking postmodify and mulsidi again?
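For anyone who wants to poke at this without the full test case, both issues can be reduced to a few small functions. This is only a sketch: the function names (sum_indexed, sum_byte_offset, sum_postinc, mac_widening, mac_extended) are made up for illustration, the byte-offset loop is a hand-written approximation of what ivopts appears to produce rather than its actual output, and nothing here establishes which form a given GCC version will turn into a postincrement load or a mulsidi3.

```c
#include <stddef.h>

typedef long long int Word64;

/* Indexed form, as written in the source. */
int sum_indexed(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Byte-offset form: a hand-written approximation of the ivtmp loop,
 * with the induction variable stepping in bytes instead of elements. */
int sum_byte_offset(const int *a, size_t n)
{
    int sum = 0;
    const char *base = (const char *)a;
    for (size_t ivtmp = 0; ivtmp < n * sizeof(int); ivtmp += sizeof(int))
        sum += *(const int *)(base + ivtmp);
    return sum;
}

/* Pointer-walking form: the source shape closest to a load with
 * postincrement (hypothetical comparison point, not a known fix). */
int sum_postinc(const int *a, size_t n)
{
    int sum = 0;
    const int *end = a + n;
    while (a != end)
        sum += *a++;
    return sum;
}

/* 32x32->64 multiply-accumulate written the way the MAC macro does:
 * one operand is cast, the other is promoted by the multiplication. */
Word64 mac_widening(Word64 c, int a, int b)
{
    return c + (Word64)a * b;
}

/* Same arithmetic, but with both operands sign-extended into 64-bit
 * temporaries first, mimicking the DImode temporaries described above. */
Word64 mac_extended(Word64 c, int a, int b)
{
    Word64 a64 = a;
    Word64 b64 = b;
    return c + a64 * b64;
}
```

All three sum functions return the same value for the same input, and so do the two MAC variants, so any difference in the generated assembly comes down to how the compiler treats each shape. Compiling the file with -O2 -S under each GCC version and diffing the output should show whether the postincrement loads and the smull/smlal instructions survive.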
I've tried fiddling with rtx_costs for postmodify and multiply, and they should be accurate, but I get no love. What other things can I try to play with? Or is this sort of thing a known deficiency in 4.1 that I should try to work around?

I've attached a test for the latter case and the 3.4(.2) and 4.1(.1) assembly outputs for ARM, which exhibits this behavior. Note particularly the smull's and smlal's.

Thanks!
Erich
Attachments: simple.c, simple-3.4.s, simple-4.1.s