3.4 vs. 4.1 performance issues

Erich Plondke Tue, 26 Sep 2006 11:03:05 -0700

I've noticed while tinkering with 3.4 and 4.1 that some
code sequences turn out much better in 4.1.  However, other
code sequences turn out substantially worse in 4.1.


The most frustrating is the reduction in use of postmodify
addressing modes.  It looks like tree-ssa-loop-ivopts converts
a loop like:

       for (i = 0; i < MAX; i++) {
               sum += a[i];
       }

into something like:

       for (ivtmp = 0; ivtmp < MAX*4; ivtmp += 4)
       {
               sum += *(a+ivtmp);
       }

which is fine, except that by the time we get to RTL, GCC 3.4
converts the load in the first loop form into a load with
postincrement, while in 4.1 the RTL optimizers turn the address
computation of the second form into an add of ivtmp and 4, an
add of ivtmp and a, and a load.
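
To make the comparison concrete, here is a minimal sketch of the three loop shapes involved. The function names are made up for illustration, and the byte-offset version only approximates what ivopts produces. The pointer-walking version is the form that maps most directly onto a postincrement load; whether 4.1 keeps it that way on a given target is not guaranteed, so treat it as something to experiment with rather than a fix.

```c
#include <stddef.h>

/* Indexed form, as in the original source loop. */
int sum_indexed(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Byte-offset form, roughly the shape of the ivopts rewrite:
   the induction variable counts bytes instead of elements. */
int sum_byte_offset(const int *a, size_t n)
{
    int sum = 0;
    for (size_t ivtmp = 0; ivtmp < n * sizeof(int); ivtmp += sizeof(int))
        sum += *(const int *)((const char *)a + ivtmp);
    return sum;
}

/* Pointer-walking form: *p++ is a load followed by a pointer bump,
   which is what a postincrement addressing mode implements. */
int sum_postinc(const int *a, size_t n)
{
    int sum = 0;
    const int *p = a;
    const int *end = a + n;
    while (p != end)
        sum += *p++;
    return sum;
}
```

All three compute the same sum; only the address arithmetic presented to the optimizers differs.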


Similarly, I can do mulsidi3 fast but muldi3 not so fast.
If I have a code sequence like:

typedef long long int Word64;

extern short sat64_16(Word64 x);

#define MAC(C,A,B) ((C) + ((Word64)(A) * (B)))

#define SEQ(X) { \
   c1 = *coef; coef++; \
   c2 = *coef; coef++; \
   vLo = *(vb1+(X)); \
   vHi = *(vb1+(23-(X))); \
   sum1L = MAC(sum1L,vLo,c1); \
   sum2L = MAC(sum2L,vLo,c2); \
   sum1L = MAC(sum1L,vHi,-c2); \
   sum2L = MAC(sum2L,vHi,c1); \
   vLo = *(vb1+32+(X)); \
   vHi = *(vb1+32+(23-(X))); \
   sum1R = MAC(sum1R,vLo,c1); \
   sum2R = MAC(sum2R,vLo,c2); \
   sum1R = MAC(sum1R,vHi,-c2); \
   sum2R = MAC(sum2R,vHi,c1); \
   }

void foo(const int *coef, int *vb1, short *out) {
   int vLo, vHi, c1, c2;
   Word64 sum1L = 0, sum2L = 0;
   Word64 sum1R = 0, sum2R = 0;

   SEQ(0);
   SEQ(1);
   SEQ(2);
   SEQ(3);
   SEQ(4);
   SEQ(5);
   SEQ(6);
   SEQ(7);
   out[0] = sat64_16(sum1L+sum2L);
   out[1] = sat64_16(sum1R+sum2R);
}



In GCC 3.4, the optimizer has no problem knowing that every
multiply is a mulsidi3.  In GCC 4.1, the tree optimizer
decides that the sign extend to DI for c1, c2, vLo, and vHi
should be done into a DImode temporary that is fed to the
MAC patterns, and combine doesn't convert them.  Indeed,
if I don't have a define_insn_and_split for DImode it doesn't
even have a chance, because the RTL expander has already
converted the DImode multiply into various SImode instructions.
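
As a reduced illustration of the two shapes described above, here is a sketch with hypothetical function names, using int32_t for the SImode operands (an assumption; the test case uses plain int). Both functions return the same value. The difference is only where the sign extension happens relative to the multiply, which is what decides whether the multiply still looks like a 32x32->64 operation by the time it is expanded.

```c
#include <stdint.h>

typedef int64_t Word64;

/* Extension written at the multiply itself: the shape a widening
   multiply (mulsidi3-style) pattern is meant to match. */
Word64 mac_widening(Word64 c, int32_t a, int32_t b)
{
    return c + (Word64)a * b;
}

/* Operands extended into 64-bit temporaries first. The result is
   identical, but the multiply, seen in isolation, is a full 64x64
   multiply unless something later proves the inputs were extended. */
Word64 mac_pre_extended(Word64 c, int32_t a, int32_t b)
{
    Word64 wa = a;
    Word64 wb = b;
    return c + wa * wb;
}
```

Comparing the assembly for these two small functions may be a quicker way to see the difference than reading the full unrolled test case.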

So... how do I coax GCC 4.1 into liking postmodify and mulsidi again?  I've
tried fiddling with rtx_costs for postmodify and multiply, and they should be
accurate, but I get no love.  What other things can I try to play with?  Or
is this sort of thing a known deficiency in 4.1 that I should try to
work around?

I've attached a test for the latter case and the 3.4(.2) and 4.1(.1)
assembly outputs for ARM, which exhibit this behavior.  Note
particularly the smull's and smlal's.

Thanks!

   Erich

simple.c
Description: Binary data

simple-3.4.s
Description: Binary data

simple-4.1.s
Description: Binary data
