Guys,

I'm trying to make the compiler generate better code for a superscalar
in-order machine, but I can't find the right way to do it.

Imagine the following code:

long f(long* p, long a, long b)
{
  long a1 = a << 2;
  long a2 = a1 + b;
  return p[a1] + p[a2];
}

By default the compiler generates something like this (in pseudo-asm):

        shl     r3, r3, 2
        add     r4, r3, r4
        ld8     r15, [r2 + r3 * 8]
        ld8     r2,  [r2 + r4 * 8]
        {   add     r2, r2, r15  ;  ret  }

but it would be much better scheduled this way:

  {   sh_add  r4, r4, (r3 << 2)   ;  shl  r3, r3, 2          }
  {   ld8     r15, [r2 + r3 * 8]  ;  ld8  r2, [r2 + r4 * 8]  }
  {   add     r2, r2, r15         ;  ret                     }

The second sequence is two cycles shorter. The combine pass even forms
a pattern like this, but fails to transform it because it is wrapped
in a parallel:

Failed to match this instruction:
(parallel [
        (set (reg:DI 56)
            (plus:DI (mult:DI (reg:DI 3 r3 [ a ])
                    (const_int 4 [0x4]))
                (reg:DI 4 r4 [ b ])))
        (set (reg/v:DI 40 [ a1 ])
            (ashift:DI (reg:DI 3 r3 [ a ])
                (const_int 2 [0x2])))
    ])
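
If I read this right, combine would succeed if my port had an insn
matching that two-set parallel. I guess the fix is something like the
following untested sketch in the machine description (the predicates,
constraints and output template are approximations for my pseudo-ISA):

```lisp
;; Untested sketch: bundled shift-add plus shift, matching the
;; two-set parallel that combine produces above.  The condition
;; checks that the multiplier is the power of two implied by the
;; shift count.
(define_insn "*shadd_and_shift"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (plus:DI (mult:DI (match_operand:DI 1 "register_operand" "r")
                          (match_operand:DI 2 "const_int_operand" "n"))
                 (match_operand:DI 3 "register_operand" "r")))
   (set (match_operand:DI 4 "register_operand" "=r")
        (ashift:DI (match_dup 1)
                   (match_operand:DI 5 "const_int_operand" "n")))]
  "INTVAL (operands[2]) == (HOST_WIDE_INT_1 << INTVAL (operands[5]))"
  "{ sh_add %0, %3, (%1 << %5) ; shl %4, %1, %5 }")
```

But hand-enumerating every such parallel in the .md file doesn't feel
like the intended general mechanism, which is why I'm asking.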

What would be the proper way to perform reorganizations like this in a
general way?

The same goes for pointer increments:

add r2, r2, 1
ld r3, [r2+0]

would be much better off like this:

{ ld r3, [r2 + 1] ; add r2, r2, 1 }
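
My guess is this second case belongs to the auto-inc-dec pass: since
the load sees the already-bumped pointer, this is a pre-increment, so
presumably the port needs HAVE_PRE_INCREMENT defined,
TARGET_LEGITIMATE_ADDRESS_P accepting (pre_inc ...) addresses, and an
insn carrying the side effect. Roughly, as another untested sketch
(QImode chosen because pre_inc steps by the access size, which is 1
here; the way I express the read/written address register may well be
wrong):

```lisp
;; Untested sketch: byte load with pre-increment side effect on the
;; address register.
(define_insn "*ldqi_pre_inc"
  [(set (match_operand:QI 0 "register_operand" "=r")
        (mem:QI (pre_inc:DI (match_operand:DI 1 "register_operand" "+r"))))]
  ""
  "{ ld %0, [%1 + 1] ; add %1, %1, 1 }")
```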

Are these kinds of things simply not handled, or did I fail to set
something up in the machine-dependent portion?

Thanks a lot for your thoughts
