On Sun, 2013-08-18 at 00:55 -0400, Asm Twiddler wrote:
> Hello all,
> 
> I'm not sure whether this has been posted before, but gcc creates
> slightly inefficient code for large integers in several cases:
> 

I'm not sure what the actual question is.
Bug reports and enhancement suggestions of that kind usually go to
Bugzilla, and you should also specify which compiler version you're
referring to.

Anyway, I've tried your examples on SH (GCC 4.9), which also does 64 bit
operations with stitched 32 bit ops.

> unsigned long long val;
> 
> void example1() {
>     val += 0x800000000000ULL;
> }
> 
> On x86 this results in the following assembly:
> addl $0, val
> adcl $32768, val+4
> ret

This is probably because, if a target defines its own plus:DI / minus:DI
patterns (which is most likely the case, because of carry / borrow
bit handling peculiarities), these kinds of zero-bits special cases will
not be handled automatically.
Another example would be:

unsigned long long example11 (unsigned long long val, unsigned long x)
{
  val += (unsigned long long)x << 32;
  return val;
}
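
To make the special case concrete, here is a minimal sketch (not taken from
any GCC source) of what the hand-split version of the original example could
look like.  The helper names add_full, add_high_only and
check_add_equivalence are made up for illustration, and the split assumes
32 bit words:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: plain 64 bit add of a constant whose low 32 bits are zero. */
static uint64_t add_full (uint64_t val)
{
  return val + 0x800000000000ULL;
}

/* Hand-split version (hypothetical helper): the low word is left alone,
   because adding 0 to it can neither change it nor produce a carry.
   Only the high word gets an add.  */
static uint64_t add_high_only (uint64_t val)
{
  uint32_t lo = (uint32_t)val;
  uint32_t hi = (uint32_t)(val >> 32);

  hi += 0x8000u;  /* 0x800000000000 >> 32 == 0x8000 == 32768 */

  return ((uint64_t)hi << 32) | lo;
}

/* Both versions agree, including when the high word wraps around.  */
static void check_add_equivalence (void)
{
  assert (add_full (0) == add_high_only (0));
  assert (add_full (0xFFFFFFFFULL) == add_high_only (0xFFFFFFFFULL));
  assert (add_full (UINT64_MAX) == add_high_only (UINT64_MAX));
}
```

The single add on the high word corresponds to the "adcl $32768, val+4" in
the quoted x86 output, minus the preceding "addl $0, val".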

> The first add is unnecessary as it shouldn't modify val or set the carry.
> This isn't too bad, but compiling for a something like AVR, results in
> 8 byte loads, followed by three additions (of the high bytes),
> followed by another 8 byte saves.
> The compiler doesn't recognize that 5 of those loads and 5 of those
> saves are unnecessary.

This is probably because of the same or similar reason as mentioned
above.  I've tried the following:

void example4 (unsigned long long* x)
{
  *x |= 1;
}

and it results in:
        mov.l   @(4,r4),r0
        or      #1,r0   
        rts
        mov.l   r0,@(4,r4)

So the fundamental subreg load/store handling seems to work.


> Here is another inefficiency for x86:
> 
> unsigned long long val = 0;
> unsigned long small = 0;
> 
> unsigned long long example1() {
>     return val | small;
> }
> 
> unsigned long long example2() {
>     return val & small;
> }
> 
> The RTL's generated for example1 and example2 are very similar until
> the fwprop1 stage.
> Since the largest word size on x86 is 4 bytes, each operation is
> actually split into two.
> The forward propagator correctly realizes that anding the upper 4
> bytes results in a zero.
> However, it doesn't seem to recognize that oring the upper 4 bytes
> should return val's high word.
> This problem also occurs in the xor operation, and also when
> subtracting (val - small).

In my case the double ior:SI and and:SI operations are eliminated in
the .cse1 pass and the resulting code is optimal.
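
For reference, the identities such a pass would need to exploit can be
written down as a small sketch.  The names high_word and
check_high_word_identities are hypothetical, and uint32_t / uint64_t stand
in for the 32 bit "unsigned long" and 64 bit "unsigned long long" of the
quoted x86 example:

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64 bit value.  */
static uint32_t high_word (uint64_t v)
{
  return (uint32_t)(v >> 32);
}

/* When a 32 bit value is zero-extended to 64 bits, its high word is 0.
   So for the high word: and gives 0, while ior and xor just give back
   val's high word.  */
static void check_high_word_identities (uint64_t val, uint32_t small)
{
  uint64_t wide = small;  /* zero-extended */

  assert (high_word (val & wide) == 0);
  assert (high_word (val | wide) == high_word (val));
  assert (high_word (val ^ wide) == high_word (val));
}
```

Subtraction is deliberately left out of the sketch, since a borrow from the
low word can still change the high word there.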

My impression is that the stitched multiword add/sub thing could be
addressed in a target independent way so that it would work for all
affected targets automatically.
The other issues seem to be individual target problems.

Cheers,
Oleg
