On Sun, 2013-08-18 at 00:55 -0400, Asm Twiddler wrote:
> Hello all,
>
> I'm not sure whether this has been posted before, but gcc creates
> slightly inefficient code for large integers in several cases:
>
I'm not sure what the actual question is. Bug reports and enhancement
suggestions of that kind usually go to Bugzilla, and you should also
specify which compiler version you're referring to.
Anyway, I've tried your examples on SH (4.9), which also does 64-bit
operations with stitched 32-bit ops.

> unsigned long long val;
>
> void example1() {
>     val += 0x800000000000ULL;
> }
>
> On x86 this results in the following assembly:
>     addl $0, val
>     adcl $32768, val+4
>     ret

This is probably because, if a target defines plus:DI / minus:DI
patterns (which is most likely the case, because of carry / borrow bit
handling peculiarities), these kinds of zero-bits special cases will not
be handled automatically. Another example would be:

unsigned long long example11 (unsigned long long val, unsigned long x)
{
  val += (unsigned long long)x << 32;
  return val;
}

> The first add is unnecessary as it shouldn't modify val or set the carry.
> This isn't too bad, but compiling for a something like AVR, results in
> 8 byte loads, followed by three additions (of the high bytes),
> followed by another 8 byte saves.
> The compiler doesn't recognize that 5 of those loads and 5 of those
> saves are unnecessary.

This is probably for the same or a similar reason as mentioned above.
I've tried the following:

void example4 (unsigned long long* x)
{
  *x |= 1;
}

and it results in:
        mov.l   @(4,r4),r0
        or      #1,r0
        rts
        mov.l   r0,@(4,r4)

So I guess the fundamental subreg load/store handling seems to work.

> Here is another inefficiency for x86:
>
> unsigned long long val = 0;
> unsigned long small = 0;
>
> unsigned long long example1() {
>     return val | small;
> }
>
> unsigned long long example2() {
>     return val & small;
> }
>
> The RTL's generated for example1 and example2 are very similar until
> the fwprop1 stage.
> Since the largest word size on x86 is 4 bytes, each operation is
> actually split into two.
> The forward propagator correctly realizes that anding the upper 4
> bytes results in a zero.
> However, it doesn't seem to recognize that oring the upper 4 bytes
> should return val's high word.
> This problem also occurs in the xor operation, and also when
> subtracting (val - small).

In my case the double ior:SI and and:SI operations are eliminated in the
.cse1 pass and the resulting code is optimal.

My impression is that the stitched multiword add/sub thing could be
addressed in a target independent way so that it would work for all
affected targets automatically. The other issues seem to be individual
target problems.

Cheers,
Oleg
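
For illustration only, the two special cases discussed in this thread can be written out by hand in C. This is a minimal sketch and not what GCC does internally; the type `words64` and the helpers `split64`, `join64`, `add_high_only` and `or_low_only` are hypothetical names made up for this example. It assumes a 64-bit value handled as two 32-bit words, as on the targets mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split-word view of a 64-bit value (names are illustrative). */
typedef struct { uint32_t lo, hi; } words64;

static words64 split64(uint64_t v)
{
    words64 w = { (uint32_t)v, (uint32_t)(v >> 32) };
    return w;
}

static uint64_t join64(words64 w)
{
    return ((uint64_t)w.hi << 32) | w.lo;
}

/* val += c, where the low 32 bits of c are zero.  Adding zero to the low
   word cannot change it or produce a carry, so only the high word needs
   an add (the "addl $0" in the quoted x86 output is redundant). */
static uint64_t add_high_only(uint64_t val, uint64_t c)
{
    assert((uint32_t)c == 0);      /* precondition of the special case */
    words64 w = split64(val);
    w.hi += (uint32_t)(c >> 32);   /* wraps modulo 2^32, like the full add */
    return join64(w);
}

/* val | small, where small is only 32 bits wide.  The high word of the
   zero-extended operand is zero, so val's high word passes through. */
static uint64_t or_low_only(uint64_t val, uint32_t small)
{
    words64 w = split64(val);
    w.lo |= small;
    return join64(w);
}
```

For the constant from the first example, 0x800000000000ULL, the high word is 0x8000 (32768), which matches the immediate in the quoted `adcl`.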