Hello,

I was playing around with 64-bit arithmetic with -m32 enabled and
encountered some strange optimization behaviour in what I thought was
a very simple case.

My test function, which I appreciate is totally artificial, is as
follows:

#include <cstdint>

uint64_t sum(uint64_t a, uint64_t b) {
    return a + b;
}
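
For reference, the listings below come from roughly this invocation
(-fverbose-asm adds the "# b, b"-style comments and -masm=intel
selects Intel syntax):

g++ -m32 -O2 -S -fverbose-asm -masm=intel 64-m32-example.cpp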

This is obviously a single instruction on a 64-bit machine, but I
compiled with -m32 -O2 to see how the compiler would emulate the
addition. By default I see the behaviour I expect: a 64-bit addition
emulated using only the 32-bit registers:

 # 64-m32-example.cpp:6:     return a + b;
        mov     eax, DWORD PTR [esp+12] # b, b
        add     eax, DWORD PTR [esp+4]  # tmp90, a
        mov     edx, DWORD PTR [esp+16] # b, b
        adc     edx, DWORD PTR [esp+8]  #, a
 # 64-m32-example.cpp:7: }
        ret
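
The add/adc pair is the classic two-word carry chain: add the low
halves, then add the high halves plus the carry out of the low
addition. In portable C the same computation looks roughly like this
(purely illustrative, names made up):

#include <cstdint>

uint64_t sum_emulated(uint32_t a_lo, uint32_t a_hi,
                      uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo = a_lo + b_lo;          // add:  low halves
    uint32_t carry = lo < a_lo;         // carry out of the low add
    uint32_t hi = a_hi + b_hi + carry;  // adc:  high halves + carry
    return ((uint64_t)hi << 32) | lo;
}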

However, when I compile with -m32 -O2 -march=broadwell (or
-march=native), I see the following code generated instead:

        vmovq   xmm1, QWORD PTR [esp+12]        # b, b
 # 64-m32-example.cpp:6:     return a + b;
        vmovq   xmm0, QWORD PTR [esp+4] # tmp92, a
        vpaddq  xmm0, xmm0, xmm1        # tmp90, tmp92, b
        vmovd   eax, xmm0       # tmp93, tmp90
        vpextrd edx, xmm0, 1    # tmp94, tmp90,
 # 64-m32-example.cpp:7: }
        ret
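
For comparison, the same dataflow written with SSE intrinsics would
look something like this (illustrative only, GCC emits the sequence
directly; note that vpextrd/_mm_extract_epi32 requires SSE4.1):

#include <cstdint>
#include <immintrin.h>

uint64_t sum_simd(const uint64_t *a, const uint64_t *b) {
    __m128i va = _mm_loadl_epi64((const __m128i *)a); // vmovq
    __m128i vb = _mm_loadl_epi64((const __m128i *)b); // vmovq
    __m128i vs = _mm_add_epi64(va, vb);               // vpaddq
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(vs);    // vmovd
    uint32_t hi = (uint32_t)_mm_extract_epi32(vs, 1); // vpextrd
    return ((uint64_t)hi << 32) | lo;
}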

I found it fascinating that using the SIMD instructions could be the
optimal approach for such a simple function, so I ran a microbenchmark
using hayai (sketched below), and the results are quite interesting.
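
The benchmark body is essentially just a call to sum(); a minimal
hayai harness along these lines (a sketch rather than my exact code;
the volatile sink is there to stop the call being optimised away):

#include <cstdint>
#include <hayai.hpp>

uint64_t sum(uint64_t a, uint64_t b);  // function under test

BENCHMARK(Sum64, Add, 10, 1000000)  // 10 runs x 1,000,000 iterations
{
    volatile uint64_t sink = sum(0x123456789abcdef0ULL, 42);
    (void)sink;
}

int main()
{
    hayai::ConsoleOutputter consoleOutputter;
    hayai::Benchmarker::AddOutputter(consoleOutputter);
    hayai::Benchmarker::RunAllTests();
    return 0;
}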

For the simple case, using mov, add and adc, each run is so fast that
it falls below the resolution of my benchmark, which reports 'inf'
runs/s:
----------
Average time: 0.006 us (~0.095 us)
Fastest time: 0.000 us (-0.006 us / -100.000 %)
Slowest time: 3.958 us (+3.952 us / +68209.689 %)
 Median time: 0.000 us (1st quartile: 0.000 us | 3rd quartile: 0.000 us)
Average performance: 172586379.48293 runs/s
   Best performance: inf runs/s (+inf runs/s / +inf %)
  Worst performance: 252652.85498 runs/s (-172333726.62795 runs/s / -99.85361 %)
 Median performance: inf runs/s (1st quartile: inf | 3rd quartile: inf)
----------


For the code using xmm0 and xmm1:
----------
Average time: 24.901 us (~1.144 us)
Fastest time: 23.867 us (-1.034 us / -4.153 %)
Slowest time: 61.867 us (+36.966 us / +148.451 %)
 Median time: 24.867 us (1st quartile: 24.867 us | 3rd quartile: 24.867 us)
Average performance: 40158.86848 runs/s
   Best performance: 41898.85616 runs/s (+1739.98768 runs/s / +4.33276 %)
  Worst performance: 16163.70601 runs/s (-23995.16247 runs/s / -59.75059 %)
 Median performance: 40213.93815 runs/s (1st quartile: 40213.93815 | 3rd quartile: 40213.93815)
----------


Can anyone explain why the optimizer is producing this output? I don't
pretend to have any knowledge of how these decisions are made; perhaps
the SIMD instructions offer higher throughput that is invisible in my
contrived benchmark?

For reference, I am compiling everything using gcc trunk, at commit
254929 from Sun Nov 19, and I am benchmarking on a Skylake i7-6700K
at 4.0GHz.

Thanks,

Richard
