[Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64

burnus at gcc dot gnu.org Thu, 26 Sep 2013 00:27:21 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529


Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
            Summary|Loop 30% faster with Intel  |GCC -funroll-loops 150%
                   |than with GCC               |slower with -march=native
                   |                            |on x86-64

--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Tobias Burnus from comment #8)
> I have to re-check why unrolling made it slower on that Xeon E5-2630
> (comment 0) but faster on the i5.

This seems to be a tuning problem. All timings below were taken on the Xeon
E5-2630; the builds differ only in whether they use the -march=native
expansion from the i5 or the -march=native expansion from the Xeon E5:

real 1.530s  user 1.528s  sys 0.000s i5,   no unrolling
real 1.483s  user 1.481s  sys 0.000s Xeon, no unrolling
real 0.937s  user 0.934s  sys 0.002s i5,   -funroll-loops
real 2.480s  user 2.478s  sys 0.000s Xeon, -funroll-loops
real 0.935s  user 0.934s  sys 0.000s Xeon, -funroll-loops --param max-unroll-times=7
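
To make the relative differences in the table above easier to read, here is a
small sketch that computes them from the "real" column. The dictionary keys
and the helper name percent_slower are made up for illustration; only the
numbers come from the timings above.

```python
# Wall-clock ("real") times in seconds, copied from the table above.
# All runs were on the Xeon E5-2630; the label names the flag set used.
TIMINGS = {
    "i5 flags, no unrolling": 1.530,
    "Xeon flags, no unrolling": 1.483,
    "i5 flags, -funroll-loops": 0.937,
    "Xeon flags, -funroll-loops": 2.480,
    "Xeon flags, -funroll-loops, max-unroll-times=7": 0.935,
}


def percent_slower(t, baseline):
    """Return how much slower t is than baseline, in percent.

    A negative result means t is faster than baseline.
    """
    return (t / baseline - 1.0) * 100.0


if __name__ == "__main__":
    base = TIMINGS["i5 flags, -funroll-loops"]
    for label, seconds in TIMINGS.items():
        print(f"{label:50s} {seconds:.3f}s  "
              f"{percent_slower(seconds, base):+7.1f}% vs. i5 unrolled")
```

With these numbers, the unrolled build using the Xeon flags is well over twice
as slow as the unrolled build using the i5 flags, while limiting
max-unroll-times to 7 brings it back to the same level.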

The i5's -march=native expands into:
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt  -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i

The Xeon's -march=native expands into:
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx

That is, the two expansions differ only in:
i5:   -march=core-avx-i -mrdrnd    -mf16c    -mfsgsbase
      --param l2-cache-size=6144  -mtune=core-avx-i
Xeon: -march=corei7-avx -mno-rdrnd -mno-f16c -mno-fsgsbase
      --param l2-cache-size=15360 -mtune=corei7-avx
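
The comparison of the two expansions can be reproduced mechanically. The
following is a minimal sketch, not part of GCC; the helper names parse_flags
and diff_flags are hypothetical. It keeps each "--param name=value" pair
together as one option and reports the options unique to each side.

```python
I5_FLAGS = """
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt  -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i
"""

XEON_FLAGS = """
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx
"""


def parse_flags(text):
    """Split a flag string into options, keeping '--param name=value' as one item."""
    tokens = text.split()
    flags = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "--param" and i + 1 < len(tokens):
            flags.append("--param " + tokens[i + 1])
            i += 2
        else:
            flags.append(tokens[i])
            i += 1
    return flags


def diff_flags(a, b):
    """Return (options only in a, options only in b), preserving input order."""
    set_a, set_b = set(a), set(b)
    return [f for f in a if f not in set_b], [f for f in b if f not in set_a]


if __name__ == "__main__":
    only_i5, only_xeon = diff_flags(parse_flags(I5_FLAGS), parse_flags(XEON_FLAGS))
    print("i5:  ", " ".join(only_i5))
    print("Xeon:", " ".join(only_xeon))
```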
