Hi Richard!

On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther <richard.guent...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 12:31 AM, Steve White
> <stevan.wh...@googlemail.com> wrote:
>> Hi,
>>
>> I run some tests of simple number-crunching loops whenever new
>> architectures and compilers arise.
>>
>> These tests on recent Intel architectures show similar performance
>> between gcc and icc compilers, at full optimization.
>>
>> However a recent test on x86_64 showed the open64 compiler
>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>> flags; nothing helped.
>
> Like -funroll-loops?
>
** Let's turn it around: What is a good set of flags, then, for improving
speed in simple loops such as these on x86_64?

In fact, I did try -funroll-loops and several others, but I somehow fooled
myself (maybe partly because, as I wrote, I was under the impression that
-O3 turned this on by default).

With -funroll-loops, the performance is improved a lot.

  $ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
  $ ./a.out
  double array mults by const 320 ms [ 1.013193]

That puts it only a factor of 2 slower than open64 at -O3.

Furthermore, -march=native improves it yet more.

  $ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic mults_by_const.c
  $ ./a.out
  double array mults by const 300 ms [ 1.013193]

Now it's only 70% slower than the open64 results.

I tried these flags:

  -floop-optimize -fmove-loop-invariants -fprefetch-loop-arrays -fprofile-use

but saw no further improvements.

So I drop my claim of knowing what the problem is (and repent of even
having tried before).

Simple searches on the web turn up a lot of experiments, but nothing
definitive.

FWIW, also attached is the whole assembler file generated with the above
settings. To my eye, the gcc assembler is a great deal more complicated and
does a lot more stuff, besides being slower.

Thanks!

mults_by_const.s.gz
Description: GNU Zip compressed data
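
P.S. The source of mults_by_const.c is not included in this message, so for readers without it, here is a minimal sketch of the kind of loop being timed. It is an assumption, not the original test: the function names mult_by_const and time_mults_ms, the in-place update, and the use of clock() for timing are all hypothetical.

```c
/* Hypothetical sketch only: NOT the original mults_by_const.c. */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Multiply every element of a[0..n-1] by the constant c, in place.
 * This is the simple loop the compiler is expected to unroll/vectorize. */
void mult_by_const(double *a, size_t n, double c)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= c;
}

/* Run `reps` passes over the array and return the elapsed CPU time in
 * milliseconds, as measured by clock(). */
double time_mults_ms(double *a, size_t n, double c, int reps)
{
    assert(a != NULL && reps > 0);

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        mult_by_const(a, n, c);
    clock_t stop = clock();

    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A caller would fill an array of doubles, call time_mults_ms on it, and print both the time and one array element, so that the compiler cannot discard the loop as dead code. Built with the same flags as above (--std=c99 -O3 -funroll-loops -march=native), this is the sort of loop where unrolling would be expected to matter.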