On Thu, Sep 8, 2011 at 3:20 PM, Richard Guenther
<richard.guent...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 3:09 PM, Steve White <stevan.wh...@googlemail.com>
> wrote:
>> Hi Richard!
>>
>> On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther
>> <richard.guent...@gmail.com> wrote:
>>> On Thu, Sep 8, 2011 at 12:31 AM, Steve White
>>> <stevan.wh...@googlemail.com> wrote:
>>>> Hi,
>>>>
>>>> I run some tests of simple number-crunching loops whenever new
>>>> architectures and compilers arise.
>>>>
>>>> These tests on recent Intel architectures show similar performance
>>>> between gcc and icc compilers, at full optimization.
>>>>
>>>> However a recent test on x86_64 showed the open64 compiler
>>>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>>>> flags; nothing helped.
>>>
>>> Like -funroll-loops?
>>>
>>
>> ** Let's turn it around:  What are a good set of flags then for
>> improving speed in simple loops such as these on the x86_64?
>>
>> In fact, I did try -funroll-loops and several others, but I somehow
>> fooled myself (Maybe partly because, as I wrote, I was under the
>> impression -O3 turned this on by default.)
>>
>> With -funroll-loops, the performance is improved a lot.
>>
>> $ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
>> $ ./a.out
>> double array mults by const 320 ms [ 1.013193]
>>
>> Which puts it only a factor of 2 slower than the open64 -O3.
>>
>> Furthermore, -march=native improves it yet more.
>>
>> $ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic mults_by_const.c
>> $ ./a.out
>> double array mults by const 300 ms [ 1.013193]
>>
>> Now it's only 70% slower than the open64 results.
>>
>> I tried these flags
>>    -floop-optimize -fmove-loop-invariants -fprefetch-loop-arrays
>>    -fprofile-use
>> but saw no further improvements.
>>
>> So I drop my claim of knowing what the problem is (and repent of even
>> having tried before.)
>>
>> Simple searches on the web turn up a lot of experiments, nothing definitive.
>>
>> FWIW, also attached is the whole assembler file generated with the
>> above settings.
>>
>> To my eye, the gcc assembler is a great deal more complicated, and
>> does a lot more stuff, besides being slower.
>
> opencc exchanged the loops
>
>    for( j = 0; j < ITERATIONS; j++ )
>      for( i = 0; i < size; i++ )
>        dvec[i] *= dval;
>
> to
>
>    for( i = 0; i < size; i++ )
>      for( j = 0; j < ITERATIONS; j++ )
>        dvec[i] *= dval;
>
> and then applies store-motion to end up with
>
>    for( i = 0; i < size; i++ )
>      {
>        double tem = dvec[i];
>        for( j = 0; j < ITERATIONS; j++ )
>          tem *= dval;
>        dvec[i] = tem;
>      }
>
> that's obviously better for the cache.  GCC can do the same
> when you enable -ftree-loop-linear but then it confuses itself
> enough to no longer vectorize the loop.
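
For reference, the interchange plus store-motion described above can be written out by hand. This is only a sketch: the function names scale_sweeps and scale_interchanged are made up for illustration, and the iteration count and array size are passed as parameters rather than taken from mults_by_const.c, whose full source is not reproduced in this message.

```c
#include <stddef.h>

/* Original loop nest: each of the `iterations` passes sweeps the
 * whole array, so every element is loaded and stored once per pass. */
void scale_sweeps(double *dvec, size_t size, int iterations, double dval)
{
    for (int j = 0; j < iterations; j++)
        for (size_t i = 0; i < size; i++)
            dvec[i] *= dval;
}

/* Loops interchanged, with the load and store moved out of the inner
 * loop: each element is loaded once, multiplied `iterations` times in
 * a local variable, and stored once. */
void scale_interchanged(double *dvec, size_t size, int iterations, double dval)
{
    for (size_t i = 0; i < size; i++) {
        double tem = dvec[i];
        for (int j = 0; j < iterations; j++)
            tem *= dval;
        dvec[i] = tem;
    }
}
```

Calling both on copies of the same array should give matching results, since each element still sees the same sequence of multiplications; the second form only changes how often memory is touched.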

I opened http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50328 and put in a
patch to get the loop vectorized.  But opencc also unrolls the outer loop,
which we can't yet do.

Richard.