On Thu, Sep 8, 2011 at 3:20 PM, Richard Guenther
<richard.guent...@gmail.com> wrote:
> On Thu, Sep 8, 2011 at 3:09 PM, Steve White <stevan.wh...@googlemail.com> 
> wrote:
>> Hi Richard!
>>
>> On Thu, Sep 8, 2011 at 11:02 AM, Richard Guenther
>> <richard.guent...@gmail.com> wrote:
>>> On Thu, Sep 8, 2011 at 12:31 AM, Steve White
>>> <stevan.wh...@googlemail.com> wrote:
>>>> Hi,
>>>>
>>>> I run some tests of simple number-crunching loops whenever new
>>>> architectures and compilers arise.
>>>>
>>>> These tests on recent Intel architectures show similar performance
>>>> between gcc and icc compilers, at full optimization.
>>>>
>>>> However a recent test on x86_64 showed the open64 compiler
>>>> outstripping gcc by a factor of 2 to 3.  I tried all the obvious
>>>> flags; nothing helped.
>>>
>>> Like -funroll-loops?
>>>
>>
>> ** Let's turn it around:  What is a good set of flags, then, for
>> improving speed in simple loops such as these on x86_64?
>>
>> In fact, I did try -funroll-loops and several others, but I somehow
>> fooled myself (maybe partly because, as I wrote, I was under the
>> impression that -O3 turned this on by default).
>>
>> With -funroll-loops, the performance is improved a lot.
>>
>> $ gcc --std=c99 -O3 -funroll-loops -Wall -pedantic mults_by_const.c
>> $ ./a.out
>> double array mults by const             320 ms [  1.013193]
>>
>> Which puts it only a factor of 2 slower than the open64 -O3.
>>
>> Furthermore, -march=native improves it yet more.
>>
>> $ gcc --std=c99 -O3 -funroll-loops -march=native -Wall -pedantic mults_by_const.c
>> $ ./a.out
>> double array mults by const             300 ms [  1.013193]
>>
>> Now it's only 70% slower than the open64 results.
>>
>> I tried these flags
>>   -floop-optimize -fmove-loop-invariants -fprefetch-loop-arrays -fprofile-use
>> but saw no further improvements.
>>
>> So I drop my claim of knowing what the problem is (and repent of even
>> having tried before).
>>
>> Simple searches on the web turn up a lot of experiments, nothing definitive.
>>
>> FWIW, also attached is the whole assembler file generated with the
>> above settings.
>>
>> To my eye, the gcc assembler is a great deal more complicated, and
>> does a lot more stuff, besides being slower.
>
> opencc exchanged the loops
>
>     for( j = 0; j < ITERATIONS; j++ )
>         for( i = 0; i < size; i++ )
>             dvec[i] *= dval;
>
> to
>
>     for( i = 0; i < size; i++ )
>         for( j = 0; j < ITERATIONS; j++ )
>             dvec[i] *= dval;
>
> and then applied store-motion to end up with
>
>     for( i = 0; i < size; i++ )
>     {
>         double tem = dvec[i];
>         for( j = 0; j < ITERATIONS; j++ )
>             tem *= dval;
>         dvec[i] = tem;
>     }
>
> That's obviously better for the cache.  GCC can do the same
> when you enable -ftree-loop-linear, but then it confuses itself
> enough to no longer vectorize the loop.

I opened http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50328 and
put in a patch to get the loop vectorized.  But opencc unrolls the
outer loop, which we can't yet do.
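
For anyone who wants to try the interchange by hand, here is a minimal
sketch of the two loop forms discussed above.  It is not taken from the
attached mults_by_const.c: the function names, the max_abs_diff helper and
the ITERATIONS value are placeholders chosen for illustration.  Each element
sees the same sequence of multiplications in both versions, so the results
should agree up to floating-point evaluation details.

```c
#include <stddef.h>

/* Placeholder value; the benchmark's real ITERATIONS is not shown in the thread. */
#define ITERATIONS 1000

/* Original order: each pass of the outer loop streams over the whole array,
 * so every element is loaded and stored ITERATIONS times. */
void scale_original(double *dvec, size_t size, double dval)
{
    for (int j = 0; j < ITERATIONS; j++)
        for (size_t i = 0; i < size; i++)
            dvec[i] *= dval;
}

/* Interchanged loops with the store moved out of the inner loop:
 * each element is loaded once, scaled in a local, and stored once. */
void scale_interchanged(double *dvec, size_t size, double dval)
{
    for (size_t i = 0; i < size; i++) {
        double tem = dvec[i];
        for (int j = 0; j < ITERATIONS; j++)
            tem *= dval;
        dvec[i] = tem;
    }
}

/* Largest absolute element-wise difference between two arrays. */
double max_abs_diff(const double *a, const double *b, size_t size)
{
    double worst = 0.0;
    for (size_t i = 0; i < size; i++) {
        double d = a[i] - b[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Filling two identical arrays, running one function on each, and checking
max_abs_diff against a small tolerance is enough to confirm the two forms
are equivalent before timing them.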

Richard.
