------- Comment #9 from michaelni at gmx dot at  2008-03-23 02:49 -------
Subject: Re:  Performance degradation when
        building code that uses MMX intrinsics with gcc-4.0.0

On Sat, Mar 22, 2008 at 11:01:55AM -0000, ubizjak at gmail dot com wrote:
> 
> 
> ------- Comment #8 from ubizjak at gmail dot com  2008-03-22 11:01 -------
> (In reply to comment #6)
> > As Uros has "challenged me to beat performance of gcc-4.4 generated code by
> > hand-crafted assembly using the example of PR 21395", here's my entry. Sadly I
> > only have gcc-4.3 compiled ATM for comparison, but 4.3 generates better code
> > than 4.4, so I guess that's OK. Its inner loop is:
> 
> Not!
> 
> This is the comparison of runtimes for the original test, comparing 4.3.0 vs
> 4.4.0 compiled code on core2D EE:
> 
> $ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
> 
> real    0m0.619s
> user    0m0.620s
> sys     0m0.000s
> 
> $ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
> 
> real    0m0.398s
> user    0m0.400s
> sys     0m0.000s

On my Duron with -O2 -mmmx I get:
g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144

real    0m2.077s
user    0m1.912s
sys     0m0.019s


g++-4.4 (GCC) 4.4.0 20080321 (experimental)
144

real    0m2.172s
user    0m2.004s
sys     0m0.021s


With -m32 -march=core2 (incorrect, as it doesn't match the CPU!):
g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144

real    0m3.644s
user    0m3.389s
sys     0m0.022s


g++-4.4 (GCC) 4.4.0 20080321 (experimental)
Illegal instruction         (yes, yes, I know I asked for it)

real    0m0.011s
user    0m0.003s
sys     0m0.007s


So on my Duron, 4.3 seems to beat 4.4 by roughly 5% (user 1.912s vs. 2.004s
with -O2 -mmmx), as I expected from the generated asm.



> 
> gcc 4.4.0 with your modified computation kernel:
> 
> $ g++ -m32 -march=core2 -O2 mmx-1.cpp
> $ time ./a.out
> 144
> 
> real    0m0.309s
> user    0m0.308s
> sys     0m0.000s
> 
> To be honest, I didn't expect you to completely rewrite the computation kernel,
> so we are comparing apples to oranges.

Well, nothing stops gcc from rewriting the intrinsics either :)


> However, you can rewrite your ASM code
> using intrinsic functions from __mmintrin.h, and you will get all optimizations
> (scheduling, unrolling, etc) for free, while you are still in control of code
> generation on a fairly low level. Using intrinsics, you leave to the compiler
> things that the compiler is good at (loop handling, register allocation,
> scheduling).
> 
> Are you interested in this experiment? 

I am certainly interested, but I am a little busy with Google Summer of Code
students currently. We have to choose wisely which applications and students
we select for ffmpeg this summer ... that means a lot of code reviewing of
what the students submit as qualification tasks ...
So I won't rewrite this in intrinsics, at least not anytime soon.


> The results of this experiment would
> perhaps be interesting to ffmpeg people to consider rewriting their asm blocks
> into intrinsics.

Well ...
I am not a friend of intrinsics, but I think you guessed that already :)
The thing I like about asm() is that it produces the same performance and code
with every compiler. It's largely a write-once-and-forget thing. A problem
with asm() is almost always of the compile-time error sort, like
"can't find register in class blah"; these things are visible and can be dealt
with ...
With intrinsics it's all a gamble; just look at this PR and how hugely
performance differs between gcc versions. If ffmpeg were using intrinsics
instead of asm, we would have to spend considerable time dealing with such
variations somehow.
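
To make the asm() vs. intrinsics trade-off concrete, here is a minimal sketch
(not code from this PR or from ffmpeg) of the same 4 x 16-bit add written both
ways. The names add4_scalar, add4_intrinsic and add4_asm are made up for
illustration. The asm() variant assumes GCC-style inline asm in the default
AT&T syntax, and both variants fall back to plain C++ when MMX is not enabled.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

// Portable fallback: add four 16-bit lanes with wraparound.
inline void add4_scalar(const int16_t* a, const int16_t* b, int16_t* out) {
    for (int i = 0; i < 4; ++i)
        out[i] = static_cast<int16_t>(static_cast<uint16_t>(a[i] + b[i]));
}

// Intrinsic version: the compiler chooses registers, instruction order and
// addressing, so the generated code can differ between compiler versions.
void add4_intrinsic(const int16_t* a, const int16_t* b, int16_t* out) {
#if defined(__MMX__)
    __m64 va, vb;
    std::memcpy(&va, a, sizeof va);    // load 8 bytes without alignment assumptions
    std::memcpy(&vb, b, sizeof vb);
    __m64 sum = _mm_add_pi16(va, vb);  // paddw
    std::memcpy(out, &sum, sizeof sum);
    _mm_empty();                       // emms: leave MMX state before any x87 use
#else
    add4_scalar(a, b, out);
#endif
}

// asm() version: the instructions are emitted as written; the compiler only
// picks the general-purpose registers that hold the three pointers.
void add4_asm(const int16_t* a, const int16_t* b, int16_t* out) {
#if defined(__MMX__) && defined(__GNUC__)
    asm volatile(
        "movq  (%1), %%mm0\n\t"
        "paddw (%2), %%mm0\n\t"
        "movq  %%mm0, (%0)\n\t"
        "emms"
        :
        : "r"(out), "r"(a), "r"(b)
        : "memory", "mm0");
#else
    add4_scalar(a, b, out);
#endif
}
```

Compiling this with g++ -O2 -S under different compiler versions and comparing
the output shows the point being argued: the asm() body comes out verbatim,
while the code for the intrinsic version is whatever that compiler decides.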


> 
> And really thanks for your detailed benchmark results! And since your
> computation kernel is already 30% faster than current implementation, I'm sure
> that Dirac people (in CC of this PR) will be very interested in your
> computational kernel.

Yes, I am also fine with them using it under whichever FOSS license they want.

[...]


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
