------- Comment #9 from michaelni at gmx dot at 2008-03-23 02:49 -------
Subject: Re: Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0

On Sat, Mar 22, 2008 at 11:01:55AM -0000, ubizjak at gmail dot com wrote:
>
> ------- Comment #8 from ubizjak at gmail dot com 2008-03-22 11:01 -------
> (In reply to comment #6)
> > As Uros has "challenged me to beat performance of gcc-4.4 generated code by
> > hand-crafted assembly using the example of PR 21395", here's my entry. Sadly I
> > only have gcc-4.3 compiled ATM for comparison, but 4.3 generates better code
> > than 4.4, so I guess that's OK. Its inner loop is:
>
> Not!
>
> This is the comparison of runtimes for the original test, comparing 4.3.0 vs
> 4.4.0 compiled code on core2D EE:
>
> $ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
>
> real 0m0.619s
> user 0m0.620s
> sys 0m0.000s
>
> $ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
> $ time ./a.out
> 144
>
> real 0m0.398s
> user 0m0.400s
> sys 0m0.000s

On my Duron with -O2 -mmmx I get:

g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144
real 0m2.077s
user 0m1.912s
sys 0m0.019s

g++-4.4 (GCC) 4.4.0 20080321 (experimental)
144
real 0m2.172s
user 0m2.004s
sys 0m0.021s

With -m32 -march=core2 (incorrect, as it doesn't match the CPU!):

g++-4.3 (Debian 4.3.0-1) 4.3.1 20080309 (prerelease)
144
real 0m3.644s
user 0m3.389s
sys 0m0.022s

g++-4.4 (GCC) 4.4.0 20080321 (experimental)
Illegal instruction (yes, yes, I know I asked for it)
real 0m0.011s
user 0m0.003s
sys 0m0.007s

So on my Duron 4.3 seems to beat 4.4, as I expected from the generated asm.

>
> gcc 4.4.0 with your modified computation kernel:
>
> $ g++ -m32 -march=core2 -O2 mmx-1.cpp
> $ time ./a.out
> 144
>
> real 0m0.309s
> user 0m0.308s
> sys 0m0.000s
>
> To be honest, I didn't expect you to completely rewrite the computation kernel,
> so we are comparing apples to oranges.

Well, nothing stops gcc from rewriting the intrinsics either :)

> However, you can rewrite your ASM code
> using intrinsic functions from __mmintrin.h, and you will get all optimizations
> (scheduling, unrolling, etc) for free, while you are still in control of code
> generation on a fairly low level. Using intrinsics, you leave to the compiler
> things that the compiler is good at (loop handling, register allocation,
> scheduling).
>
> Are you interested in this experiment?

I am certainly interested, but I am a little busy with Google Summer of Code
students currently. We have to choose wisely which applications and students
we select for ffmpeg this summer ... that means a lot of code reviewing of what
the students submit as qualification tasks ...
So I won't rewrite this in intrinsics, at least not anytime soon.

> The results of this experiment would
> perhaps be interesting to ffmpeg people to consider rewriting their asm blocks
> into intrinsics.

Well ... I am not a friend of intrinsics, but I think you guessed that already :)
The thing I like about asm() is that it produces the same performance and code
with every compiler. It's largely a write-once-and-forget thing. A problem with
asm() is almost always of the compile-time error sort, like
"cant find register in class blah"; these things are visible and can be dealt
with ...
With intrinsics it's all a gamble; just look at this PR and how hugely
performance differs between gcc versions. If ffmpeg were using intrinsics
instead of asm, we would have to spend considerable time dealing with such
variations somehow.

>
> And really thanks for your detailed benchmark results! And since your
> computation kernel is already 30% faster than current implementation, I'm sure
> that Dirac people (in CC of this PR) will be very interested in your
> computational kernel.

Yes, I am also fine with them using it under whichever FOSS license they want.

[...]

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
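
For readers unfamiliar with the intrinsics style suggested in comment #8, here
is a minimal sketch of what an MMX kernel written with intrinsics can look
like. It is not the PR 21395 kernel, nor the hand-written asm from comment #6:
the saturating 16-bit add and the names add_sat_scalar and add_sat_mmx are
hypothetical, picked only for illustration. It assumes GCC or a compatible
compiler targeting x86 with MMX enabled (e.g. -mmmx), and falls back to plain
scalar code when __MMX__ is not defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__MMX__)
#include <mmintrin.h>   // MMX intrinsics: __m64, _mm_adds_pi16, _mm_empty
#endif

// Scalar reference: saturating add of signed 16-bit samples.
// (Hypothetical kernel, not taken from the PR.)
static void add_sat_scalar(int16_t* dst, const int16_t* a, const int16_t* b,
                           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int32_t s = int32_t(a[i]) + int32_t(b[i]);
        if (s > INT16_MAX) s = INT16_MAX;
        if (s < INT16_MIN) s = INT16_MIN;
        dst[i] = int16_t(s);
    }
}

// Same operation with MMX intrinsics, four samples per iteration.
// Register allocation, scheduling and unrolling are left to the compiler,
// which is exactly the part that can vary between compiler versions.
static void add_sat_mmx(int16_t* dst, const int16_t* a, const int16_t* b,
                        std::size_t n) {
    assert(dst != nullptr && a != nullptr && b != nullptr);
    std::size_t i = 0;
#if defined(__MMX__)
    for (; i + 4 <= n; i += 4) {
        __m64 va, vb;
        std::memcpy(&va, a + i, sizeof va);   // unaligned-safe 64-bit loads
        std::memcpy(&vb, b + i, sizeof vb);
        __m64 vs = _mm_adds_pi16(va, vb);     // corresponds to PADDSW
        std::memcpy(dst + i, &vs, sizeof vs);
    }
    _mm_empty();                              // EMMS: leave MMX state
#endif
    // Remaining samples (or all of them when MMX is unavailable).
    add_sat_scalar(dst + i, a + i, b + i, n - i);
}
```

Compiling this with different gcc versions (for example g++ -O2 -mmmx -S) and
comparing the emitted loops is one way to observe the version-to-version
variation discussed above; an asm() block would instead pin the instruction
sequence in the source.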