------- Comment #8 from ubizjak at gmail dot com 2008-03-22 11:01 -------
(In reply to comment #6)
> As Uros has "challenged me to beat performance of gcc-4.4 generated code by
> hand-crafted assembly using the example of PR 21395" heres my entry, sadly i
> only have gcc-4.3 compiled ATM for comparission but 4.3 generates better code
> than 4.4 so i guess thats ok its inner loop is:

Not! This is a comparison of runtimes for the original test, compiled with
4.3.0 vs. 4.4.0, on core2D EE:

$ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
$ time ./a.out 144
real    0m0.619s
user    0m0.620s
sys     0m0.000s

$ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
$ time ./a.out 144
real    0m0.398s
user    0m0.400s
sys     0m0.000s

gcc 4.4.0 with your modified computation kernel:

$ g++ -m32 -march=core2 -O2 mmx-1.cpp
$ time ./a.out 144
real    0m0.309s
user    0m0.308s
sys     0m0.000s

To be honest, I didn't expect you to completely rewrite the computation
kernel, so we are comparing apples to oranges. However, you can rewrite your
ASM code using the intrinsic functions from mmintrin.h, and you will get all
optimizations (scheduling, unrolling, etc.) for free, while still keeping
control of code generation at a fairly low level. With intrinsics, you leave
to the compiler the things the compiler is good at (loop handling, register
allocation, scheduling).

Are you interested in this experiment? The results could be of interest to
the ffmpeg people, who might consider rewriting their asm blocks as
intrinsics.

And really, thanks for your detailed benchmark results! Since your
computation kernel is already 30% faster than the current implementation,
I'm sure that the Dirac people (in CC of this PR) will be very interested
in it.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
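
For illustration only, this is roughly what a kernel written with MMX intrinsics looks like. It is a minimal sketch, not the kernel from this PR or from comment #6: the function name add_sat_i16 and the choice of a saturating 16-bit add are assumptions made for the example. The loop, register allocation and scheduling are left to the compiler; only the packed operation itself is spelled out.

```cpp
#include <mmintrin.h>   // MMX intrinsics (__m64, _mm_adds_pi16, _mm_empty)
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical example kernel: dst[i] = saturate(a[i] + b[i]) on 16-bit samples.
void add_sat_i16(int16_t* dst, const int16_t* a, const int16_t* b, std::size_t n)
{
    std::size_t i = 0;

    // Main loop: four 16-bit lanes per iteration.
    for (; i + 4 <= n; i += 4) {
        __m64 va, vb;
        std::memcpy(&va, a + i, sizeof va);   // unaligned-safe load
        std::memcpy(&vb, b + i, sizeof vb);
        __m64 sum = _mm_adds_pi16(va, vb);    // packed signed saturating add
        std::memcpy(dst + i, &sum, sizeof sum);
    }

    // Leave MMX state so that later x87 floating-point code works.
    _mm_empty();

    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        int s = a[i] + b[i];
        if (s > 32767) s = 32767;
        if (s < -32768) s = -32768;
        dst[i] = static_cast<int16_t>(s);
    }
}
```

Built with the same flags as above (for example g++ -m32 -march=core2 -O2), the compiler is free to unroll and schedule the loop around the intrinsic, which is the point of the experiment proposed here.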