------- Comment #8 from ubizjak at gmail dot com  2008-03-22 11:01 -------
(In reply to comment #6)
> As Uros has "challenged me to beat performance of gcc-4.4 generated code by
> hand-crafted assembly using the example of PR 21395" heres my entry, sadly i
> only have gcc-4.3 compiled ATM for comparission but 4.3 generates better code
> than 4.4 so i guess thats ok its inner loop is:

Not!

This is the comparison of runtimes for the original test, comparing 4.3.0 vs
4.4.0 compiled code on core2D EE:

$ g++ -V 4.3.0 -m32 -march=core2 -O2 mmx.cpp
$ time ./a.out
144

real    0m0.619s
user    0m0.620s
sys     0m0.000s

$ g++ -V 4.4.0 -m32 -march=core2 -O2 mmx.cpp
$ time ./a.out
144

real    0m0.398s
user    0m0.400s
sys     0m0.000s

gcc 4.4.0 with your modified computation kernel:

$ g++ -m32 -march=core2 -O2 mmx-1.cpp
$ time ./a.out
144

real    0m0.309s
user    0m0.308s
sys     0m0.000s

To be honest, I didn't expect you to completely rewrite the computation kernel,
so we are comparing apples to oranges. However, you can rewrite your ASM code
using intrinsic functions from mmintrin.h, and you will get all the compiler
optimizations (scheduling, unrolling, etc.) for free, while staying in control
of code generation at a fairly low level. Using intrinsics, you leave to the
compiler the things that the compiler is good at (loop handling, register
allocation, scheduling).
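
To illustrate what I mean by intrinsics, here is a minimal sketch. It is NOT
the kernel from this PR: the function name add_sat_i16 and the operation
(saturating addition of 16-bit samples) are made up for the example. Only
_mm_adds_pi16 and _mm_empty are real intrinsics from mmintrin.h. On a 32-bit
target you would build it with something like -m32 -march=core2 (or -mmmx).

```cpp
#include <mmintrin.h>   // MMX intrinsics (x86 targets only)
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical kernel, for illustration only:
// dst[i] = saturate_to_int16(a[i] + b[i]) for i in [0, n).
void add_sat_i16(std::int16_t* dst, const std::int16_t* a,
                 const std::int16_t* b, std::size_t n)
{
    std::size_t i = 0;

    // Main loop: four 16-bit lanes per 64-bit MMX value. The compiler is
    // free to schedule, unroll and allocate registers for this loop.
    for (; i + 4 <= n; i += 4) {
        __m64 va, vb;
        std::memcpy(&va, a + i, sizeof va);   // unaligned-safe load
        std::memcpy(&vb, b + i, sizeof vb);
        __m64 vr = _mm_adds_pi16(va, vb);     // packed signed saturating add
        std::memcpy(dst + i, &vr, sizeof vr); // unaligned-safe store
    }

    // Clear the MMX state so that later x87 floating-point code works.
    _mm_empty();

    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        int s = a[i] + b[i];
        if (s > 32767)  s = 32767;
        if (s < -32768) s = -32768;
        dst[i] = static_cast<std::int16_t>(s);
    }
}
```

The loop body maps more or less one-to-one to the instructions you would write
by hand (paddsw here), but the loop control, register allocation and
instruction ordering are left to the compiler.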

Are you interested in this experiment? The results would perhaps be interesting
to the ffmpeg people, who might then consider rewriting their asm blocks as
intrinsics.

And thank you very much for your detailed benchmark results! Since your
computation kernel is already about 30% faster than the current implementation
(0.309s vs. 0.398s above), I'm sure that the Dirac people (in CC of this PR)
will be very interested in it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
