Dennis Clarke wrote:
>The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
>UltraSparc beats GCC in almost every single test case that I have
>seen.
This is memcpy on Solaris:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s
It uses exactly the same method as memcpy on gcc libc, with only minor
differences that have no influence on performance.
>Also, you have provided no data at all.
I have linked to the data rather than copying it here to save space on
the mailing list. Here is the link again:
http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6, page 12.
>So your assertions are those of a marketing person at the moment.
Who sounds like a marketing person, you or me? :-)
>Please post some code that can be compiled and then tested with high
>resolution timers and perhaps we can compare notes.
Here is my code, again:
http://www.agner.org/optimize/asmlib.zip
My test results, referred to above, use the "core clock cycles"
performance counter on Intel and RDTSC on AMD. It's the highest
resolution you can get. Feel free to do your own tests; it's as simple as
linking my library into your test program.
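For anyone who wants to repeat the measurement, a minimal sketch of such a
test harness follows. It is not the code behind my published numbers. It
assumes GCC or Clang, reads the time stamp counter with __rdtsc() on x86,
and falls back to clock_gettime() elsewhere. The names read_ticks and
time_copy are hypothetical, chosen only for this illustration. Any function
with memcpy's signature can be passed in, so a library replacement can be
timed the same way as the libc version.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc() on GCC and Clang */
#endif

/* Read a high-resolution counter: the time stamp counter on x86,
   nanoseconds from a monotonic clock on other targets. */
uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Any routine with the same signature as memcpy. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Time `reps` copies of `len` bytes with the given byte offsets applied to
   source and destination, so unaligned operands can be tested.
   Returns the smallest tick count seen, or UINT64_MAX if allocation
   fails or the copied data does not match the source. */
uint64_t time_copy(copy_fn copy, size_t len,
                   size_t src_off, size_t dst_off, int reps)
{
    unsigned char *src = malloc(len + src_off);
    unsigned char *dst = malloc(len + dst_off);
    uint64_t best = UINT64_MAX;
    /* Calling through a volatile pointer keeps the compiler from
       removing or merging the repeated copies. */
    copy_fn volatile fn = copy;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return UINT64_MAX;
    }
    for (size_t i = 0; i < len + src_off; i++)
        src[i] = (unsigned char)(i * 7u + 1u);
    memset(dst, 0, len + dst_off);

    for (int r = 0; r < reps; r++) {
        uint64_t t0 = read_ticks();
        fn(dst + dst_off, src + src_off, len);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    if (memcmp(dst + dst_off, src + src_off, len) != 0)
        best = UINT64_MAX;

    free(src);
    free(dst);
    return best;
}
```

Calling time_copy(memcpy, 4096, 1, 3, 100), for instance, times a 4 kB copy
with both operands misaligned. Taking the minimum over many repetitions
filters out interrupts and cache misses on the first pass. Note that the
time stamp counter does not necessarily count core clock cycles on CPUs
that change frequency, so treat the numbers as relative.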
Tim Prince wrote:
>you identify the library you tested only as "ubuntu g++ 4.2.3."
Where can I see the libc version?
>The corresponding 64-bit linux will see vastly different levels of
>performance, depending on the glibc version, as it doesn't use a
>builtin string move.
Yes, this is exactly what my tests show. 64-bit libc is better than
32-bit libc, but still 3-4 times slower than the best library for
unaligned operands on an Intel.
>Certain newer CPUs aim to improve performance of the 32-bit gcc
>builtin string moves, but don't entirely eliminate the situations
>where it isn't optimum.
The Intel manuals are not clear about this. Intel Optimization reference
manual says:
>In most cases, applications should take advantage of the default
>memory routines provided by Intel compilers.
What excellent advice - the Intel compiler puts in a library with an
automatic run-slowly-on-AMD feature!
The Intel library does not use rep movs when running on an Intel CPU.
The AMD software optimization guide mentions specific situations where
rep movs is optimal. However, my tests on an Opteron (K8) show that rep
movs is never optimal on AMD either. I have no access to test it on the
new AMD K10, but I expect the XMM register code to run much faster on
K10 than on K8 because K10 has 128-bit data paths where K8 has only 64-bit.
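To make clear what "XMM register code" means here, the following is a
deliberately naive sketch of the idea, not the code in my library: the data
is moved 16 bytes at a time through an XMM register with unaligned SSE2
loads and stores instead of rep movs. The name copy16 is hypothetical. A
tuned routine would also align the destination and treat small and very
large blocks separately, none of which is attempted here. On targets
without SSE2 the sketch falls back to a plain 16-byte memcpy per chunk so
that it still compiles.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Copy n bytes in 16-byte chunks, then finish byte by byte.
   The buffers must not overlap. Returns dst, like memcpy. */
void *copy16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
#if defined(__SSE2__)
        /* Unaligned 128-bit load into an XMM register, then store. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
#else
        memcpy(d, s, 16);   /* portable fallback, no XMM registers */
#endif
        d += 16;
        s += 16;
        n -= 16;
    }
    while (n > 0) {          /* remaining 0..15 bytes */
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

On a CPU with 64-bit data paths each 128-bit load and store is split into
two operations, which is why I expect the same loop to gain from the wider
paths in K10.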
Evidently, the problem with memcpy has been ignored for years, see
http://softwarecommunity.intel.com/Wiki/Linux/719.htm