On Thu, Jul 24, 2008 at 1:03 AM, Agner Fog <[EMAIL PROTECTED]> wrote:
> Dennis Clarke wrote:
>>The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
>>UltraSparc beats GCC in almost every single test case that I have
>>seen.
>
> This is memcpy on Solaris:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s
>
> It uses exactly the same method as memcpy on gcc libc, with only minor
> differences that have no influence on performance.

There is a more optimized version for 64-bit:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s
I think this looks similar to your implementation, Agner.

-raksit

>
>> Also, you have provided no data at all.
>
> I have linked to the data rather than copying it here to save space on the
> mailing list. Here is the link again:
> http://www.agner.org/optimize/optimizing_cpp.pdf section 2.6, page 12.
>
>> So your assertions are those of a marketing person at the moment.
>
> Who sounds like a marketing person, you or me? :-)
>
>> Please post some code that can be compiled and then tested with high
>> resolution timers and perhaps we can compare notes.
>
> Here is my code, again:
> http://www.agner.org/optimize/asmlib.zip
> My test results, referred to above, use the "core clock cycles" performance
> counter on Intel and RDTSC on AMD. It's the highest resolution you can get.
> Feel free to do your own tests, it's as simple as linking my library into
> your test program.
>
> Tim Prince wrote:
>>you identify the library you tested only as "ubuntu g++ 4.2.3."
> Where can I see the libc version?
>
>>The corresponding 64-bit linux will see vastly different levels of
>>performance, depending on the glibc version, as it doesn't use a builtin
>>string move.
> Yes, this is exactly what my tests show. 64-bit libc is better than 32-bit
> libc, but still 3-4 times slower than the best library for unaligned
> operands on an Intel.
>
>>Certain newer CPUs aim to improve performance of the 32-bit gcc builtin
>>string moves, but don't entirely eliminate the situations where it isn't
>>optimum.
>
> The Intel manuals are not clear about this. Intel Optimization reference
> manual says:
>>In most cases, applications should take advantage of the default memory
>>routines provided by Intel compilers.
> What excellent advice - the Intel compiler puts in a library with an
> automatic run-slowly-on-AMD feature!
> The Intel library does not use rep movs when running on an Intel CPU.
>
> The AMD software optimization guide mentions specific situations where rep
> movs is optimal. However, my tests on an Opteron (K8) show that rep movs is
> never optimal on AMD either. I have no access to test it on the new AMD K10,
> but I expect the XMM register code to run much faster on K10 than on K8
> because K10 has 128-bit data paths where K8 has only 64-bit.
>
> Evidently, the problem with memcpy has been ignored for years, see
> http://softwarecommunity.intel.com/Wiki/Linux/719.htm
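
For anyone who wants to compare notes on the aligned versus unaligned case
discussed above, here is a minimal C sketch of a timing harness. It is not
Agner's test code. The names memcpy_ns_per_call, elapsed_ns and align64 are
made up for illustration, and it uses the C11 timespec_get wall clock as a
portable stand-in for the "core clock cycles" counter and RDTSC mentioned
above, so the numbers are much coarser than a cycle count. It measures
whichever memcpy the program ends up linked against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: nanoseconds elapsed between two timestamps. */
static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1e9
         + (double)(b.tv_nsec - a.tv_nsec);
}

/* Hypothetical helper: round a pointer up to the next 64-byte boundary. */
static unsigned char *align64(unsigned char *p)
{
    return p + ((64 - ((uintptr_t)p & 63)) & 63);
}

/*
 * Hypothetical helper: average wall-clock nanoseconds per memcpy call for
 * `len` bytes, with source and destination displaced by `src_off` and
 * `dst_off` bytes from a 64-byte-aligned base.
 * Returns a negative value on bad arguments or allocation failure.
 */
double memcpy_ns_per_call(size_t len, size_t src_off, size_t dst_off, int iters)
{
    if (len == 0 || iters <= 0)
        return -1.0;

    /* 64 spare bytes leave room for rounding up to the aligned base. */
    unsigned char *src_buf = malloc(len + src_off + 64);
    unsigned char *dst_buf = malloc(len + dst_off + 64);
    if (src_buf == NULL || dst_buf == NULL) {
        free(src_buf);
        free(dst_buf);
        return -1.0;
    }

    unsigned char *src = align64(src_buf) + src_off;
    unsigned char *dst = align64(dst_buf) + dst_off;
    memset(src, 0xA5, len);

    volatile unsigned char sink = 0;
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < iters; i++) {
        src[0] = (unsigned char)i;  /* change the input so every copy is needed */
        memcpy(dst, src, len);
        sink ^= dst[0];             /* read the result so the copy is observed */
    }
    timespec_get(&t1, TIME_UTC);

    /* Sanity check: the last copy must match the source. */
    assert(memcmp(dst, src, len) == 0);

    free(src_buf);
    free(dst_buf);
    return elapsed_ns(t0, t1) / iters;
}
```

Calling memcpy_ns_per_call(4096, 0, 0, 100000) and then
memcpy_ns_per_call(4096, 1, 3, 100000) gives one aligned and one unaligned
figure to compare. The loop overhead and the timer calls are included in the
result, so small sizes will be dominated by noise. With GCC the compiler may
substitute its own builtin for the call, so building with -fno-builtin-memcpy
is probably needed if the goal is to time the library routine itself.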