On Thu, Jul 24, 2008 at 1:03 AM, Agner Fog <[EMAIL PROTECTED]> wrote:
> Dennis Clarke wrote:
>>The Sun Studio 12 compiler with Solaris 10 on AMD Opteron or
>>UltraSparc beats GCC in almost every single test case that I have
>>seen.
>
> This is memcpy on Solaris:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/i386/gen/memcpy.s
>
> It uses exactly the same method as memcpy on gcc libc, with only minor
> differences that have no influence on performance.

There is a more optimized version for 64-bit:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/amd64/gen/memcpy.s

I think this looks similar to your implementation, Agner.

-raksit

>
>> Also, you have provided no data at all.
>
> I have linked to the data rather than copying it here to save space on the
> mailing list. Here is the link again:
> http://www.agner.org/optimize/optimizing_cpp.pdf  section 2.6, page 12.
>
>> So your assertions are those of a marketing person at the moment.
>
> Who sounds like a marketing person, you or me? :-)
>
>> Please post some code that can be compiled and then tested with high
>> resolution timers and perhaps
>> we can compare notes.
>
> Here is my code, again:
> http://www.agner.org/optimize/asmlib.zip
> My test results, referred to above, use the "core clock cycles" performance
> counter on Intel and RDTSC on AMD. It's the highest resolution you can get.
> Feel free to do your own tests; it's as simple as linking my library into
> your test program.
>
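
For anyone who wants to try such a test, here is a minimal sketch of a
timing harness in C. It is not taken from asmlib: the names time_memcpy
and BENCH_MAIN are hypothetical, and clock_gettime() stands in for the
core-clock-cycle counter and RDTSC mentioned above, so the numbers are
only a rough comparison of aligned against unaligned operands.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for clock_gettime */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: average nanoseconds per memcpy() of len bytes when
   src and dst start src_off and dst_off bytes past a 64-byte boundary.
   Returns -1.0 on bad arguments, allocation failure, or a wrong copy. */
double time_memcpy(size_t len, size_t src_off, size_t dst_off, int reps)
{
    if (len == 0 || reps <= 0 || src_off >= 64 || dst_off >= 64)
        return -1.0;

    /* aligned_alloc (C11) wants a size that is a multiple of the alignment */
    size_t cap = (len + 64 + 63) / 64 * 64;
    unsigned char *src = aligned_alloc(64, cap);
    unsigned char *dst = aligned_alloc(64, cap);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, cap);
    memset(dst, 0, cap);

    volatile unsigned char sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        src[src_off] = (unsigned char)i;  /* vary the data between copies */
        memcpy(dst + dst_off, src + src_off, len);
        sink ^= dst[dst_off];             /* consume the result */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    /* check that the last copy really arrived intact */
    int ok = memcmp(dst + dst_off, src + src_off, len) == 0;
    free(src);
    free(dst);
    if (!ok)
        return -1.0;

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / reps;
}

#ifdef BENCH_MAIN  /* hypothetical switch: build with -DBENCH_MAIN */
int main(void)
{
    printf("aligned:   %.1f ns/copy\n", time_memcpy(4096, 0, 0, 100000));
    printf("unaligned: %.1f ns/copy\n", time_memcpy(4096, 1, 3, 100000));
    return 0;
}
#endif
```

To compare libraries, the same loop can be built once against the system
libc and once with a replacement memcpy linked in. The loop overhead is
included in the figures, so only the differences between runs are
meaningful.
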
> Tim Prince wrote:
>>you identify the library you tested only as "ubuntu g++ 4.2.3."
> Where can I see the libc version?
>
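
On the libc version question: when the C library is glibc, a program can
ask for the version at run time with gnu_get_libc_version() from
<gnu/libc-version.h>. A small sketch follows; the wrapper name
libc_version_string is hypothetical, and the fallback string is only a
placeholder for non-glibc systems.

```c
#include <stdio.h>
#include <string.h>
#ifdef __GLIBC__
#include <gnu/libc-version.h>  /* glibc only */
#endif

/* Hypothetical wrapper: returns the glibc release string (for example
   "2.7"), or a placeholder when the program is not built against glibc. */
const char *libc_version_string(void)
{
#ifdef __GLIBC__
    return gnu_get_libc_version();
#else
    return "unknown (not glibc)";
#endif
}
```

Printing this string next to the benchmark results makes it clear which
library the numbers belong to.
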
>>The corresponding 64-bit linux will see vastly different levels of
>> performance, depending on the
>>glibc version, as it doesn't use a builtin string move.
> Yes, this is exactly what my tests show. 64-bit libc is better than 32-bit
> libc, but still 3-4 times slower than the best library for unaligned
> operands on an Intel CPU.
>
>>Certain newer CPUs aim to improve performance of the 32-bit gcc builtin
>> string moves, but don't
>> entirely eliminate the situations where it isn't optimum.
>
> The Intel manuals are not clear about this. Intel Optimization reference
> manual says:
>>In most cases, applications should take advantage of the default memory
>> routines provided by Intel compilers.
> What excellent advice - the Intel compiler puts in a library with an
> automatic run-slowly-on-AMD feature!
> The Intel library does not use rep movs when running on an Intel CPU.
>
> The AMD software optimization guide mentions specific situations where rep
> movs is optimal. However, my tests on an Opteron (K8) show that rep movs is
> never optimal on AMD either. I have no access to test it on the new AMD K10,
> but I expect the XMM register code to run much faster on K10 than on K8
> because K10 has 128-bit data paths where K8 has only 64-bit.
>
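
To illustrate what "XMM register code" means here, the following is a
bare-bones sketch in C with SSE2 intrinsics that moves 16 bytes at a time
through an XMM register. It is not the asmlib or Solaris implementation,
which do considerably more work on alignment; the function name
copy_xmm_sketch is hypothetical, and on builds without SSE2 it simply
falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical sketch: copy n bytes from src to dst (non-overlapping),
   using unaligned 128-bit loads and stores where SSE2 is available. */
void *copy_xmm_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(__SSE2__)
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);  /* 16-byte load */
        _mm_storeu_si128((__m128i *)d, v);                /* 16-byte store */
        s += 16;
        d += 16;
        n -= 16;
    }
#endif
    memcpy(d, s, n);  /* remaining tail, or everything without SSE2 */
    return dst;
}
```

Each loop iteration moves 128 bits, which is why a CPU with 128-bit data
paths would be expected to benefit more from this style of copy than one
that splits the access into two 64-bit halves.
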
> Evidently, the problem with memcpy has been ignored for years, see
> http://softwarecommunity.intel.com/Wiki/Linux/719.htm
>
>
