On May 25 2002, Andrew Patrikalakis wrote:

> With all the recent talk of use of assembly on the PowerPC, I came
> up with a patch to use assembly versions of memcpy. It's about 35%
> faster. Here is a sample of the memcpy speed test (which also now
> works):

You're going to die laughing. I beat this by 8% in plain C,
without using 64-bit at all. :-) It's kind of portable even.

> Benchmarking memcpy methods (smaller is better):
>         glibc memcpy() : 136
>         ppcasm_memcpy() : 137
>         ppcasm_cacheable_memcpy() : 88
> xine: using ppcasm_cacheable_memcpy()
> (The lower time resolution is because I'm using times(NULL) in rdtsc())

My numbers are in MB/s (so bigger is better, unlike the timings quoted
above) on a 450MHz MPC7400, with a granularity of 8. You could mostly
ignore the low scores, assuming that the code got unlucky with something.
I recompiled for every test, and did a bit of web browsing in between
every few tests. I was careful to initialize the data first; failure to
do so would mean reading from the zero page.

glibc:   96, 96,104,104,104,104,104,112,112,112,112
kernel: 104,104,104,112,120,120,128,128,128,128,128
c2_flt: 112,120,120,120,120,120,128,136,144,144,144
c_flt:   88, 88, 88, 96,104,104,104,112,112,112,112
c_dbl:  152,152,152,152,152,168,168,168,168,168,184
c2_dbl: 120,136,144,144,152,152,152,160,160,160,168

glibc   is just that
kernel  is the assembly code that was posted
c2_flt  is the code below
c_flt   is like c2_flt, but normal 0,1,2,3,4,5... order
c2_dbl  is like c2_flt, but with type "double"
c_dbl   is like c_flt, but with type "double"

For the old bus, decimal MB/s copied should be 3.2 times the bus
speed, counting only the bytes copied rather than both the loads and
the stores. If you have a "G4" on the MaxBus, it should be 4x bus speed
minus a tiny bit of overhead for occasional load/store turnaround.
So unless something is wrong with Mac motherboards, none of these
methods are anywhere near the limit.

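To put the "3.2 times" and "4x" rules of thumb into numbers, here is a small sketch in C. The message does not state the bus clock of the test machine, so the 100 MHz used in the note below is purely a hypothetical figure, and the function names are made up for this illustration. The multiplier is passed in tenths (32 for 3.2x, 40 for 4x) so the arithmetic stays in integers.

```c
/* Rough ceiling on copy throughput in decimal MB/s, using the rule of
 * thumb from the message: copied MB/s = bus MHz times a factor.
 * factor_tenths is the factor times ten (32 for 3.2x, 40 for 4x).
 * Hypothetical helper, not part of any posted patch. */
static unsigned copy_limit_mbs(unsigned bus_mhz, unsigned factor_tenths)
{
    return bus_mhz * factor_tenths / 10u;
}

/* How much of that ceiling a measured score reaches, in whole percent
 * (rounded down). limit_mbs must be nonzero. */
static unsigned percent_of_limit(unsigned measured_mbs, unsigned limit_mbs)
{
    return measured_mbs * 100u / limit_mbs;
}
```

With an assumed 100 MHz bus, copy_limit_mbs(100, 32) gives 320 and copy_limit_mbs(100, 40) gives 400. The best score in the table, 184 for c_dbl, would then be percent_of_limit(184, 320), that is 57 percent of the old-bus ceiling, which fits the remark that none of the methods are near the limit.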
Command line:
gcc -Wall -O2 mem.c kern.S && ./a.out

gcc version:
Reading specs from /usr/lib/gcc-lib/powerpc-linux/2.95.4/specs
gcc version 2.95.4 20011006 (Debian prerelease)

////////////////////////////////////////////////////////////////////////
/* Copies n bytes in blocks of 16 floats (64 bytes). Any bytes past the
   last whole 64-byte block are not copied. */
static void c2_flt_memcpy(void *dst, const void *src, size_t n){
  float r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,ra,rb,rc,rd,re,rf;
  int i=n/(16*4);  /* 16 is loop unroll factor, 4 is sizeof float */
  /* start one block early; the loop advances before each access */
  float *sp = (float*)src - 16;
  float *dp = (float*)dst - 16;
  while(i--){
    /* loads first, in interleaved 0,8,1,9,... order */
    sp += 16;
    r0 = sp[0];   r8 = sp[8];
    r1 = sp[1];   r9 = sp[9];
    r2 = sp[2];   ra = sp[10];
    r3 = sp[3];   rb = sp[11];
    r4 = sp[4];   rc = sp[12];
    r5 = sp[5];   rd = sp[13];
    r6 = sp[6];   re = sp[14];
    r7 = sp[7];   rf = sp[15];
    /* then the stores, in the same order */
    dp += 16;
    dp[ 0] = r0;  dp[ 8] = r8;
    dp[ 1] = r1;  dp[ 9] = r9;
    dp[ 2] = r2;  dp[10] = ra;
    dp[ 3] = r3;  dp[11] = rb;
    dp[ 4] = r4;  dp[12] = rc;
    dp[ 5] = r5;  dp[13] = rd;
    dp[ 6] = r6;  dp[14] = re;
    dp[ 7] = r7;  dp[15] = rf;
  }
}
////////////////////////////////////////////////////////////////////////
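The message mentions initializing the data before timing and reporting decimal MB/s, but the test program itself (mem.c) was not posted. The following is only a guess at what such a harness might look like, written in C. Every name in it (copy_fn, libc_copy, fill_pattern, bench_copy) is hypothetical, clock() is used as a stand-in timer, and libc_copy is a placeholder where a routine such as c2_flt_memcpy could be plugged in.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by the copy routines under test (assumed). */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder routine so the sketch is self-contained. */
static void libc_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Write to every byte so the buffer is backed by real pages before
 * timing starts, as the message recommends. */
static void fill_pattern(unsigned char *buf, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        buf[i] = (unsigned char)(i * 31u + 7u);
}

/* Copy n bytes reps times and return decimal MB/s.
 * Returns -1.0 on allocation failure or if the copy is wrong, and
 * 0.0 if the run was too short for clock() to measure.
 * n should be a multiple of 64 for routines that skip the tail. */
static double bench_copy(copy_fn fn, size_t n, int reps)
{
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    double secs, result = -1.0;
    clock_t t0, t1;
    int r;

    if (src && dst) {
        fill_pattern(src, n);
        memset(dst, 0, n);          /* touch the destination too */
        t0 = clock();
        for (r = 0; r < reps; r++)
            fn(dst, src, n);
        t1 = clock();
        secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (memcmp(dst, src, n) == 0)
            result = secs > 0.0 ? (double)n * reps / 1e6 / secs : 0.0;
    }
    free(src);
    free(dst);
    return result;
}
```

A call such as bench_copy(libc_copy, 8u * 1024u * 1024u, 20) would report one score. An optimizing compiler may shorten a loop of identical copies, so a real test should check the generated code or vary the buffers between repetitions.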