Re: [xine-user] [ANN] PowerPC Assembly Patch

Albert D. Cahalan Mon, 27 May 2002 02:21:03 -0500

On May 25 2002, Andrew Patrikalakis wrote:

> With all the recent talk of use of assembly on the PowerPC, I came
> up with a patch to use assembly versions of memcpy. It's about 35%
> faster. Here is a sample of the memcpy speed test (which also now
> works):


You're going to die laughing. I beat this by 8% in plain C,
without using 64-bit at all. :-) It's kind of portable even.

> Benchmarking memcpy methods (smaller is better):
>       glibc memcpy() : 136
>       ppcasm_memcpy() : 137
>       ppcasm_cacheable_memcpy() : 88
> xine: using ppcasm_cacheable_memcpy()
> (The lower time resolution is because I'm using times(NULL) in rdtsc())

My numbers are in MB/s (so bigger is better) on a 450MHz MPC7400,
with a granularity of 8 MB/s. You can mostly ignore the low scores,
assuming that the code got unlucky with something. I recompiled for
every test, and did a bit of web browsing in between every few tests.
I was careful to initialize the data first; failure to do so would
mean reading from the zero page.
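
The point about initializing the data can be sketched as a small timing
harness. This is not the mem.c used for the numbers in this post, which
was not included; bench_copy, copy_fn, the fill patterns and the use of
clock() are all assumptions made for illustration. It takes a
memcpy-style function, so a void-returning routine would need a thin
wrapper.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature: same shape as memcpy(). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Returns decimal MB/s copied, 0.0 if the run was too short to time,
 * -1.0 on allocation failure, -2.0 if the copy produced wrong data. */
static double bench_copy(copy_fn fn, size_t bytes, int reps)
{
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    clock_t t0, t1;
    double secs, result;
    int i;

    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page before timing. Memory that has never been
     * written is typically backed by a shared zero page, so reading
     * it would not generate real memory traffic. */
    memset(src, 0xA5, bytes);
    memset(dst, 0x5A, bytes);

    t0 = clock();
    for (i = 0; i < reps; i++)
        fn(dst, src, bytes);
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (memcmp(dst, src, bytes) != 0)
        result = -2.0;                 /* copy routine is broken */
    else if (secs > 0.0)
        result = (double)bytes * reps / secs / 1e6;
    else
        result = 0.0;                  /* below timer resolution */

    free(src);
    free(dst);
    return result;
}
```

Calling bench_copy(memcpy, 8 * 1024 * 1024, 50) would time the library
memcpy; the buffer should be much larger than the caches if the goal is
to measure the memory bus rather than cache bandwidth.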

glibc:    96, 96,104,104,104,104,104,112,112,112,112
kernel:  104,104,104,112,120,120,128,128,128,128,128
c2_flt:  112,120,120,120,120,120,128,136,144,144,144
c_flt:    88, 88, 88, 96,104,104,104,112,112,112,112
c_dbl:   152,152,152,152,152,168,168,168,168,168,184
c2_dbl:  120,136,144,144,152,152,152,160,160,160,168

glibc is just that
kernel is the assembly code that was posted
c2_flt is the code below
c_flt is like c2_flt, but normal 0,1,2,3,4,5... order
c2_dbl is like c2_flt, but with type "double"
c_dbl is like c_flt, but with type "double"

For the old bus, decimal MB/s copied should be 3.2 times the
bus speed in MHz, counting each byte copied once rather than
counting both its load and its store. If you have a "G4" on the
MaxBus, it should be 4x bus speed minus a tiny bit of overhead
for occasional load/store turnaround. So unless something is
wrong with Mac motherboards, none of these methods are anywhere
near the limit.
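
To put a number on "anywhere near the limit", here is the arithmetic as
a small sketch. The bus clock is not stated in this post, so the 100 MHz
used in the note after the code is purely an assumption, and both
function names are made up for illustration.

```c
/* Peak copy rate in decimal MB/s for a bus clocked at bus_mhz.
 * factor is MB/s copied per MHz of bus clock: 3.2 for the old bus
 * and 4.0 for the newer one, per the paragraph above. */
static double copy_limit_mbps(double bus_mhz, double factor)
{
    return bus_mhz * factor;
}

/* Fraction of the theoretical limit that a measured rate reaches. */
static double fraction_of_limit(double measured_mbps, double bus_mhz,
                                double factor)
{
    return measured_mbps / copy_limit_mbps(bus_mhz, factor);
}
```

With an assumed 100 MHz bus the limits come out at 320 and 400 MB/s, so
the best figure in the table, 184 MB/s for c_dbl, would be a bit under
58% of the old-bus limit.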

Command line:
gcc -Wall -O2 mem.c kern.S && ./a.out

gcc version:
Reading specs from /usr/lib/gcc-lib/powerpc-linux/2.95.4/specs
gcc version 2.95.4 20011006 (Debian prerelease)

////////////////////////////////////////////////////////////////////////
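/* Copies n bytes in 64-byte blocks; a tail of n % 64 bytes is NOT copied.
 * Assumes src and dst are 4-byte aligned and do not overlap. */
static void c2_flt_memcpy(void *dst, const void *src, size_t n){
    float r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,ra,rb,rc,rd,re,rf;
    size_t i=n/(16*4);    /* 16 is loop unroll factor, 4 is sizeof float */
    const float *sp = (const float*)src - 16;  /* pre-biased: bumped before use */
    float *dp = (float*)dst - 16;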
    while(i--){
      sp += 16;
      r0 = sp[0];
      r8 = sp[8];
      r1 = sp[1];
      r9 = sp[9];
      r2 = sp[2];
      ra = sp[10];
      r3 = sp[3];
      rb = sp[11];
      r4 = sp[4];
      rc = sp[12];
      r5 = sp[5];
      rd = sp[13];
      r6 = sp[6];
      re = sp[14];
      r7 = sp[7];
      rf = sp[15];
      dp += 16;
      dp[ 0] = r0;
      dp[ 8] = r8;
      dp[ 1] = r1;
      dp[ 9] = r9;
      dp[ 2] = r2;
      dp[10] = ra;
      dp[ 3] = r3;
      dp[11] = rb;
      dp[ 4] = r4;
      dp[12] = rc;
      dp[ 5] = r5;
      dp[13] = rd;
      dp[ 6] = r6;
      dp[14] = re;
      dp[ 7] = r7;
      dp[15] = rf;
    }
}
////////////////////////////////////////////////////////////////////////
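
Only c2_flt is posted above. Going by the legend, c_dbl is the same idea
with type "double" and plain 0,1,2,3... order, so a guess at it might
look like the sketch below. This is a reconstruction, not the code that
produced the c_dbl numbers: it drops the pre-biased pointers and adds a
byte-wise tail copy so that all n bytes are handled. Like the original,
it assumes suitably aligned, non-overlapping buffers and floating-point
loads and stores that preserve every bit pattern.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of a "c_dbl"-style copy: 16 doubles (128 bytes) per loop
 * iteration, loaded and stored in ascending order. */
static void c_dbl_memcpy(void *dst, const void *src, size_t n)
{
    double r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,ra,rb,rc,rd,re,rf;
    size_t blocks = n / (16 * sizeof(double));
    size_t done = blocks * 16 * sizeof(double);
    const double *sp = (const double *)src;
    double *dp = (double *)dst;

    while (blocks--) {
        /* all 16 loads first, then all 16 stores */
        r0 = sp[0];  r1 = sp[1];  r2 = sp[2];  r3 = sp[3];
        r4 = sp[4];  r5 = sp[5];  r6 = sp[6];  r7 = sp[7];
        r8 = sp[8];  r9 = sp[9];  ra = sp[10]; rb = sp[11];
        rc = sp[12]; rd = sp[13]; re = sp[14]; rf = sp[15];
        dp[0] = r0;  dp[1] = r1;  dp[2] = r2;  dp[3] = r3;
        dp[4] = r4;  dp[5] = r5;  dp[6] = r6;  dp[7] = r7;
        dp[8] = r8;  dp[9] = r9;  dp[10] = ra; dp[11] = rb;
        dp[12] = rc; dp[13] = rd; dp[14] = re; dp[15] = rf;
        sp += 16;
        dp += 16;
    }

    /* Tail: whatever did not fill a whole 128-byte block. */
    if (n > done)
        memcpy((char *)dst + done, (const char *)src + done, n - done);
}
```

The wider type halves the number of load and store instructions per
byte, which is presumably why the "double" variants lead the table.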


