Re: Efficient memcpy()/memmove() for G2/G3 cores...

David Jander Mon, 25 Aug 2008 06:07:38 -0700

Hi Matt,

On Monday 25 August 2008 13:00:10 Matt Sealey wrote:
> The focus has definitely been on VMX but that's not to say lower power
> processors were forgotten :)


Lower-power (pun intended) is coming on strong these days, as energy efficiency 
is getting more important every day. And the MPC5121 is a brand-new embedded 
processor that will most probably pop up in quite a lot of devices around 
you ;-)

> Gunnar von Boehn did some benchmarking with an assembly optimized routine,
> for Cell, 603e and so on (basically the whole gamut from embedded up to
> server class IBM chips) and got some pretty good results;
>
> http://www.powerdeveloper.org/forums/viewtopic.php?t=1426
>
> It is definitely something that needs fixing. The generic routine in glibc
> just copies words with no benefit of knowing the cache line size or any
> cache block buffers in the chip, and certainly no use of cache control or
> data streaming on higher end chips.
>
> With knowledge of the right way to unroll the loops, how many copies to
> do at once to try and get a burst, reducing cache usage etc. you can get
> very impressive performance (as you can see, 50MB up to 78MB at the
> smallest size, the basic improvement is 2x performance).
>
> I hope that helps you a little bit. Gunnar posted code to this list not
> long after. I have a copy of the "e300 optimized" routine but I thought
> it best that he post it here, rather than me.

OK, I think I found it in the thread. The only problem is that, AFAICS, it can 
be much better... at least on my platform (e300 core), and I don't know why! 
Can you explain this?

I did this:

I took Gunnar's code (copy-paste from the forum), renamed the function from 
memcpy_e300 to memcpy and put it in a file called "memcpy_e300.S". Then I 
did:

$ gcc -O2 -Wall -shared -o libmemcpye300.so memcpy_e300.S

I tried the performance with the small program in the attachment:

$ gcc -O2 -Wall -o pruvmem pruvmem.c
$ LD_PRELOAD=..../libmemcpye300.so ./pruvmem

Data rate:  45.9 MiB/s

Then I did the same thing with my own memcpy, written in C (see the attached 
file mymemcpy.c):

$ LD_PRELOAD=..../libmymemcpy.so ./pruvmem

Data rate:  72.9 MiB/s

Now, can someone please explain this?

As a reference, here's glibc's performance:

$ ./pruvmem

Data rate:  14.8 MiB/s
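
For a rough sense of scale, the three figures above can be turned into
speedup factors. This is only an illustrative sketch; the helper name
speedup() is made up for this example and is not part of either attachment.

```c
/* Hypothetical helper: how many times faster `rate` is than `baseline`.
 * Both arguments are throughputs in the same unit (here MiB/s).
 * Returns 0.0 for a non-positive baseline to avoid dividing by zero. */
double speedup(double rate, double baseline)
{
    if (baseline <= 0.0)
        return 0.0;
    return rate / baseline;
}
```

With the numbers above, speedup(45.9, 14.8) is roughly 3.1 and
speedup(72.9, 14.8) is roughly 4.9, while the C version comes out about
1.6 times faster than the assembly routine.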

> There is a lot of scope I think for optimizing several points (glibc,
> kernel, some applications) for embedded processors which nobody is
> really taking on. But, not many people want to do this kind of work..

They should! It makes a HUGE difference. I certainly will, of course.

Greetings,

-- 
David Jander

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
        int f;
        unsigned long int *mem,*src,*dst;
        int t;
        long int usecs;
        unsigned long int secs, count;
        double rate;
        struct timeval tv, tv0, tv1;

        printf("Opening fb0\n");
        f = open("/dev/fb0", O_RDWR);
        if(f<0) {
                perror("opening fb0");
                return 1;
        }
        printf("mmapping fb0\n");

        mem = mmap(NULL, 0x00300000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, f, 0);

        /* mmap() signals failure with MAP_FAILED; only report errno in that case */
        if(mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("mmap returned: %p\n", (void *)mem);

        gettimeofday(&tv, NULL);
        for(t=0; t<0x000c0000; t++)
                mem[t] = (tv.tv_usec ^ tv.tv_sec) ^ t;
        count = 0;
        gettimeofday(&tv0, NULL);
        for(t=0; t<10; t++) {
                src = mem;
                dst = mem+0x00040000;
                memcpy(dst, src, 0x00100000);
                count += 0x00100000;
        }
        gettimeofday(&tv1, NULL);
        secs = tv1.tv_sec-tv0.tv_sec;
        usecs = tv1.tv_usec-tv0.tv_usec;
        if(usecs<0) {
                usecs += 1000000;
                secs -= 1;
        }
        /* secs and count are unsigned long, so print them with %lu */
        printf("Time elapsed: %lu secs, %ld usecs data transferred: %lu bytes\n",secs, usecs, count);
        rate = (double)count/((double)secs + (double)usecs/1000000.0);
        printf("Data rate: %5.3g MiB/s\n", rate/(1024.0*1024.0));

        return 0;
}
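
The timing arithmetic at the end of main() can also be factored into two
small helpers. The following is a minimal sketch, not part of the original
attachment: the names elapsed_seconds() and rate_mib_per_sec() are
hypothetical, and the guard against a zero elapsed time is an addition that
the program above does not have.

```c
#include <sys/time.h>

/* Hypothetical helper: wall-clock time from *t0 to *t1 in seconds,
 * using the same borrow-from-seconds step as the program above. */
double elapsed_seconds(const struct timeval *t0, const struct timeval *t1)
{
    long secs = (long)(t1->tv_sec - t0->tv_sec);
    long usecs = (long)(t1->tv_usec - t0->tv_usec);

    if (usecs < 0) {            /* microsecond field wrapped around */
        usecs += 1000000;
        secs -= 1;
    }
    return (double)secs + (double)usecs / 1000000.0;
}

/* Hypothetical helper: throughput in MiB/s for `bytes` copied in `seconds`.
 * Returns 0.0 when no time elapsed (an assumption made for this sketch,
 * so that a too-fast run does not divide by zero). */
double rate_mib_per_sec(unsigned long bytes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}
```

In the benchmark this would be called as
rate_mib_per_sec(count, elapsed_seconds(&tv0, &tv1)) after the copy loop.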

#include <stdlib.h>
void * memcpy(void * dst, void const * src, size_t len)
{
    /* 16 temporaries hold one 64-byte block; this assumes a 4-byte long,
       as on the 32-bit e300 */
    unsigned long int a,b,c,d;
    unsigned long int a1,b1,c1,d1;
    unsigned long int a2,b2,c2,d2;
    unsigned long int a3,b3,c3,d3;
    long * plDst = (long *) dst;
    long const * plSrc = (long const *) src;
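    // Alignment check left disabled, so callers must pass word-aligned pointers.
    // (A word-alignment test needs the low-bit mask 3, not 0xFFFFFFFC.)
    //if (!((unsigned long)src & 3) && !((unsigned long)dst & 3))
    //{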
        while (len >= 64)
        {
            a =  plSrc[0];
            b =  plSrc[1];
            c =  plSrc[2];
            d =  plSrc[3];
            a1 = plSrc[4];
            b1 = plSrc[5];
            c1 = plSrc[6];
            d1 = plSrc[7];
            a2 = plSrc[8];
            b2 = plSrc[9];
            c2 = plSrc[10];
            d2 = plSrc[11];
            a3 = plSrc[12];
            b3 = plSrc[13];
            c3 = plSrc[14];
            d3 = plSrc[15];
            plSrc += 16;
            plDst[0] = a;
            plDst[1] = b;
            plDst[2] = c;
            plDst[3] = d;
            plDst[4] = a1;
            plDst[5] = b1;
            plDst[6] = c1;
            plDst[7] = d1;
            plDst[8] = a2;
            plDst[9] = b2;
            plDst[10] = c2;
            plDst[11] = d2;
            plDst[12] = a3;
            plDst[13] = b3;
            plDst[14] = c3;
            plDst[15] = d3;
            plDst += 16;
            len -= 64;
        }
        while(len >= 16) {
            a =  plSrc[0];
            b =  plSrc[1];
            c =  plSrc[2];
            d =  plSrc[3];
            plSrc += 4;
            plDst[0] = a;
            plDst[1] = b;
            plDst[2] = c;
            plDst[3] = d;
            plDst += 4;
            len -= 16;
        }
    //}
    char * pcDst = (char *) plDst;
    char const * pcSrc = (char const *) plSrc;

    while (len--)
    {
        *pcDst++ = *pcSrc++;
    }
    return (dst);
}
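
Before benchmarking a replacement memcpy it is worth checking that it
copies correctly for every tail length. The following is a minimal sketch
of such a check, not something from the original post: the names
copy_fn_t, copy_matches_reference(), BUF_WORDS and GUARD_WORDS are all
hypothetical. It only exercises word-aligned buffers, since the routine
above has its alignment test commented out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: any function with memcpy's signature. */
typedef void *(*copy_fn_t)(void *, const void *, size_t);

enum { BUF_WORDS = 64, GUARD_WORDS = 4 };

/* Hypothetical checker: returns 1 if copy_fn copies exactly `len` bytes
 * and returns `dst` for every len from 0 up to the buffer size, else 0. */
int copy_matches_reference(copy_fn_t copy_fn)
{
    /* Arrays of unsigned long keep both buffers word-aligned. */
    static unsigned long src_words[BUF_WORDS];
    static unsigned long dst_words[BUF_WORDS + GUARD_WORDS];
    unsigned char *src = (unsigned char *)src_words;
    unsigned char *dst = (unsigned char *)dst_words;
    const size_t max_len = sizeof src_words;
    const size_t guard = GUARD_WORDS * sizeof(unsigned long);
    size_t len, i;

    /* Fill the source with a non-repeating-looking byte pattern. */
    for (i = 0; i < max_len; i++)
        src[i] = (unsigned char)(i * 7 + 1);

    for (len = 0; len <= max_len; len++) {
        /* Poison the destination so stale or stray bytes are visible. */
        memset(dst, 0xAA, max_len + guard);

        if (copy_fn(dst, src, len) != dst)
            return 0;                   /* wrong return value */

        for (i = 0; i < len; i++)
            if (dst[i] != src[i])
                return 0;               /* copied data differs */

        for (i = len; i < len + guard; i++)
            if (dst[i] != 0xAA)
                return 0;               /* wrote past the end */
    }
    return 1;
}
```

Calling copy_matches_reference() with the function under test (built under
a different name so that it does not replace the memset/memcpy used by the
checker itself) should return 1. The buffer is large enough to reach the
64-byte loop, the 16-byte loop and the byte-wise tail.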

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
