William Stein wrote:
> On Sun, Dec 27, 2009 at 4:05 PM, Dr David Kirkby <drkir...@gmail.com> wrote:
>>
>> On Dec 27, 8:16 pm, mhampton <hampto...@gmail.com> wrote:
>>> Seems about 8 times slower at some basic tests I have compared to
>>> sagenb.
> 
> What basic tests?     Big integer arithmetic  -- which relies on MPIR
> -- is often an order of magnitude slower, because of lack of good
> assembly optimization for Sparc.  It probably *could* be much faster,
> if somebody were to invest in supporting a developer to write code to
> make it fast.
> 
> David -- I don't get how you can have a 30x slowdown with building
> unless you're building in /home instead of /scratch, so that the true
> slowdown is the crappy filesystem.

William,

I think we really need a Sage benchmark that tests various aspects:
single-threaded, multi-threaded, integer, machine precision, etc. But in the
absence of that, I'm attaching two C programs which compute prime_pi(1000000)
by a brute-force method that even I can understand! No fancy number theory here.

These were written by Andrew Gabriel at Sun, when defending the performance of
the T5240. (To be fair to Andrew, he does not normally write C code with this
sort of variable name, but was doing so to replicate a program posted by
someone else.)

I've called the two versions serial.c and parallel.c. The only change from
Andrew's code is to increase N from 100,000 to 1,000,000.

In each case, they were compiled with gcc -O3. The gcc version was either 4.3.3
or 4.3.4.

Times are all quoted as minutes:seconds. There is data for three different 
Solaris machines and one HP-UX machine.

1) 't2' (SPARC)

T5240 16 cores, 128 threads, 1167 MHz
Serial version  : 20:18.411
Parallel version: 0:16.430

That's a speedup (serial time / parallel time) of 74x on 't2'.

2) My own Sun Ultra 27 (Xeon, made in 2009)

Sun Ultra 27 (3.333 GHz Intel Xeon, 4 cores, 8 threads)
Serial version  : 0:58.675
Parallel version: 0:15.851

That's a speedup of 3.7x on my Ultra 27.

3) Sun Blade 2000 of mine (made in 2002)

Sun Blade 2000 (2 x 1200 MHz)
Serial version  : 15:07.146
Parallel version: 8:27.613

That's a speedup of 1.8x on this dual processor Sun Blade 2000.

4) HP C3600 (made in 2000)

HP C3600 (1 x 552 MHz)
Serial version  : 62:31.1
Parallel version: N/A

The HP C3600 has only one CPU, so there is no point measuring the parallel 
performance.
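
As a cross-check of the speedup figures quoted above, here is a minimal sketch
in C that converts the minutes:seconds readings to seconds and divides serial
time by parallel time. The function names (to_seconds, speedup,
print_speedups) are illustrative only and are not part of the attached
programs.

```c
#include <stdio.h>

/* Convert a minutes:seconds reading to seconds. */
double to_seconds(int minutes, double seconds)
{
        return minutes * 60.0 + seconds;
}

/* Speedup as defined above: serial time / parallel time. */
double speedup(double serial_s, double parallel_s)
{
        return serial_s / parallel_s;
}

/* Print the speedup for the three machines that have parallel timings. */
void print_speedups(void)
{
        printf("t2        : %.1fx\n",
            speedup(to_seconds(20, 18.411), to_seconds(0, 16.430)));
        printf("Ultra 27  : %.1fx\n",
            speedup(to_seconds(0, 58.675), to_seconds(0, 15.851)));
        printf("Blade 2000: %.1fx\n",
            speedup(to_seconds(15, 7.146), to_seconds(8, 27.613)));
}
```

Calling print_speedups() from a main() should reproduce the rounded 74x, 3.7x
and 1.8x figures from the timings listed above.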

The main points for this one integer task:

1) The single-threaded performance of my Ultra 27 is 20.8x faster than
't2'!!

So Marshall Hampton's statement about 't2' "Seems about 8 times slower at some 
basic tests I have compared to sagenb." does not surprise me at all.

That is not quite as much as the 30x speed differential when building code, but
it is still a very significant difference, showing just how bad 't2' is for
some tasks.

2) My Blade 2000, built in 2002, still has a single-threaded integer
performance which is better than 't2'.

3) The performance of 't2' was increased by a factor of 74 by running the
parallel version, which exploits its many hardware threads.

4) The parallel performance of 't2' is almost identical to the parallel
performance of my Ultra 27. The Ultra 27 is clocked a lot faster, but has far
fewer cores and threads. The two seem to balance each other.

5) On this task, the 10-year-old single-processor HP C3600 is the slowest, even
slower than 't2'.
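
The single-threaded comparisons in points 1, 2 and 5 follow directly from the
serial timings. A small sketch of that arithmetic, with macro and function
names that are purely illustrative:

```c
#include <stdio.h>

/* Serial run times in seconds, converted from the timings quoted above. */
#define T2_SERIAL        (20 * 60 + 18.411)     /* 't2' (T5240) */
#define ULTRA27_SERIAL   (0 * 60 + 58.675)      /* Sun Ultra 27 */
#define BLADE2000_SERIAL (15 * 60 + 7.146)      /* Sun Blade 2000 */
#define C3600_SERIAL     (62 * 60 + 31.1)       /* HP C3600 */

/* How many times faster than 't2' a machine ran the serial program.
 * A value below 1 means the machine was slower than 't2'. */
double times_faster_than_t2(double serial_s)
{
        return T2_SERIAL / serial_s;
}

/* Print the serial-speed ratio of each machine relative to 't2'. */
void print_serial_comparison(void)
{
        printf("Ultra 27  : %.2f\n", times_faster_than_t2(ULTRA27_SERIAL));
        printf("Blade 2000: %.2f\n", times_faster_than_t2(BLADE2000_SERIAL));
        printf("HP C3600  : %.2f\n", times_faster_than_t2(C3600_SERIAL));
}
```

The Ultra 27 ratio comes out at roughly 20.8, the Blade 2000 a little above 1,
and the HP C3600 below 1, which matches the ordering described in the points
above.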

I know this is not a very conclusive test, but I think it shows just how bad 
't2' is for some tasks. Unless you are going to get 100 simultaneous users on 
that machine, I suspect it will always be poorer than other machines.

Without testing it, I don't know what 't2's floating point performance is like,
but I would suspect it is very poor, as floating point performance is not
needed in the market the T5240 is aimed at.

I'd be interested in the times for some other machines, but really we do need a 
Sage benchmark to get some idea of how good/bad a machine is for running Sage.
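
To make the idea of a benchmark a little more concrete, here is a rough sketch
in C of a harness that times one integer kernel and one floating-point kernel.
It is not a proposed Sage benchmark: the kernels, the function names and the
use of clock() (CPU time, which only makes sense for single-threaded code) are
all assumptions chosen for illustration.

```c
#include <stdio.h>
#include <time.h>

/* Integer kernel: count primes up to n by trial division up to i / 2. */
int count_primes(int n)
{
        int count = 0;

        for (int i = 2; i <= n; i++) {
                int d;

                for (d = 2; d <= i / 2; d++)
                        if (i % d == 0)
                                break;
                if (d > i / 2)
                        count++;
        }
        return count;
}

/* Floating-point kernel: partial sum of the Leibniz series for pi. */
double leibniz_pi(int terms)
{
        double sum = 0.0;
        double sign = 1.0;

        for (int k = 0; k < terms; k++) {
                sum += sign / (2.0 * k + 1.0);
                sign = -sign;
        }
        return 4.0 * sum;
}

/* CPU seconds used since `start`. */
double cpu_seconds_since(clock_t start)
{
        return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Run each kernel once and print one result line per kernel. */
void run_benchmarks(int n, int terms)
{
        clock_t t0 = clock();
        int primes = count_primes(n);

        printf("integer : %d primes up to %d in %.3f s\n",
            primes, n, cpu_seconds_since(t0));

        t0 = clock();
        double pi = leibniz_pi(terms);

        printf("floating: pi ~ %.6f from %d terms in %.3f s\n",
            pi, terms, cpu_seconds_since(t0));
}
```

A call such as run_benchmarks(100000, 100000000) from main() would give one
integer and one floating-point timing per machine. A multi-threaded test would
need wall-clock timing instead of clock().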

Dave

/* serial.c: count the primes up to N in a single thread by trial division. */
#include <stdio.h>
int
main(int argc, char *argv[])
{
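        int I1;                 /* candidate number being tested */
        int I2 = 1000000;       /* N: upper limit of the search */
        int I3;                 /* trial divisor */
        int I4;                 /* largest divisor to try (I1 / 2) */
        int I5;                 /* remainder of I1 / I3 */
        int I6 = 0;             /* number of primes found so far */
        int I7 = 0;             /* most recent (largest) prime found */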

        printf("N primes up to ");
        printf("%d", I2);
        printf(" is: ");
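        /*
         * Note: the loop starts at 1. For I1 = 1 the inner loop never runs,
         * so 1 is counted as a prime and I6 ends up as prime_pi(N) + 1.
         */
        for (I1 = 1; I1 <= I2; I1++) {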
                I4 = I1 / 2;
                for (I3 = 2; I3 <= I4; I3++) {
                        I5 = I1 % I3;
                        if (!I5)
                                break;
                }
                if (I3 > I4) {
                        I6++;
                        I7 = I1;
                }
        }
        printf("%d\n", I6);
        printf("last is: %d\n", I7);
        return 0;

} 

/* parallel.c: the same trial-division count, split across POSIX threads. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>

#define MAXTHREADS 1024

struct args {
        int start;
        int end;

};

static struct args threadargs[MAXTHREADS];

static pthread_mutex_t mux;
static int threads;
static int global_I2 = 1000000;
static int global_I6 = 0;
static int global_I7 = 0;

static void *
primes(void *arg)
{
        struct args *args = (struct args *)arg;
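        int I1 = args->start;   /* first candidate in this thread's range */
        int I2 = args->end;     /* last candidate in this thread's range */
        int I3;                 /* trial divisor */
        int I4;                 /* largest divisor to try (I1 / 2) */
        int I5;                 /* remainder of I1 / I3 */
        int I6 = 0;             /* primes found in this range */
        int I7 = 0;             /* largest prime found; stays 0 if none */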

        for (; I1 <= I2; I1++) {
                I4 = I1 / 2;
                for (I3 = 2; I3 <= I4; I3++) {
                        I5 = I1 % I3;
                        if (!I5)
                                break;
                }
                if (I3 > I4) {
                        I6++;
                        I7 = I1;
                }
        }

        /* We're done. Update globals with our results */
        pthread_mutex_lock(&mux);
        global_I6 += I6;
        if (I7 > global_I7)
                global_I7 = I7;
        if (--threads > 0) {
                /* Other threads still running */
                pthread_mutex_unlock(&mux);
                pthread_detach(pthread_self()); /* no zombie */
                pthread_exit(NULL);             /* goodbye */
                /*NOTREACHED*/
        }
        pthread_mutex_unlock(&mux);

        /* We happen to be the last thread to finish - report results */

        printf("N primes up to ");
        printf("%d", global_I2);
        printf(" is: ");
        printf("%d\n", global_I6);
        printf("last is: %d\n", global_I7);

        exit(0);

}

int
main(int argc, char *argv[])
{
        int i;

        if (argc > 1)
                global_I2 = atoi(argv[1]);

        if (argc > 2)
                threads = atoi(argv[2]);
        else
                threads = sysconf(_SC_NPROCESSORS_ONLN) * 2;
                /* default threads to twice the number of CPUs */

        if (threads > MAXTHREADS)
                threads = MAXTHREADS;

        if (threads > global_I2 / 10)        /* that's just silly */
                threads = global_I2 / 10;

        if (threads < 1)        /* tiny N or a bad argument: avoid 0 threads */
                threads = 1;

        printf("Using %d threads\n", threads);

        pthread_mutex_init(&mux, NULL);

        /* setup threadargs array */
        for (i = 0; i < threads; i++) {
                /* split the range up across all the threads */
                threadargs[i].start = (global_I2 / threads * i) + 1;
                threadargs[i].end = global_I2 / threads * (i + 1);
        }
        /* correct any rounding error on last one */
        threadargs[threads-1].end = global_I2;

        /* start all the threads */
        pthread_mutex_lock(&mux);
        for (i = 0; i < threads; i++) {
                pthread_t tid;

                errno = pthread_create(&tid, NULL, primes, &threadargs[i]);
                if (errno != 0) {
                        perror("pthread_create");
                        exit(1);
                }
        }
        pthread_mutex_unlock(&mux);

        pthread_detach(pthread_self()); /* no zombie */
        pthread_exit(NULL);             /* main thread finished - goodbye */
        /*NOTREACHED*/

} 
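
The way main() in parallel.c divides 1..N between threads can be looked at on
its own. The sketch below repeats that arithmetic in a standalone function;
the names split_range and struct range are illustrative and do not appear in
the attached program, and it assumes 1 <= threads <= n.

```c
#include <stdio.h>

struct range {
        int start;
        int end;
};

/*
 * Fill out[0..threads-1] with the same sub-ranges that parallel.c computes.
 * Assumes 1 <= threads <= n.
 */
void split_range(int n, int threads, struct range *out)
{
        int chunk = n / threads;        /* integer division rounds down */

        for (int i = 0; i < threads; i++) {
                out[i].start = chunk * i + 1;
                out[i].end = chunk * (i + 1);
        }
        out[threads - 1].end = n;       /* last range absorbs the remainder */
}

/* Print one "thread: start..end" line per sub-range. */
void print_ranges(const struct range *r, int threads)
{
        for (int i = 0; i < threads; i++)
                printf("thread %d: %d..%d\n", i, r[i].start, r[i].end);
}
```

For example, if sysconf() reports 128 online processors on 't2' (an assumption
based on its 128 hardware threads), the default would be 256 threads, and
N = 1000000 would be cut into ranges of 3906 numbers each, with the last range
extended to end at 1000000.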
