William Stein wrote:
> On Sun, Dec 27, 2009 at 4:05 PM, Dr David Kirkby <drkir...@gmail.com> wrote:
>>
>> On Dec 27, 8:16 pm, mhampton <hampto...@gmail.com> wrote:
>>> Seems about 8 times slower at some basic tests I have compared to
>>> sagenb.
>
> What basic tests? Big integer arithmetic -- which relies on MPIR --
> is often an order of magnitude slower, because of lack of good
> assembly optimization for Sparc. It probably *could* be much faster,
> if somebody were to invest in supporting a developer to write code to
> make it fast.
>
> David -- I don't get how you can have a 30x slowdown with building
> unless you're building in /home instead of /scratch, so that the true
> slowdown is the crappy filesystem.
William,

I think we really need a Sage benchmark written which tests various aspects - single-threaded, multi-threaded, integer, machine-precision floating point, etc. But in the absence of that, I'm attaching two C programs which compute prime_pi(1000000) by the brute-force method even I can understand! No fancy number theory here. (Strictly speaking, the code also counts 1 as prime, so the count it prints is one too high - but that does not affect the timings.)

These were written by Andrew Gabriel at Sun, when defending the performance of the T5240. (To be fair to Andrew, he does not normally write C code with these sorts of variable names, but was doing so to replicate a program posted by someone else.) I've called the two versions serial.c and parallel.c. The only change from Andrew's code is to increase N from 100,000 to 1,000,000.

In each case, they were compiled with gcc -O3; the gcc version was either 4.3.3 or 4.3.4. Times are all quoted as minutes:seconds. There is data for three different Solaris machines and one HP-UX machine.

1) 't2' (SPARC)

   Sun T5240, 16 cores, 128 threads, 1167 MHz

   Serial version:   20:18.411
   Parallel version:  0:16.430

   That's a speedup (serial time / parallel time) of 74x on 't2'.

2) My own Sun Ultra 27 (Xeon), made in 2009

   Sun Ultra 27, 3.333 GHz Intel Xeon, 4 cores, 8 threads

   Serial version:    0:58.675
   Parallel version:  0:15.851

   That's a speedup of 3.7x on my Ultra 27.

3) A Sun Blade 2000 of mine, made in 2002

   Sun Blade 2000, 2 x 1200 MHz

   Serial version:   15:07.146
   Parallel version:  8:27.613

   That's a speedup of 1.8x on this dual-processor Sun Blade 2000.

4) HP C3600, made in 2000

   1 x 552 MHz

   Serial version:   62:31.1
   Parallel version: N/A

   The HP C3600 has only one CPU, so there is no point measuring the parallel performance.

The main points for this one integer task:

1) The single-threaded performance of my Ultra 27 is 20.8x faster than 't2'!! So Marshall Hampton's statement about 't2' - "Seems about 8 times slower at some basic tests I have compared to sagenb." - does not surprise me at all.
That is not quite as much as the 30x speed differential seen when building code, but it is still a very significant difference, showing just how bad 't2' is for some tasks.

2) My Blade 2000, built in 2002, still has better single-threaded integer performance than 't2'.

3) The performance of 't2' was increased by a factor of 74 by exploiting it better.

4) The parallel performance of 't2' is almost identical to the parallel performance of my Ultra 27. The Ultra 27 is clocked a lot faster, but has far fewer cores and threads; the two seem to balance each other out.

5) On this task, the 10-year-old single-processor HP C3600 is the slowest - even slower than 't2'.

I know this is not a very conclusive test, but I think it shows just how bad 't2' is for some tasks. Unless you are going to get 100 simultaneous users on that machine, I suspect it will always be poorer than other machines. Without testing it, I don't know what 't2's floating-point performance is like, but I would suspect it is very poor, as floating-point performance is not needed in the market the T5240 is aimed at.

I'd be interested in the times for some other machines, but really we do need a Sage benchmark to get some idea of how good or bad a machine is for running Sage.

Dave

--
To post to this group, send an email to sage-devel@googlegroups.com
To unsubscribe from this group, send an email to sage-devel+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/sage-devel
URL: http://www.sagemath.org
/* serial.c */
#include <stdio.h>

int main(int argc, char *argv[])
{
    int I1;                /* candidate number */
    int I2 = 1000000;      /* upper limit N */
    int I3;                /* trial divisor */
    int I4;                /* divisor limit (I1 / 2) */
    int I5;                /* remainder */
    int I6 = 0;            /* prime count */
    int I7 = 0;            /* last prime found */

    printf("N primes up to ");
    printf("%d", I2);
    printf(" is: ");
    for (I1 = 1; I1 <= I2; I1++) {  /* note: starting at 1 means 1 is counted as "prime" */
        I4 = I1 / 2;
        for (I3 = 2; I3 <= I4; I3++) {
            I5 = I1 % I3;
            if (!I5)
                break;              /* divisor found - not prime */
        }
        if (I3 > I4) {              /* no divisor found */
            I6++;
            I7 = I1;
        }
    }
    printf("%d\n", I6);
    printf("last is: %d\n", I7);
    return 0;
}
/* parallel.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>

#define MAXTHREADS 1024

struct args {
    int start;
    int end;
};

static struct args threadargs[MAXTHREADS];
static pthread_mutex_t mux;
static int threads;
static int global_I2 = 1000000;  /* upper limit N */
static int global_I6 = 0;        /* total prime count */
static int global_I7 = 0;        /* largest prime found */

static void *
primes(void *arg)
{
    struct args *args = (struct args *)arg;
    int I1 = args->start;
    int I2 = args->end;
    int I3, I4, I5;
    int I6 = 0;   /* private count - no locking needed in the hot loop */
    int I7 = 0;   /* largest prime in this thread's subrange */

    for (; I1 <= I2; I1++) {
        I4 = I1 / 2;
        for (I3 = 2; I3 <= I4; I3++) {
            I5 = I1 % I3;
            if (!I5)
                break;
        }
        if (I3 > I4) {
            I6++;
            I7 = I1;
        }
    }

    /* We're done.  Update globals with our results */
    pthread_mutex_lock(&mux);
    global_I6 += I6;
    if (I7 > global_I7)
        global_I7 = I7;
    if (--threads > 0) {
        /* Other threads still running */
        pthread_mutex_unlock(&mux);
        pthread_detach(pthread_self());  /* no zombie */
        pthread_exit(NULL);              /* goodbye */
        /*NOTREACHED*/
    }
    pthread_mutex_unlock(&mux);

    /* We happen to be the last thread to finish - report results */
    printf("N primes up to ");
    printf("%d", global_I2);
    printf(" is: ");
    printf("%d\n", global_I6);
    printf("last is: %d\n", global_I7);
    exit(0);
}

int main(int argc, char *argv[])
{
    int i;

    if (argc > 1)
        global_I2 = atoi(argv[1]);
    if (argc > 2)
        threads = atoi(argv[2]);
    else
        threads = sysconf(_SC_NPROCESSORS_ONLN) * 2;  /* default threads to twice the number of CPUs */
    if (threads > MAXTHREADS)
        threads = MAXTHREADS;
    if (threads > global_I2 / 10)  /* that's just silly */
        threads = global_I2 / 10;
    printf("Using %d threads\n", threads);

    pthread_mutex_init(&mux, NULL);

    /* setup threadargs array: split the range up across all the threads */
    for (i = 0; i < threads; i++) {
        threadargs[i].start = (global_I2 / threads * i) + 1;
        threadargs[i].end = global_I2 / threads * (i + 1);
    }
    /* correct any rounding error on the last one */
    threadargs[threads - 1].end = global_I2;

    /* start all the threads; hold the mutex so no worker can merge
       its results (and decrement `threads`) before all are created */
    pthread_mutex_lock(&mux);
    for (i = 0; i < threads; i++) {
        pthread_t tid;

        errno = pthread_create(&tid, NULL, primes, &threadargs[i]);
        if (errno != 0) {
            perror("pthread_create");
            exit(1);
        }
    }
    pthread_mutex_unlock(&mux);
    pthread_detach(pthread_self());  /* no zombie */
    pthread_exit(NULL);              /* main thread finished - goodbye */
    /*NOTREACHED*/
}