To use a multi-core CPU efficiently, you should give each core a larger task, which reduces communication overhead.
Weimin

----- Original Message -----
From: "terry mcintyre" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Tuesday, January 01, 2008 10:01 AM
Subject: [computer-go] OpenMP / Quad Core experiments

I have been tinkering with OpenMP and my new HP Quad Intel 6600. Wrote a small program to compute the Taylor series of e and pi, just for exploration, and I've found some interesting data points.

I am using gcc 4.2 and 4.3.1 - the latter being the head of the SVN repository. Kubuntu 7.10, both 32 and 64 bit versions.

One of my test programs is attached. Oddly, the OpenMP version is no faster than the single-threaded version - but it does keep the cores busier. It is possible that I am doing something wrong, as I am new to OpenMP.

I was so puzzled by the results that I tried the same program on my AMD Athlon X2. The older AMD Athlon duo, with a 1 GHz clock, 64-bit Fedora Core 7, is 20% faster than the 1.6GHz quad 6600.

I've also run the --monte-carlo version of GnuGo 4.7.11 on both machines, with similar results.

The compilation line is:

    gcc -Wall -fopenmp -O3 -march=native -lgomp taylor3.c -o taylor3

( the code is an adaptation of code from the OpenMP tutorial at http://kallipolis.com/openmp/ - which leads to another interesting discovery. The original code yields incorrect results for pi; the two parallel branches use the same index variable i, and one stomps on the other. Is this a feature of the gcc version of OpenMP? I'll be testing Intel's icc soon. )

I'll be doing more testing this weekend, but I'd like to know if anyone has compared the Intel 6600 to other processors. So far, it sure looks like a tired old nag on her last ride to the glue factory; I'm wishing that I had waited for the Penryn version.
One more puzzle: this processor is rated at 2.4GHz, but cpuinfo tells a different story:

[EMAIL PROTECTED]:/proc$ cat cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
stepping        : 11
cpu MHz         : 1596.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 4804.08
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Terry McIntyre <[EMAIL PROTECTED]>

"Wherever is found what is called a paternal government, there is found state education. It has been discovered that the best way to insure implicit obedience is to commence tyranny in the nursery."

Benjamin Disraeli, Speech in the House of Commons [June 15, 1874]

--------------------------------------------------------------------------------

/*
 * taylor.c
 *
 * calculate e and pi by their taylor expansions and multiply them
 * together.
 *
 * moved local variables inside parallel blocks ( performance tweak? )
 */

#include <omp.h>
#include <stdio.h>

#define num_steps 20000000

int main(int argc, char *argv[])
{
    double start, stop; /* times of beginning and end of procedure */
    double efinal, pifinal, product;

    /* start the timer. omp_get_wtime() returns wall-clock time; clock()
       adds up the CPU time of all threads, so it cannot show a speedup */
    start = omp_get_wtime();

    /* calculate e and pi in parallel */
    #pragma omp parallel sections shared(efinal,pifinal)
    {
        #pragma omp section
        {
            /* calculate e using Taylor approximation */
            register double e, factorial;
            register int j;

            e = 1;
            factorial = 1;
            for (j = 1; j < num_steps; j++) {
                factorial *= j;
                e += 1.0/factorial;
            }
            efinal = e;
        } /* e section */

        #pragma omp section
        {
            /* calculate pi expansion */
            register int i;
            register double pi;

            pi = 0;
            for (i = 0; i < num_steps*10; i++) {
                /* we want 1/1 - 1/3 + 1/5 - 1/7 etc.
                   i*4 steps by fours (0, 4, 8, 12...) and we take
                     1/(0+1) =  1/1
                   - 1/(0+3) = -1/3
                     1/(4+1) =  1/5
                   - 1/(4+3) = -1/7
                   and so on */
                pi += 1.0/(i*4.0 + 1.0);
                pi -= 1.0/(i*4.0 + 3.0);
            }
            pi = pi * 4.0;
            pifinal = pi;
        } /* pi section */
    } /* omp sections */
    /* threads rejoin here */

    product = efinal * pifinal;
    stop = omp_get_wtime();

    printf("e %f pi %f product = %f reached in %.3f seconds\n",
           efinal, pifinal, product, stop - start);
    return 0;
}

--------------------------------------------------------------------------------

_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/
