Hi!

If they are 8-core Intel machines, I believe this is the case:

*) Each pair of cores shares an L2 cache, so using two cores that share a cache will probably reduce performance.
*) Each quad-core CPU has its own memory bus (Dual Independent Bus), so using more than one core on a quad-core CPU can reduce performance if the bus is a bottleneck.
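
If you want to check which cores share an L2 cache on your own nodes, Linux exposes this through sysfs. This is just a quick sketch: index2 is usually the L2 cache, but check the "level" file to be sure, and older kernels may only have shared_cpu_map (a hex mask) instead of shared_cpu_list:

   $ grep . /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_list

Cores that list each other there share that cache.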

In your first case, both the L2 cache and the memory bus are shared. In your second case, only the memory bus is shared. In your third case, neither the L2 cache nor the memory bus is shared (the Linux scheduler maps processes onto different CPUs if possible).
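
To see where the scheduler actually put the ranks, the PSR column of ps shows the processor each process last ran on (assuming the binary is named lu.C.8):

   $ ps -o pid,psr,comm -C lu.C.8

You can compare the PSR values against the cache-sharing information above.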

If you want another performance case, you can map the processes so that they run on 4 different nodes but share the L2 cache. This can be done with something like: mpirun -n 8 taskset -c 0,4 ./lu.C.8 (core IDs 0 and 4 share an L2 cache on our system, at least). I guess you are not that interested, but it is possible! :)
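
If you ever want each rank nailed to a specific core instead of floating over the pair, a small wrapper script works too. This is just a sketch, and it assumes your Open MPI release exports OMPI_COMM_WORLD_LOCAL_RANK (newer releases do, older ones may not):

   #!/bin/sh
   # pin_by_rank.sh -- hypothetical wrapper: put local rank 0 on core 0 and
   # any other local rank on core 4 (a pair that shares an L2 cache on our
   # boxes; adjust the core IDs for your topology).
   case "$OMPI_COMM_WORLD_LOCAL_RANK" in
       0) CORE=0 ;;
       *) CORE=4 ;;
   esac
   exec taskset -c "$CORE" "$@"

and then launch with: mpirun -n 8 --hostfile ./hostfile ./pin_by_rank.sh ./lu.C.8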

In addition, I don't believe there is much communication happening in the LU benchmark compared to the other NAS benchmarks.

All in all, I agree with both of you. Both the L2-cache and the memory bus are probably slowing you down.

As for the sys% time, I believe it is the NIC driver: the more inter-node communication, the more sys%. The shared-memory communication module (the sm BTL) does all its communication in user space, as you noticed.
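
One way to convince yourself of that: for a single-node run you can restrict Open MPI to the shared-memory and self BTLs, so the NIC driver is not involved at all (this of course only works while all ranks are on one node):

   mpirun -n 8 --mca btl sm,self --mca mpi_paffinity_alone 1 ./lu.C.8

If sys% stays near zero there, the sys time you see in the multi-node runs is almost certainly TCP/NIC work.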


Best regards,

-Torje S. Henriksen

On Sep 30, 2008, at 6:55 PM, Jeff Squyres wrote:

Are these Intel-based machines? I have seen the effects mentioned earlier in this thread, where having all 8 cores banging on memory pretty much kills performance on the UMA-style Intel 8-core machines. I'm not a hardware expert, but I've stayed away from buying 8-core servers for exactly this reason. AMD has been NUMA all along, and Intel's newer chips are NUMA to alleviate some of this bus pressure.
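
For what it's worth, a quick way to see whether a given box is UMA or NUMA from Linux (assuming numactl is installed):

   $ numactl --hardware

A UMA machine reports a single node owning all the memory; a NUMA machine reports one node per socket (or per memory controller), each with its own memory.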

~2x performance loss (between 8 and 4 cores on a single node) seems a bit excessive, but I guess it could happen...? (I don't have any hard numbers either way)


On Sep 29, 2008, at 2:30 PM, Leonardo Fialho wrote:

Hi All,

I'm doing some tests on a multi-core machine (8 cores per node) with the NAS benchmarks, and something that I consider strange is happening...

I'm using only one NIC and processor affinity:
./bin/mpirun
-n 8
--hostfile ./hostfile
--mca mpi_paffinity_alone 1
--mca btl_tcp_if_include eth1
--loadbalance
./codes/nas/NPB3.3/NPB3.3-MPI/bin/lu.C.8

I have enough memory to run this application on a single node, but:

1) If I use one node (8 cores), the "user" % is around 100% per core. The execution time is around 430 seconds.

2) If I use 2 nodes (4 cores on each node), the "user" % is around 95% per core and the "sys" % is 5%. The execution time is around 220 seconds.

3) If I use 4 nodes (2 cores on each node), the "user" % is around 85% per core and the "sys" % is 15%. The execution time is around 200 seconds.

Well... the questions are:

A) Shouldn't the execution time in case "1" (only sm communication, no?) be smaller than in cases "2" and "3"? Cache problems?

B) Why is there "sys" time when communicating between nodes? The NIC driver? And why does this time increase when I balance the load across the nodes?

Thanks,
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478



--
Jeff Squyres
Cisco Systems




