Hi!

If they are 8-core Intel machines, I believe this is the case:

*) Each pair of cores shares an L2 cache, so using two cores that share a cache will probably reduce performance.
*) Each quad-core CPU has its own memory bus (Dual Independent Bus), so using more than one core on a quad-core CPU can reduce performance if the bus is a bottleneck.
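
If you want to check which cores share an L2 cache on your own nodes, Linux exposes this through sysfs. This is just a quick sketch: index2 is usually the L2 cache, but check the "level" file to be sure, and older kernels may only have shared_cpu_map (a hex mask) instead of shared_cpu_list:

   $ grep . /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_list

Cores that list each other there share that cache.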

In your first case, both the L2 cache and the memory bus are shared. In your second case, only the memory bus is shared. In your third case, neither the L2 cache nor the memory bus is shared (the Linux scheduler maps processes onto different CPUs if possible).
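
To see where the scheduler actually put the ranks, the PSR column of ps shows the processor each process last ran on (assuming the binary is named lu.C.8):

   $ ps -o pid,psr,comm -C lu.C.8

You can compare the PSR values against the cache-sharing information above.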

If you want another performance case, you can map the processes so that they run on 4 different nodes but share the L2 cache. This can be done with something like: mpirun -n 8 taskset -c 0,4 ./lu.C.8 (core IDs 0 and 4 share an L2 cache on our system, at least). I guess you are not that interested, but it is possible! :)
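
If you ever want each rank nailed to a specific core instead of floating over the pair, a small wrapper script works too. This is just a sketch, and it assumes your Open MPI release exports OMPI_COMM_WORLD_LOCAL_RANK (newer releases do, older ones may not):

   #!/bin/sh
   # pin_by_rank.sh -- hypothetical wrapper: put local rank 0 on core 0 and
   # any other local rank on core 4 (a pair that shares an L2 cache on our
   # boxes; adjust the core IDs for your topology).
   case "$OMPI_COMM_WORLD_LOCAL_RANK" in
       0) CORE=0 ;;
       *) CORE=4 ;;
   esac
   exec taskset -c "$CORE" "$@"

and then launch with: mpirun -n 8 --hostfile ./hostfile ./pin_by_rank.sh ./lu.C.8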

In addition, I don't believe there is much communication happening in the LU benchmark compared to the other NAS benchmarks.

All in all, I agree with both of you. Both the L2-cache and the memory bus are probably slowing you down.

As for the sys% time, I believe it is the NIC driver: the more inter-node communication, the more sys%. The shared-memory communication module (the sm BTL) does all its communication in user space, as you noticed.
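
One way to convince yourself of that: for a single-node run you can restrict Open MPI to the shared-memory and self BTLs, so the NIC driver is not involved at all (this of course only works while all ranks are on one node):

   mpirun -n 8 --mca btl sm,self --mca mpi_paffinity_alone 1 ./lu.C.8

If sys% stays near zero there, the sys time you see in the multi-node runs is almost certainly TCP/NIC work.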


Best regards,

-Torje S. Henriksen

On Sep 30, 2008, at 6:55 PM, Jeff Squyres wrote:

Are these Intel-based machines? I have seen the effects mentioned earlier in this thread, where having all 8 cores banging on memory pretty much kills performance on the UMA-style Intel 8-core machines. I'm not a hardware expert, but I've stayed away from buying 8-core servers for exactly this reason. AMD has been NUMA all along, and Intel's newer chips are NUMA to alleviate some of this bus pressure.
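
For what it's worth, a quick way to see whether a given box is UMA or NUMA from Linux (assuming numactl is installed):

   $ numactl --hardware

A UMA machine reports a single node owning all the memory; a NUMA machine reports one node per socket (or per memory controller), each with its own memory.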

~2x performance loss (between 8 and 4 cores on a single node) seems a bit excessive, but I guess it could happen...? (I don't have any hard numbers either way)


On Sep 29, 2008, at 2:30 PM, Leonardo Fialho wrote:

Hi All,

I'm doing some tests on a multi-core machine (8 cores per node) with the NAS benchmarks, and something that I consider strange is happening...

I'm using only one NIC and processor affinity:
./bin/mpirun
-n 8
--hostfile ./hostfile
--mca mpi_paffinity_alone 1
--mca btl_tcp_if_include eth1
--loadbalance
./codes/nas/NPB3.3/NPB3.3-MPI/bin/lu.C.8

I have enough memory to run this application on a single node, but:

1) If I use one node (8 cores), the "user" % is around 100% per core. The execution time is around 430 seconds.

2) If I use 2 nodes (4 cores on each node), the "user" % is around 95% per core and the "sys" % is 5%. The execution time is around 220 seconds.

3) If I use 4 nodes (2 cores on each node), the "user" % is around 85% per core and the "sys" % is 15%. The execution time is around 200 seconds.

Well... the questions are:

A) Shouldn't the execution time in case "1" (only sm communication, no?) be smaller than in cases "2" and "3"? Cache problems?

B) Why is there "sys" time when communicating between nodes? The NIC driver? And why does this time increase when I balance the load across the nodes?

Thanks,
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478



--
Jeff Squyres
Cisco Systems




