Hi!
If they are 8-core Intel machines, I believe this is the case:
*) Each pair of cores shares an L2 cache, so using two cores that share
a cache will probably reduce performance (they compete for the same cache).
*) Each quad-core CPU has its own memory bus (Dual Independent Bus),
so using more than one core on a quad-core CPU can reduce performance if
the bus is a bottleneck.
In your first case, both the L2 cache and the memory bus are shared. In
your second case, only the memory bus is shared. In your third case,
neither the L2 cache nor the memory bus is shared (the Linux scheduler
maps processes so that they run on different CPUs if possible).
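If you are not sure which cores share an L2 cache on your nodes, you can
check it under Linux with something like this (on a reasonably recent
kernel; index2 is usually the L2 cache, but the "level" file in the same
directory tells you for sure):

  cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list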
If you want another performance case, you can map the processes so that
two run on each of 4 nodes and those two share an L2 cache. This can be
done with something like:

  mpirun -n 8 taskset -c 0,4 ./lu.C.8

Core IDs 0 and 4 share an L2 cache on our system, at least. I guess you
are not that interested, but it is possible! :)
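To check that the pinning actually took effect, you can ask taskset for
the affinity of one of the running ranks (replace <pid> with the process
ID of an lu.C.8 process):

  taskset -cp <pid>

It prints the list of cores that process is allowed to run on.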
In addition, I don't believe there is very much communication in the
LU benchmark compared to the other NAS benchmarks.
All in all, I agree with both of you: both the L2 cache and the memory
bus are probably slowing you down.
As for the sys% time, I believe it is the NIC driver. The more
inter-node communication, the more sys%. The shared-memory communication
module (the sm BTL) does all of its communication in user space, as you
noticed.
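If you want to see where that system time goes while the benchmark runs,
something like mpstat (from the sysstat package, assuming it is
installed) shows per-core %usr/%sys/%soft once per second:

  mpstat -P ALL 1

Interrupt-handling time for eth1 should show up there on the nodes doing
inter-node communication.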
Best regards,
-Torje S. Henriksen
On Sep 30, 2008, at 6:55 PM, Jeff Squyres wrote:
Are these Intel-based machines? I have seen similar effects
mentioned earlier in this thread, where having all 8 cores banging on
memory pretty much kills performance on the UMA-style Intel 8-core
machines. I'm not a hardware expert, but I've stayed away from
buying 8-core servers for exactly this reason. AMD has been NUMA all
along, and Intel's newer chips are NUMA to alleviate some of this
bus pressure.
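If you want to see how memory is attached on a given box, numactl
(assuming it is installed) prints the NUMA layout; a UMA machine shows
up as a single node:

  numactl --hardware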
A ~2x performance loss (between 8 and 4 cores on a single node) seems
a bit excessive, but I guess it could happen...? (I don't have any
hard numbers either way.)
On Sep 29, 2008, at 2:30 PM, Leonardo Fialho wrote:
Hi All,
I'm running some tests with the NAS benchmarks on a multi-core machine
(8 cores per node), and something I consider strange is occurring...
I'm using only one NIC and processor affinity:
./bin/mpirun
-n 8
--hostfile ./hostfile
--mca mpi_paffinity_alone 1
--mca btl_tcp_if_include eth1
--loadbalance
./codes/nas/NPB3.3/NPB3.3-MPI/bin/lu.C.8
I have enough memory to run this application on a single node, but:
1) If I use one node (8 cores), the "user" % is around 100% per
core. The execution time is around 430 seconds.
2) If I use 2 nodes (4 cores on each node), the "user" % is around
95% per core and the "sys" % is 5%. The execution time is around
220 seconds.
3) If I use 4 nodes (1 core on each node), the "user" % is around
85% per core and the "sys" % is 15%. The execution time is around
200 seconds.
Well... the questions are:
A) Shouldn't the execution time in case "1" be smaller than in cases
"2" and "3" (only shared-memory communication, no?)? Cache problems?
B) Why is there "sys" time when communicating between nodes? The NIC
driver? And why does this time increase when I balance the load across
the nodes?
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifici Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
--
Jeff Squyres
Cisco Systems