Jed Brown wrote:
On Mon 2008-09-29 20:30, Leonardo Fialho wrote:
1) If I use one node (8 cores) the "user" % is around 100% per core. The execution time is around 430 seconds.

2) If I use 2 nodes (4 cores in each node) the "user" % is around 95% per core and the "sys" % is 5%. The execution time is around 220 seconds.

3) If I use 4 nodes (*2* cores in each node) the "user" % is around 85% per core and the "sys" % is 15%. The execution time is around 200 seconds.
Do you mean 2 cores per node (1 core per socket)?
Exactly, sorry.
Well... the questions are:

A) The execution time in case "1" should be smaller (only sm communication, no?) than in cases "2" and "3", shouldn't it? Cache problems?
Is this benchmark memory bandwidth limited?  Your results are fairly
typical for sparse matrix kernels.  One core can more or less saturate
the bus on its own, two cores can overlap memory access so it doesn't
hurt too much, more than two and they are all waiting on memory.  The
extra cores are cheaper than more sockets but they don't do much/any
good for many workloads.
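
A quick way to check whether the kernel really is bandwidth limited is a
STREAM-style probe.  The sketch below is hypothetical (not from this
thread): each MPI rank streams through large arrays, and if the aggregate
GB/s stops growing as you add ranks per node, the memory bus is saturated.
Build with something like "mpicc -std=c99 -O2 triad.c" and run it with
different numbers of ranks per node.

/* Hypothetical bandwidth probe (a sketch, not the original benchmark). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 23)  /* 8M doubles per array, far larger than cache */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int rep = 0; rep < 10; rep++)
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];      /* STREAM-style triad */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    /* 3 arrays of 8-byte doubles touched per iteration, 10 reps, all ranks */
    double gb = 10.0 * 3.0 * N * sizeof(double) * size / 1e9;
    if (rank == 0)
        printf("%d ranks: %.2f GB/s aggregate (check %g)\n",
               size, gb / (t1 - t0), a[0]);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}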
B) Why is there "sys" time when communicating between nodes? The NIC driver? And why does this time increase when I balance the load across the nodes?
Messages over Ethernet cost more than messages in shared memory.  When
you only use 1 core per socket, the application is faster because the
single thread has the full memory bandwidth to itself, but MPI needs
to move more data over the wire, so that phase costs more.  If your
network were faster (e.g. InfiniBand) you could expect the communication
to stay quite cheap even with only one process per node.
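
To see that per-message cost directly, a two-rank ping-pong (again just a
sketch, not from the thread) can be timed once with both ranks on one node
and once across two nodes; with Open MPI you can also force the transport
with "--mca btl sm,self" versus "--mca btl tcp,self".

/* Hypothetical ping-pong sketch.  Run with 2 ranks; placement (same
 * node vs. two nodes) selects shared memory or TCP, and the TCP case
 * is where the "sys" time shows up. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { MSG = 65536, REPS = 1000 };
    static char buf[MSG];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, MSG);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d round trips of %d bytes: %.1f us each\n",
               REPS, MSG, (t1 - t0) / REPS * 1e6);
    MPI_Finalize();
    return 0;
}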
The nodes have 2 sockets with 4 cores in each.

In other words... in these cases ("2" and "3"), is the contention for the bus/memory among more than 2 tasks worse than the Gigabit Ethernet overhead?

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
