On Mon 2008-09-29 20:30, Leonardo Fialho wrote:
> 1) If I use one node (8 cores) the "user" % is around 100% per core.
> The execution time is around 430 seconds.
>
> 2) If I use 2 nodes (4 cores in each node) the "user" % is around 95%
> per core and the "sys" % is 5%. The execution time is around 220 seconds.
>
> 3) If I use 4 nodes (1 core in each node) the "user" % is around 85%
> per core and the "sys" % is 15%. The execution time is around 200
> seconds.
Do you mean 2 cores per node (1 core per socket)?

> Well... the questions are:
>
> A) The execution time in case "1" should be smaller (only sm
> communication, no?) than case "2" and "3", no? Cache problems?

Is this benchmark memory-bandwidth limited? Your results are fairly
typical for sparse matrix kernels. One core can more or less saturate
the bus on its own; two cores can overlap memory access, so sharing
doesn't hurt too much; with more than two, they are all waiting on
memory. The extra cores are cheaper than more sockets, but they do
little or no good for many workloads.

> B) Why is there "sys" time when communicating between nodes? NIC
> driver? Why does this time increase when I balance the load across
> the nodes?

Messages over Ethernet cost more than messages in shared memory. When
you use only one core per socket, the application phase is faster
because the single thread has the full memory bandwidth to itself, but
MPI then needs to move more data over the wire, so the communication
phase costs more. If your network were faster (e.g. InfiniBand), you
could expect communication to stay quite cheap even with only one
process per node.

Jed
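P.S. Here is a back-of-envelope roofline sketch of why extra cores on
one socket buy little for a sparse kernel. All numbers are illustrative
guesses (hypothetical bus bandwidth, per-core flop rate, and CSR traffic
per nonzero), not measurements of your machine:

```python
# Roofline-style estimate for CSR sparse matrix-vector multiply (SpMV).
# Assumed (hypothetical) numbers: a shared ~6 GB/s memory bus per socket
# and cores that each sustain ~1 Gflop/s on this kernel.

bytes_per_nonzero = 12.0   # 8-byte double value + 4-byte column index
flops_per_nonzero = 2.0    # one multiply + one add per nonzero

bus_bw = 6e9               # bytes/s, shared by all cores on the socket
core_flops = 1e9           # flop/s a single core could sustain

def spmv_time(nnz, ncores):
    """Time for one SpMV: the max of compute time and memory time.
    Compute scales with cores; the shared bus does not."""
    compute = nnz * flops_per_nonzero / (core_flops * ncores)
    memory = nnz * bytes_per_nonzero / bus_bw
    return max(compute, memory)

nnz = 10_000_000
for p in (1, 2, 4, 8):
    print(p, "cores:", spmv_time(nnz, p), "s")
```

With these numbers, one core already hits the memory limit, so 2, 4, and
8 cores on the socket all predict the same time, much like your case 1.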
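P.P.S. The communication side can be sketched with a simple
latency/bandwidth (alpha-beta) model. The latencies and bandwidths
below are rough guesses for shared memory, gigabit Ethernet, and an
InfiniBand-class network, purely to show the relative costs:

```python
# Alpha-beta message cost model: time = alpha + nbytes * beta.
# All constants are illustrative assumptions, not measurements.

def msg_time(nbytes, alpha, beta):
    """alpha = per-message latency (s), beta = seconds per byte."""
    return alpha + nbytes * beta

sm  = lambda n: msg_time(n, 1e-6, 1.0 / 1e9)    # shared memory
eth = lambda n: msg_time(n, 50e-6, 1.0 / 100e6)  # gigabit Ethernet
ib  = lambda n: msg_time(n, 5e-6, 1.0 / 1e9)     # InfiniBand-class

n = 100_000  # e.g. a 100 kB exchange per neighbor
print("sm :", sm(n))
print("eth:", eth(n))
print("ib :", ib(n))
```

For a message of this size, Ethernet comes out roughly an order of
magnitude slower than shared memory, while the InfiniBand-class numbers
stay close to shared memory, which is why spreading processes across
nodes trades memory bandwidth for wire time on your cluster.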