On 29.01.2014 at 03:00, Victor wrote:

> I am running the CFD simulation benchmark cavity3d available within
> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>
> It is a parallel-friendly Lattice Boltzmann solver library.
>
> Palabos provides benchmark results for cavity3d on several different
> platforms and variables here:
> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>
> The problem that I have is that the benchmark performance on my cluster
> does not scale even close to linearly.
>
> My cluster configuration:
>
> Node1: Dual Xeon 5560, 48 GB RAM
> Node2: i5-2400, 24 GB RAM
>
> Gigabit Ethernet connection on eth0
>
> OpenMPI 1.6.5 on Ubuntu 12.04.3
>
> Hostfile:
>
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
>
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>
> Problem:
>
> cavity3d 400
>
> When I run mpirun -np 4 on Node1, I get 35.7615 Mega site updates per second.
> When I run mpirun -np 4 on Node2, I get 30.7972 Mega site updates per second.
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400, I get 47.3538 Mega site
> updates per second.
>
> I understand that there are latencies with GbE and that there is MPI
> overhead, but this performance scaling still seems very poor. Are my
> expectations of scaling naive, or is there actually something wrong and
> fixable that will improve the scaling? Optimistically I would like each
> node to add to the cluster performance, not slow it down.
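Taking the quoted single-node numbers at face value (and assuming the 8-rank
run places 4 ranks on each node, as the hostfile suggests), linear scaling
would give roughly

  35.7615 + 30.7972 = 66.5587 Mega site updates per second,

while the measured 47.3538 is about 47.3538 / 66.5587 = 71% of that. So
nearly a third of the aggregate throughput is lost to communication and to
the faster node waiting for the slower one.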
> Things get even worse if I run an asymmetric number of MPI jobs on each
> node. For instance, running -np 12 on Node1

Isn't this overloading the machine, with only 8 real cores in total?

> is significantly faster than running -np 16 across Node1 and Node2, thus
> adding Node2 actually slows down the performance.

The i5-2400 has only 4 cores and no Hyper-Threading. How much data has to be
exchanged between the processes depends on the algorithm, and this exchange
can indeed be much slower across a network than within one machine.

Also: does the algorithm scale linearly when used on Node1 alone with 8
cores? When it gives 35.7615 Mega site updates per second with 4 cores, what
result do you get with 8 cores on this machine?

-- Reuti
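P.S.: To see what each message costs over the GbE link compared to shared
memory, a minimal MPI ping-pong sketch along the lines below may help. It is
my own illustration, not part of Palabos; the file name pingpong.c, the
1 MiB message size, and the 100 iterations are arbitrary choices.

/* pingpong.c - measure round-trip time and one-way bandwidth
 * between exactly two MPI ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 1 << 20;   /* 1 MiB per message, arbitrary */
    const int iters = 100;
    int rank, size, i;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }
    buf = malloc(msg_bytes);
    memset(buf, 0, msg_bytes);       /* send defined data */

    MPI_Barrier(MPI_COMM_WORLD);     /* start both ranks together */
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {             /* rank 0: send, then wait for echo */
            MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                     /* rank 1: echo everything back */
            MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)                   /* one-way BW: 2 * bytes / round trip */
        printf("avg round trip: %.3f ms, approx. %.1f MB/s one-way\n",
               1e3 * (t1 - t0) / iters,
               2.0 * msg_bytes * iters / (t1 - t0) / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}

Compile it and run it once inside Node1 and once with one rank on each node,
e.g.:

  mpicc pingpong.c -o pingpong
  mpirun -np 2 ./pingpong
  mpirun --bynode --hostfile /home/mpiuser/.mpi_hostfile -np 2 ./pingpong

Over GbE the one-way figure cannot exceed the theoretical ~125 MB/s, while
shared memory within one node is typically an order of magnitude faster;
that gap is what every boundary exchange of the solver has to pay when it
crosses the node boundary.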