On 29.01.2014, at 03:00, Victor wrote:

> I am running the CFD simulation benchmark cavity3d, available within 
> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
> 
> It is a parallel-friendly lattice Boltzmann solver library.
> 
> Palabos provides benchmark results for cavity3d on several different 
> platforms and configurations here: 
> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
> 
> The problem that I have is that the benchmark performance on my cluster does 
> not scale even close to a linear scale.
> 
> My cluster configuration:
> 
> Node1: Dual Xeon 5560, 48 GB RAM
> Node2: i5-2400, 24 GB RAM
> 
> Gigabit Ethernet connection on eth0
> 
> OpenMPI 1.6.5 on Ubuntu 12.04.3
> 
> 
> Hostfile:
> 
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
> 
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile 
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
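
A side note: it can be worth checking where the 8 ranks actually end up
and whether they get bound to cores. With Open MPI 1.6.x something along
these lines (flag names from memory, please check the mpirun man page)
should print the bindings at startup:

mpirun --report-bindings --bind-to-core --mca btl_tcp_if_include eth0 \
    --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
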
> 
> Problem:
> 
> cavity3d 400
> 
> When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per second
> When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per second
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile 
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site 
> updates per second
> 
> I understand that there are latencies with GbE and that there is MPI 
> overhead, but this performance scaling still seems very poor. Are my 
> expectations of scaling naive, or is there actually something wrong and 
> fixable that would improve the scaling? Ideally, I would like each node 
> to add to the cluster's performance, not slow it down. 
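
Just to put a number on it, from the figures you quote (my arithmetic,
assuming the two nodes' Msu/s rates would simply add up in the ideal case):

# Quick scaling check based on the numbers quoted above.
# Assumption (mine): Mega site updates per second add up linearly
# in the ideal case, i.e. perfect scaling across the two nodes.
node1_4ranks = 35.7615    # Msu/s, 4 ranks on Node1
node2_4ranks = 30.7972    # Msu/s, 4 ranks on Node2
both_8ranks  = 47.3538    # Msu/s, 4 + 4 ranks over GbE

ideal = node1_4ranks + node2_4ranks   # ~66.6 Msu/s for perfect scaling
efficiency = both_8ranks / ideal      # ~0.71
print(f"ideal: {ideal:.1f} Msu/s, measured: {both_8ranks:.1f} Msu/s, "
      f"parallel efficiency: {efficiency:.0%}")

So the two nodes together reach roughly 70% of what they deliver
separately. Whether that is unreasonable depends on how much data the
solver has to push over the wire per time step, see below.
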
> 
> Things get even worse if I run an asymmetric number of MPI processes on 
> each node. For instance, running -np 12 on Node1

Isn't this overloading the machine, which has only 8 real cores in total?


> is significantly faster than running -np 16 across Node1 and Node2; thus 
> adding Node2 actually slows down the overall performance.

The i5-2400 has only 4 cores and no Hyper-Threading. Together with the 8 
physical cores of the dual Xeon 5560 that is 12 physical cores in total, so 
-np 12 on Node1 alone already relies on Hyper-Threading, and -np 16 across 
both nodes oversubscribes the physical cores.

How much data has to be exchanged between the processes depends on the 
algorithm, and this exchange can indeed be far more expensive across a 
network than within a single node.
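
To get a feeling for the orders of magnitude, here is a very rough
back-of-the-envelope sketch. All assumptions in it are mine (a plain slab
decomposition with a single slab boundary sitting on the GbE link, D3Q19
with 5 populations crossing a face, double precision, ~110 MB/s of usable
GbE bandwidth), not Palabos' actual domain layout:

# Back-of-the-envelope estimate of the communication cost per LB time step
# across the GbE link. All parameters below are assumptions, not measured:
# 400^3 lattice, D3Q19, slab decomposition, one slab boundary between the
# two nodes, 5 outgoing populations per face, 8-byte doubles.
N             = 400       # lattice sites per edge
pops_per_face = 5         # D3Q19 populations crossing a face per direction
bytes_per_pop = 8         # double precision
gbe_bw        = 110e6     # usable bytes/s over GbE (rough)

face_sites     = N * N                                           # interface sites
bytes_per_step = 2 * face_sites * pops_per_face * bytes_per_pop  # both directions
comm_time      = bytes_per_step / gbe_bw                         # seconds on the wire

total_sites = N ** 3
ideal_msups = 35.7615 + 30.7972                   # combined single-node rate, Msu/s
step_time   = total_sites / (ideal_msups * 1e6)   # ideal compute time per step

print(f"~{bytes_per_step/1e6:.1f} MB per step over GbE, "
      f"~{comm_time*1e3:.0f} ms on the wire vs ~{step_time*1e3:.0f} ms compute")

Even in this optimistic picture the link costs on the order of 10% of each
time step if nothing overlaps with computation; latency, packing of
non-contiguous halo data and a less favourable decomposition can easily
make it considerably worse.
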

Also: does the algorithm scale linearly when used on Node1 alone with 8 cores? 
If it's 35.7615 with 4 cores, what result do you get with 8 cores on this 
machine?
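
For example, on Node1 only and without the hostfile (same invocation style
as your commands above):

mpirun -np 8 ./cavity3d 400

If that already falls clearly short of twice the 4-core figure, the scaling
limit is already inside the node and not only on the GbE link.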

-- Reuti
