On 07/10/2012 03:54 PM, David Warren wrote:
Your problem may not be related to bandwidth. It may be latency or
division of the problem. We found significant improvements running wrf
and other atmospheric codes (CFD) over IB. The problem was not so much
the amount of data communicated, but how long it takes to send it. Also,
is your model big enough to split up as much as you are trying? If there
is not enough work for each core to do between edge exchanges, you will
spend all of your time spinning waiting for the network. If you are
running a demo benchmark it is likely too small for the number of
processors. At least that is what we find with most weather model demo
problems. One other thing to look at is how it is being split up.
Depending on what the algorithm does, you may want more x points, more y
points or completely even divisions. We found that we can significantly
speed up wrf for our particular domain by not lett
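For what it's worth, in WRF the decomposition can be forced with
nproc_x/nproc_y in the &domains namelist (just a sketch; the values
below are only illustrative, and their product must equal the number
of MPI tasks):

 nproc_x = 4
 nproc_y = 6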

On 07/10/12 08:48, Dugenoux Albert wrote:
Thanks for your answer. You are right.
I've tried 4 nodes with 6 processes per node and things are worse.
So do you suggest that the only thing to do is to order an InfiniBand
switch, or is there a possibility to improve something by tuning MCA
parameters ?
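For example, would something along these lines be worth trying?
(Just a guess on my side; the interface name eth0 is only an assumption.)

 ompi_info --param btl tcp
 mpiexec --mca btl sm,self,tcp --mca btl_tcp_if_include eth0 -np 24 ...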

*From:* Ralph Castain <r...@open-mpi.org>
*To:* Dugenoux Albert <dugeno...@yahoo.fr>; Open MPI Users
<us...@open-mpi.org>
*Sent:* Tuesday, July 10, 2012, 4:47 PM
*Subject:* Re: [OMPI users] Bad parallel scaling using Code Saturne
with openmpi

I suspect it mostly reflects communication patterns. I don't know
anything about Saturne, but shared memory is a great deal faster than
TCP, so the more processes sharing a node the better. You may also be
hitting some natural boundary in your model - perhaps with 8
processes/node you wind up with more processes that cross the node
boundary, further increasing the communication requirement.

Do things continue to get worse if you use all 4 nodes with 6
processes/node?
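Something like this should do it (just a sketch; substitute however you
normally launch the solver):

 mpiexec -npernode 6 -np 24 ...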


On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:

Hi.
I have recently built a cluster on a Dell PowerEdge server with a
Debian 6.0 OS. This server is composed of 4 system boards, each with
2 hexa-core processors, so 12 cores per system board.
The boards are linked by a local Gbit switch.
In order to parallelize the software Code Saturne, which is a CFD
solver, I have configured the cluster so that there is a PBS
server/mom on 1 system board and a mom on each of the 3 other boards.
This gives 48 cores dispatched over 4 nodes of 12 CPUs each. Code
Saturne is compiled with Open MPI 1.6.
When I launch a simulation using 2 nodes with 12 cores each, the
elapsed time is good and the network is not saturated.
But when I launch the same simulation using 3 nodes with 8 cores each,
the elapsed time is 5 times the previous one.
In both cases I use 24 cores, and the network does not seem to be
saturated.
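(In PBS terms, the two runs are requested roughly as follows, in case
that helps:

 #PBS -l nodes=2:ppn=12
 #PBS -l nodes=3:ppn=8
)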
I have tested several configurations: binaries on the local file
system or on NFS, but the results are the same.
I have visited several forums (in particular
http://www.open-mpi.org/community/lists/users/2009/08/10394.php)
and read lots of threads, but as I am not an expert in clusters, I do
not currently see what is wrong !
Is it a problem in the configuration of PBS (I installed it from the
deb packages), a subtle Open MPI compilation option, or a bad network
configuration ?
Regards.

Hi Albert

1) Have you tried to bind processes to cores:

mpiexec -bycore -bind-to-core -np ...

Sometimes it improves performance.
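For example (a sketch; adjust the process count to your job, and
--report-bindings just prints where each rank ends up bound):

mpiexec -bycore -bind-to-core --report-bindings -np 24 ...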

2) Packing as much work as possible in the fewest nodes
[and reducing MPI communication to mostly shared memory]
always helps, as Ralph said.
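For instance, with your 24 processes, filling 2 nodes completely keeps
more of the traffic in shared memory than spreading them over 3
(a sketch):

mpiexec -npernode 12 -np 24 ...   [2 full nodes]
mpiexec -npernode 8 -np 24 ...    [3 nodes, more traffic over TCP]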

3) InfiniBand's low latency and bandwidth help, as David said.
It will cost you a switch plus an HCA adapter card for each node.
This can give you an idea of prices:
http://www.colfaxdirect.com/store/pc/viewCategories.asp?idCategory=7
http://www.colfaxdirect.com/store/pc/viewCategories.asp?idCategory=6

4) Not all code scales well with the number of processors.
The norm is to be sub-linear, even when the problem is large enough
to avoid the 'too much communication per computation' situation that
David referred to.
Some have humps and bumps and sweet spots in the scaling curve,
which may depend on the specifics of domain dimensions, etc.

5) However, your CFD code may have some parameters/knobs to
control the way the mesh is decomposed, and how the subdomains
are distributed across the processors,
which may help you find the load balance for
each problem type and size. This comes partly from trial and error,
partly from educated guesses.

I hope this helps,
Gus Correa

PS - Are your processors Intel [Nehalem or later]?
Is hyperthreading turned on [in the BIOS perhaps]?
This would give you 24 virtual cores per node.
With hyperthreading turned on
you may need to adjust your $TORQUE_PBS/server_priv/nodes
file to read something like:
node01 np=24
instead of np=12, and then restart the pbs_server.
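For example (a sketch; the node names are placeholders, and the exact
way to restart pbs_server depends on how Torque was installed):

$TORQUE_PBS/server_priv/nodes:
 node01 np=24
 node02 np=24
 node03 np=24
 node04 np=24

 qterm -t quick
 pbs_server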
It may be worth testing the code both ways,
as sometimes hyperthreading sucks,
sometimes it shines, depending on the code.
[Or rather, it may not give you
twice the speed, but it may give you 20-30% more speed,
which is not negligible.]
