[OMPI users] Bad parallel scaling using Code Saturne with openmpi

2012-07-10 Thread Dugenoux Albert
Hi.
 
I have recently built a cluster on a Dell PowerEdge server running Debian 6.0. The server is composed of
4 system boards, each with 2 hexa-core processors, which gives 12 cores per system board.
The boards are linked by a local Gbit Ethernet switch.
 
In order to parallelize the software Code Saturne, which is a CFD solver, I have configured the cluster
so that one system board runs the PBS server plus a MOM, and each of the 3 other boards runs a MOM.
This gives 48 cores dispatched over 4 nodes of 12 CPUs each. Code Saturne is compiled against
Open MPI 1.6.
 
When I launch a simulation using 2 nodes with 12 cores each, the elapsed time is good and the network
traffic is not full.
But when I launch the same simulation using 3 nodes with 8 cores each, the elapsed time is 5 times
the previous one.
In both cases I use 24 cores, and the network does not appear to be saturated.
 
I have tested several configurations: binaries on a local file system or on NFS, but the results are
the same.
I have visited several forums (in particular
http://www.open-mpi.org/community/lists/users/2009/08/10394.php)
and read lots of threads, but as I am not an expert on clusters, I presently do not see what is wrong!
 
Is it a problem in the configuration of PBS (I installed it from the deb packages), a subtle
compilation option of Open MPI, or a bad network configuration?
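 
For reference, a minimal sketch of the kind of PBS submission I have in mind (the job name, walltime
and solver name below are only illustrative; the real launch scripts are generated by Code Saturne
itself):

    #!/bin/bash
    #PBS -N saturne_test
    #PBS -l nodes=2:ppn=12
    #PBS -l walltime=02:00:00
    cd $PBS_O_WORKDIR
    # Open MPI 1.6 built with PBS/tm support picks up the allocated
    # node list itself, so no machinefile is passed explicitly.
    mpirun -np 24 ./cs_solver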
 
Regards.
 
B. S.

Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi

2012-07-10 Thread Dugenoux Albert
Thanks for your answer. You are right: I have tried 4 nodes with 6 processes each and things are worse.
 
So do you suggest that the only thing to do is to order an InfiniBand switch, or is there a
possibility to improve things by tuning MCA parameters?
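 
In the meantime, the kind of tuning I was thinking of looks like the sketch below (the parameter
names are standard Open MPI TCP BTL parameters, but the values are only guesses on my part, not
recommendations):

    # List the TCP BTL parameters available in this Open MPI build
    ompi_info --param btl tcp

    # Restrict Open MPI to shared memory + TCP, pin it to eth0 and
    # enlarge the TCP socket buffers (values purely illustrative)
    mpirun -np 24 \
           --mca btl sm,self,tcp \
           --mca btl_tcp_if_include eth0 \
           --mca btl_tcp_sndbuf 524288 \
           --mca btl_tcp_rcvbuf 524288 \
           ./cs_solver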
 



From: Ralph Castain
To: Dugenoux Albert; Open MPI Users
Sent: Tuesday, 10 July 2012, 16:47
Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi


I suspect it mostly reflects communication patterns. I don't know anything 
about Saturne, but shared memory is a great deal faster than TCP, so the more 
processes sharing a node the better. You may also be hitting some natural 
boundary in your model - perhaps with 8 processes/node you wind up with more 
processes that cross the node boundary, further increasing the communication 
requirement. 

Do things continue to get worse if you use all 4 nodes with 6 processes/node?
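
For example, something like the following would let you compare the two layouts directly (the
executable name is just a placeholder for however you normally start the solver):

    # 2 nodes x 12 processes per node: most communication stays in shared memory
    mpirun -np 24 -npernode 12 ./cs_solver

    # 4 nodes x 6 processes per node: more communication crosses the GigE switch
    mpirun -np 24 -npernode 6 ./cs_solver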



On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:

> [...]

Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi

2012-07-11 Thread Dugenoux Albert
Hi.

To answer the different remarks:

1) Code Saturne launches its own embedded Python and bash scripts with the mpiexec parameters, but I
will test your parameter next week and will give you the result of that benchmark.


2) I do not think there is a problem with load balancing: Code Saturne partitions the mesh itself
with the reliable and well-known METIS graph partitioning library, so the CPUs are equally busy.


3) The CPUs are Xeons with multithreading (hyper-threading) capability. However, I have tested this
by setting np=24 in the server_priv/nodes file of the PBS server and compared it with a configuration
of np=12 (see the sketch after this list). The results are very similar: there is no gain of 20% or 30%.

4) I will examine the hardware options as you suggested, but I will have to convince my office of
such an investment!
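 
For point 3, the two nodes-file configurations I compared look like the sketch below (the hostnames
are made up; the format is the usual TORQUE server_priv/nodes one):

    # $PBS_HOME/server_priv/nodes -- one slot per physical core
    node1 np=12
    node2 np=12
    node3 np=12
    node4 np=12

    # Hyper-threading test: same file with np=24 on each line instead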



From: Gus Correa
To: Open MPI Users
Sent: Wednesday, 11 July 2012, 00:51
Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
 
On 07/10/2012 05:31 PM, Jeff Squyres wrote:
> +1.  Also, not all Ethernet switches are created equal --
> particularly commodity 1GB Ethernet switches.
> I've seen plenty of crappy Ethernet switches rated for 1GB
> that could not reach that speed when under load.
>

Are you perhaps belittling my dear $43 [brand undisclosed]
5-port GigE SoHo switch, which connects my Pentium-III
toy cluster, just because it drops a few packets [per microsec]?
It looks so good, with all those fiercely blinking green LEDs.
Where else could I fool around with cluster setup and test
new Open MPI releases? :)
The production cluster is just too crowded for this,
maybe because it has a decent
HP GigE switch [IO] and InfiniBand [MPI] ...

Gus


> [...]

Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi

2012-07-17 Thread Dugenoux Albert
Hello.
As promised, here are the results of the different simulations and parameters, according to the MPI options:

TEST | DESCRIPTION                                                | PROCESSES         | MPI | WITH PBS | ELAPSED TIME, 1ST ITERATION
1    | Node 2                                                     | 12 processes      | yes | no       | 0.21875E+03
2    | Node 1                                                     | 12 processes      | yes | no       | 0.21957E+03
3    | Node 1, 24 processes to test multithreading                | 24 processes      | yes | no       | 0.20613E+03
4    | Node 2                                                     | 12 processes      | yes | yes      | 0.22130E+03
5    | Node 2, 24 processes to test multithreading                | 24 processes      | yes | no       | 0.27300E+03
6    | --                                                         | --                | --  | --       | --
7    | Nodes 1, 2                                                 | 2 x 6 processes   | yes | yes      | 0.17304E+03
8    | Nodes 1, 2                                                 | 2 x 11 processes  | yes | yes      | 0.12395E+03
9    | Nodes 1, 2                                                 | 2 x 12 processes  | yes | yes      | 0.11812E+03
10   | Nodes 3, 4                                                 | 2 x 12 processes  | yes | yes      | 0.11237E+03
11   | Nodes 1, 2, 3 with 1 more process on node 3                | 2 x 12 + 1 proc.  | yes | yes      | 0.56223E+03
12   | Nodes 1, 2, 3; MPI options --bycore --bind-to-core         | 2 x 12 + 1 proc.  | yes | yes      | 0.32452E+03
13   | Nodes 1, 4, 3 with 1 more process on node 3                | 2 x 12 + 1 proc.  | yes | yes      | 0.37252E+03
14   | Nodes 1, 4, 3; MPI options --bysocket --bind-to-socket     | 2 x 12 + 1 proc.  | yes | yes      | 0.5E+03
15   | Nodes 1, 4, 3; MPI options --bycore --bind-to-core         | 2 x 12 + 1 proc.  | yes | yes      | 0.39983E+03
16   | Nodes 2, 3, 4                                              | 3 x 12 processes  | yes | yes      | 0.85723E+03
17   | Nodes 2, 3, 4                                              | 3 x 8 processes   | yes | yes      | 0.49378E+03
18   | Nodes 1, 2, 3                                              | 3 x 8 processes   | yes | yes      | 0.51863E+03
19   | Nodes 1, 2, 3, 4                                           | 4 x 6 processes   | yes | yes      | 0.73272E+03
20   | --                                                         | --                | --  | --       | --
21   | Nodes 1, 2, 3, 4; MPI options --bysocket --bind-to-socket  | 4 x 6 processes   | yes | yes      | 0.67739E+03
22   | Nodes 1, 2, 3, 4; MPI options --bycore --bind-to-core      | 4 x 6 processes   | yes | yes      | 0.69612E+03
The most surprising results, even taking into account the latency between the nodes, are tests 11 to 15.
By adding only 1 process on node 3, the elapsed time becomes 0.56E+03, i.e. 5 times that of cases 9 and 10.
When partitioning over 25 processes, node 3 represents only 4% of the simulation (I have verified each
partition: they contain approximately the same number of elements, plus or minus 8%). Even if one assumes
a latency factor of 10 on that part, i.e. 4% x 10 = 40% more overall, one should obtain (based on test 10):
0.11E+03 x 1.40 ~= 0.154E+03 s.
 
In addition, when I observe the data transfers on the eth0 connection during an iteration, I see that
when nodes 1 and 2 transfer, for example, 5 MB, node 3 transfers 2.5 MB. But if we consider that
node 3 is concerned with only 4% of the simulation data, it should only need about 200 KB!
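 
For the record, I am measuring this simply by watching the eth0 byte counters during an iteration
(any traffic monitor would do; this is just the quick check I use):

    # Sample eth0 receive/transmit byte counters every 5 seconds
    watch -n 5 'grep eth0 /proc/net/dev'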
 
Results are also very different between binding to socket and binding to core, as tests 13, 14 and 15 show.
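 
For completeness, the binding variants compared in tests 12 to 15 correspond to command lines like
these (simplified; the solver is normally started through the Code Saturne scripts):

    # Pack ranks core by core and bind each rank to one core
    mpirun -np 25 --bycore --bind-to-core --report-bindings ./cs_solver

    # Distribute ranks across sockets and bind each rank to its socket
    mpirun -np 25 --bysocket --bind-to-socket --report-bindings ./cs_solver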
 
Regards.
Albert