[OMPI users] Bad parallel scaling using Code Saturne with openmpi
Hi. I have recently built a cluster on a Dell PowerEdge server running Debian 6.0. The server is composed of 4 system boards, each with 2 hexa-core processors, i.e. 12 cores per system board. The boards are linked through a local Gbit Ethernet switch.

In order to parallelize the software Code Saturne, which is a CFD solver, I have configured the cluster so that one system board runs the PBS server plus a MOM, and the 3 other boards each run a MOM. This gives 48 cores spread over 4 nodes of 12 CPUs each. Code Saturne is compiled against Open MPI 1.6.

When I launch a simulation on 2 nodes with 12 cores each, the elapsed time is good and the network traffic is not saturated. But when I launch the same simulation on 3 nodes with 8 cores each, the elapsed time is 5 times longer. In both cases I use 24 cores, and the network does not seem to be saturated.

I have tested several configurations: binaries on a local file system or on NFS. The results are the same. I have visited several forums (in particular http://www.open-mpi.org/community/lists/users/2009/08/10394.php) and read many threads, but as I am not an expert in clusters, I currently do not see what is wrong!

Is it a problem in the configuration of PBS (I installed it from the Debian packages), a subtle Open MPI compile option, or a bad network configuration?

Regards.

B. S.
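For reference, a run of this shape would typically be submitted through a Torque/PBS batch script along the following lines. This is only a minimal sketch: the walltime and the solver executable name (cs_solver) are assumptions for illustration, not details taken from the report above.

    #!/bin/bash
    #PBS -l nodes=2:ppn=12          # 2 nodes x 12 cores = 24 MPI ranks
    #PBS -l walltime=02:00:00
    cd $PBS_O_WORKDIR
    # When Open MPI 1.6 is built with Torque (tm) support, mpiexec picks up
    # the allocated node list automatically, so no hostfile is needed here.
    mpiexec ./cs_solver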
Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
Thanks for your answer. You are right: I have tried on 4 nodes with 6 processes each and things are worse. So do you suggest that the only thing to do is to order an InfiniBand switch, or is there a possibility to improve things by tuning MCA parameters?

From: Ralph Castain
To: Dugenoux Albert; Open MPI Users
Sent: Tuesday, 10 July 2012, 16:47
Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi

I suspect it mostly reflects communication patterns. I don't know anything about Saturne, but shared memory is a great deal faster than TCP, so the more processes sharing a node the better. You may also be hitting some natural boundary in your model - perhaps with 8 processes/node you wind up with more processes that cross the node boundary, further increasing the communication requirement.

Do things continue to get worse if you use all 4 nodes with 6 processes/node?

On Jul 10, 2012, at 7:31 AM, Dugenoux Albert wrote:
> [...]
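As a point of reference, MCA parameters can be passed on the mpiexec command line or stored in a per-user file; a minimal sketch, assuming the MPI traffic goes over eth0 (the interface name and executable name are placeholders, not recommendations from this thread):

    # keep shared memory within a node, restrict the TCP BTL to the GigE interface
    mpiexec --mca btl sm,self,tcp \
            --mca btl_tcp_if_include eth0 \
            ./cs_solver

    # the same settings can be made permanent in $HOME/.openmpi/mca-params.conf:
    #   btl = sm,self,tcp
    #   btl_tcp_if_include = eth0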
Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
Hi. To answer the different remarks:

1) Code Saturne itself launches embedded Python and bash scripts with the mpiexec parameters, but I will test your parameter next week and will give you the result of this benchmark.

2) I do not think there is a problem with load balancing: Code Saturne partitions the mesh itself with the reliable and well-known Metis graph-partitioning library, so the CPUs are equally busy.

3) The CPUs are Xeons with multithreading capability. I have tested this by setting np=24 in the server_priv/nodes file of the PBS server and compared it with a configuration of np=12. The results are very similar: there is no gain of 20% or 30%. (A sketch of this nodes file is given after this message.)

4) I will examine the hardware options as you have suggested, but I will have to convince my office to make such an investment!

From: Gus Correa
To: Open MPI Users
Sent: Wednesday, 11 July 2012, 00:51
Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi

On 07/10/2012 05:31 PM, Jeff Squyres wrote:
> +1. Also, not all Ethernet switches are created equal --
> particularly commodity 1GB Ethernet switches.
> I've seen plenty of crappy Ethernet switches rated for 1GB
> that could not reach that speed when under load.

Are you perhaps belittling my dear $43 [brand undisclosed] 5-port GigE SoHo switch, that connects my Pentium-III toy cluster, just because it drops a few packages [per microsec]? It looks so good, with all those fiercely blinking green LEDs. Where else could I fool around with cluster setup and test the new Open MPI releases? :)

The production cluster is just too crowded for this, maybe because it has a decent HP GigE switch [IO] and Infiniband [MPI] ...

Gus

> On Jul 10, 2012, at 10:47 AM, Ralph Castain wrote:
>> [...]
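For reference, the nodes file mentioned in point 3 is the Torque/PBS server_priv/nodes file; a minimal sketch of the two configurations compared, with host names node1..node4 used as placeholders (not taken from the original report):

    # server_priv/nodes (in the Torque server's spool directory)
    # one line per MOM; np = number of job slots advertised per board
    node1 np=12
    node2 np=12
    node3 np=12
    node4 np=12
    #
    # hyper-threading test: advertise 24 slots per board instead, e.g.
    # node1 np=24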
Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
Hello. As promised, here are the results of the different simulations and parameter settings for the various MPI options:

TEST | DESCRIPTION                                                | SHARING             | MPI | WITH PBS | ELAPSED TIME (1ST ITERATION)
  1  | Node 2                                                     | 12 processes        | yes | no       | 0.21875E+03
  2  | Node 1                                                     | 12 processes        | yes | no       | 0.21957E+03
  3  | Node 1, 24 processes to test multithreading                | 24 processes        | yes | no       | 0.20613E+03
  4  | Node 2                                                     | 12 processes        | yes | yes      | 0.22130E+03
  5  | Node 2, 24 processes to test multithreading                | 24 processes        | yes | no       | 0.27300E+03
  7  | Nodes 1, 2                                                 | 2 x 6 processes     | yes | yes      | 0.17304E+03
  8  | Nodes 1, 2                                                 | 2 x 11 processes    | yes | yes      | 0.12395E+03
  9  | Nodes 1, 2                                                 | 2 x 12 processes    | yes | yes      | 0.11812E+03
 10  | Nodes 3, 4                                                 | 2 x 12 processes    | yes | yes      | 0.11237E+03
 11  | Nodes 1, 2, 3 with 1 more process on node 3                | 2 x 12 + 1 process  | yes | yes      | 0.56223E+03
 12  | Nodes 1, 2, 3; MPI options --bycore --bind-to-core         | 2 x 12 + 1 process  | yes | yes      | 0.32452E+03
 13  | Nodes 1, 4, 3 with 1 more process on node 3                | 2 x 12 + 1 process  | yes | yes      | 0.37252E+03
 14  | Nodes 1, 4, 3; MPI options --bysocket --bind-to-socket     | 2 x 12 + 1 process  | yes | yes      | 0.5E+03
 15  | Nodes 1, 4, 3; MPI options --bycore --bind-to-core         | 2 x 12 + 1 process  | yes | yes      | 0.39983E+03
 16  | Nodes 2, 3, 4                                              | 3 x 12 processes    | yes | yes      | 0.85723E+03
 17  | Nodes 2, 3, 4                                              | 3 x 8 processes     | yes | yes      | 0.49378E+03
 18  | Nodes 1, 2, 3                                              | 3 x 8 processes     | yes | yes      | 0.51863E+03
 19  | Nodes 1, 2, 3, 4                                           | 4 x 6 processes     | yes | yes      | 0.73272E+03
 21  | Nodes 1, 2, 3, 4; MPI options --bysocket --bind-to-socket  | 4 x 6 processes     | yes | yes      | 0.67739E+03
 22  | Nodes 1, 2, 3, 4; MPI options --bycore --bind-to-core      | 4 x 6 processes     | yes | yes      | 0.69612E+03

The most surprising results, even taking into account the latency between nodes, are tests 11 to 15. By adding only 1 process on node 3, the elapsed time becomes 0.56E+03, i.e. 5 times that of tests 9 and 10. When partitioning over 25 processes, the single process on node 3 represents only 4% of the simulation (I have verified each partition: they contain approximately the same number of elements, plus or minus 8%). Even allowing a latency factor of 10 on that 4%, i.e. 40% more time, one should obtain (relative to test 10) 0.11E+03 x 1.40 ~= 0.154E+03 s.

In addition, when I observe the data transfers on the eth0 connection during an iteration, I see that when nodes 1 and 2 transfer, for example, 5 MB, node 3 transfers 2.5 MB. But since node 3 holds only 4% of the simulation data, it should need only about 200 KB!

The results also differ markedly between the socket-binding and core-binding options, as tests 13, 14 and 15 show.

Regards.

Albert
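For reference, the mapping and binding options quoted in tests 12, 14, 15, 21 and 22 are Open MPI 1.6 mpiexec flags; a minimal sketch of how they would appear on the command line (the process count and executable name are placeholders, and under PBS the node list comes from the Torque allocation):

    # round-robin ranks over cores and pin each rank to its core
    mpiexec -np 25 --bycore --bind-to-core ./cs_solver

    # round-robin ranks over sockets and pin each rank to a socket
    mpiexec -np 25 --bysocket --bind-to-socket ./cs_solver

    # --report-bindings prints the CPU mask chosen for each rank,
    # which helps verify which layout was actually applied
    mpiexec -np 25 --report-bindings --bycore --bind-to-core ./cs_solver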