[OMPI users] Spawn and distribution of slaves
Hello,

Testing the MPI_Comm_spawn function of Open MPI version 1.0.1, I have an example that works OK, except that it shows that the spawned processes do not follow the "machinefile" assignment of processors. In this example a master process first spawns 2 processes, then disconnects from them and spawns 2 more. Running on a quad-Opteron node, all processes end up running on the same node, although the machinefile specifies that the slaves should run on different nodes.

With the current version of Open MPI, is it possible to direct the spawned processes to a specific node? (The node distribution could be given in the "machinefile" file, as with LAM MPI.)

The code (Fortran 90) of this example and its makefile are attached as a tar file (spawn+connect.tar.gz).

Thank you very much
Jean Latour
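For readers without the attachment, a minimal C sketch of the pattern being described (spawn two slaves, disconnect, spawn two more) could look like the following. The executable name "slave" and the use of MPI_COMM_SELF as the spawning communicator are illustrative assumptions, not the poster's actual Fortran code.

/* master.c - minimal sketch of the spawn/disconnect/spawn pattern
 * described above; "slave" is a hypothetical executable name. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm inter1, inter2;
    MPI_Init(&argc, &argv);

    /* First round: spawn 2 slaves and wait until they are done. */
    MPI_Comm_spawn("slave", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &inter1, MPI_ERRCODES_IGNORE);
    MPI_Barrier(inter1);
    MPI_Comm_disconnect(&inter1);

    /* Second round: spawn 2 more; ideally these would land on the
     * hosts listed next in the machinefile. */
    MPI_Comm_spawn("slave", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &inter2, MPI_ERRCODES_IGNORE);
    MPI_Barrier(inter2);
    MPI_Comm_disconnect(&inter2);

    MPI_Finalize();
    return 0;
}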
Re: [OMPI users] Spawn and distribution of slaves
Thanks for your answer. Your example addresses one possible situation, where a parallel application is spawned by a driver with MPI_Comm_spawn, or multiple parallel applications are spawned at the same time with MPI_Comm_spawn_multiple, over a set of processors described in the machinefile. That is fine if the next spawn occurs after some processes at the beginning of the machinefile have stopped.

However, I have at hand another case where the spawned processes are really dynamic over time. Any child process can stop (not necessarily the first in the machinefile), thereby freeing processors on which the newly spawned processes must run. With LAM/MPI this situation has a satisfactory solution through the INFO parameter of MPI_Comm_spawn: it allows specifying a "local" machinefile for the spawned processes, instead of always starting from the beginning of the same machinefile as in your example.

Do you know if this specific feature will be implemented in Open MPI (I hope it will be), and possibly when? Dynamic applications really need this.

Best Regards,
Jean Latour

Edgar Gabriel wrote:

So for my tests, Open MPI did follow the machinefile (see output further below); however, for each spawn operation it starts from the very beginning of the machinefile again. The following example spawns 5 child processes (with a single MPI_Comm_spawn), and each child prints its rank and the hostname.

gabriel@linux12 ~/dyncomm $ mpirun -hostfile machinefile -np 3 ./dyncomm_spawn_father
Checking for MPI_Comm_spawn.working
Hello world from child 0 on host linux12
Hello world from child 1 on host linux13
Hello world from child 3 on host linux15
Hello world from child 4 on host linux16
Testing Send/Recv on the intercomm..working
Hello world from child 2 on host linux14

with the machinefile being:

gabriel@linux12 ~/dyncomm $ cat machinefile
linux12
linux13
linux14
linux15
linux16

In your code you always spawn 1 process at a time, and that's why they are all located on the same node. Hope this helps...
Edgar

Edgar Gabriel wrote:

As far as I know, Open MPI should follow the machinefile for spawn operations, starting however for every spawn at the beginning of the machinefile again. An info object such as 'lam_sched_round_robin' is currently not available/implemented. Let me look into this...
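For reference, the mechanism being asked for is the MPI-2 info argument of MPI_Comm_spawn: the standard reserves keys such as "host" for exactly this kind of placement request, and LAM/MPI added its own keys (e.g. 'lam_sched_round_robin'). A hedged sketch of how that would look in C follows; per this thread, Open MPI 1.0.x does not yet honor such keys, and the executable name "slave" is an illustration only.

/* Sketch of the info-key mechanism requested above.  The MPI-2 standard
 * reserves the "host" key for MPI_Comm_spawn placement; whether a given
 * implementation honors it (Open MPI 1.0.x does not, per this thread)
 * is implementation-dependent. */
#include <mpi.h>

void spawn_on_host(const char *host, MPI_Comm *intercomm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", (char *)host);   /* requested placement */

    MPI_Comm_spawn("slave", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, intercomm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
}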
Re: [OMPI users] Spawn and Disconnect
Just to add an example that may help this "disconnect" discussion: attached is the code of a test that does the following (and it works perfectly with Open MPI 1.0.1):

1) master spawns slave1
2) master spawns slave2
3) messages are exchanged between master and slaves over the intercommunicators
4) slave1 disconnects from the master and finalizes
5) slave2 disconnects from the master and finalizes (the processors used by slave1 and slave2 can now be re-used by newly spawned processes)
6) master spawns slave3, and then slave4
7) slave3 and slave4 have NO direct communicator, but they can create one through the MPI_Open_port mechanism and the MPI_Comm_connect / MPI_Comm_accept functions. The port name is relayed through the master.
8) slave3 and slave4 create this direct communicator and do some ping-pong over it
9) slave3 and slave4 disconnect from each other on this direct communicator
10) slave3 and slave4 disconnect from the master and finalize
11) master finalizes

Hope it helps.
Best regards,
Jean Latour

Ralph Castain wrote:

We expect to have much better support for the entire comm_spawn process in the next incarnation of the RTE. I don't expect that to be included in a release, however, until 1.1 (Jeff may be able to give you an estimate of when that will happen). Jeff et al. may be able to give you access to an early non-release version sooner, if better comm_spawn support is a critical issue and you don't mind being patient with the inevitable bugs in such versions.
Ralph

Edgar Gabriel wrote:

Open MPI currently does not fully support a proper disconnection of parent and child processes. Thus, if a child dies/aborts, the parents will abort as well, despite calling MPI_Comm_disconnect. (The new RTE will have better support for these operations; Ralph/Jeff can probably give a better estimate of when this will be available.) However, what should not happen is that the parent goes down when the child merely calls MPI_Finalize (a proper shutdown rather than a violent death). Let me check that as well...

Brignone, Sergio wrote:

Hi everybody,

I am trying to run a master/slave set. Because of the nature of the problem I need to start and stop (kill) some slaves. The problem is that as soon as one of the slaves dies, the master dies also. This is what I am doing:

MASTER:
MPI_Init(...)
MPI_Comm_spawn(slave1,...,nslave1,...,intercomm1);
MPI_Barrier(intercomm1);
MPI_Comm_disconnect(&intercomm1);
MPI_Comm_spawn(slave2,...,nslave2,...,intercomm2);
MPI_Barrier(intercomm2);
MPI_Comm_disconnect(&intercomm2);
MPI_Finalize();

SLAVE:
MPI_Init(...)
MPI_Comm_get_parent(&intercomm);
(does something)
MPI_Barrier(intercomm);
MPI_Comm_disconnect(&intercomm);
MPI_Finalize();

The issue is that as soon as the first set of slaves calls MPI_Finalize, the master dies also (it dies right after MPI_Comm_disconnect(&intercomm1)).

What am I doing wrong?
Thanks
Sergio
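As a rough illustration of steps 7-9 above (the attached tar file is not reproduced here), the port-based handshake between slave3 and slave4 could be written as follows in C. The relay of the port name through the master is shown only as a single send/receive over the parent intercommunicator; the parent communicator itself comes from MPI_Comm_get_parent.

/* Sketch (not the attached code): two independently spawned slaves build
 * a direct intercommunicator via the port mechanism, with the port name
 * relayed through the master over the parent intercommunicator. */
#include <mpi.h>

void slave3_accept(MPI_Comm parent, MPI_Comm *direct)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);
    /* Send the port name to the master (rank 0 of the remote group),
     * which forwards it to slave4. */
    MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0, parent);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, direct);
    MPI_Close_port(port);
    /* ... ping-pong over *direct, then MPI_Comm_disconnect(direct) ... */
}

void slave4_connect(MPI_Comm parent, MPI_Comm *direct)
{
    char port[MPI_MAX_PORT_NAME];
    /* Receive the port name relayed by the master. */
    MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0, parent,
             MPI_STATUS_IGNORE);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, direct);
    /* ... ping-pong over *direct, then MPI_Comm_disconnect(direct) ... */
}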
[OMPI users] Performance of ping-pong using OpenMPI over Infiniband
Hello,

Testing the performance of Open MPI over InfiniBand, I have the following results:

1) Hardware is: SilverStorm interface
2) Open MPI version is (from ompi_info):
   Open MPI: 1.0.2a9r9159   Open MPI SVN revision: r9159
   Open RTE: 1.0.2a9r9159   Open RTE SVN revision: r9159
   OPAL: 1.0.2a9r9159       OPAL SVN revision: r9159
3) Cluster of dual-processor Opteron 248 (2.2 GHz) nodes. Configure has been run with the option --with-mvapi=path-to-mvapi
4) A C-coded ping-pong gives the following values:

LOOPS: 1000  BYTES: 4096     SECONDS: 0.085557   MBytes/sec: 95.749051
LOOPS: 1000  BYTES: 8192     SECONDS: 0.050657   MBytes/sec: 323.429912
LOOPS: 1000  BYTES: 16384    SECONDS: 0.084038   MBytes/sec: 389.918757
LOOPS: 1000  BYTES: 32768    SECONDS: 0.163161   MBytes/sec: 401.665104
LOOPS: 1000  BYTES: 65536    SECONDS: 0.306694   MBytes/sec: 427.370561
LOOPS: 1000  BYTES: 131072   SECONDS: 0.529589   MBytes/sec: 494.995011
LOOPS: 1000  BYTES: 262144   SECONDS: 0.952616   MBytes/sec: 550.366583
LOOPS: 1000  BYTES: 524288   SECONDS: 1.927987   MBytes/sec: 543.870859
LOOPS: 1000  BYTES: 1048576  SECONDS: 3.673732   MBytes/sec: 570.850562
LOOPS: 1000  BYTES: 2097152  SECONDS: 9.993185   MBytes/sec: 419.716435
LOOPS: 1000  BYTES: 4194304  SECONDS: 18.211958  MBytes/sec: 460.609893
LOOPS: 1000  BYTES: 8388608  SECONDS: 35.421490  MBytes/sec: 473.645124

My questions are:

a) Is Open MPI doing TCP/IP over IB in this case? (I guess so.)
b) Is it possible to improve these values significantly by changing the defaults? I have used several mca btl parameters but without improving the maximum bandwidth. For example:
   --mca btl mvapi --mca btl_mvapi_max_send_size 8388608
c) Is it possible that other IB hardware implementations have better performance with Open MPI?
d) Is it possible to use specific IB drivers for optimal performance? (We should reach almost 800 MB/sec.)

Thank you very much for your help
Best Regards,
Jean Latour
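The benchmark code itself is not attached here, but a minimal C ping-pong of the kind that produces these numbers (bytes counted in both directions of the round trip, which matches the reported MBytes/sec) might look like this; the message size and loop count are placeholders.

/* Minimal ping-pong sketch: rank 0 and rank 1 bounce a buffer of 'bytes'
 * back and forth 'loops' times and report the aggregate bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, loops = 1000, bytes = 4096;   /* placeholder values */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    double t0 = MPI_Wtime();
    for (i = 0; i < loops; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double secs = MPI_Wtime() - t0;
    if (rank == 0)
        printf("LOOPS: %d BYTES: %d SECONDS: %f MBytes/sec: %f\n",
               loops, bytes, secs, 2.0 * bytes * loops / secs / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}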
Re: [OMPI users] Performance of ping-pong using OpenMPI over Infiniband
Following your advice and the FAQ pages, I have added the file $(HOME)/.openmpi/mca-params.conf with:

btl_mvapi_flags=6
mpi_leave_pinned=1
pml_ob1_leave_pinned_pipeline=1
mpool_base_use_mem_hooks=1

The parameter btl_mvapi_eager_limit gives the best results when set to 8 K or 16 K. The ping-pong test result is now:

LOOPS: 1000  BYTES: 4096     SECONDS: 0.085643   MBytes/sec: 95.652825
LOOPS: 1000  BYTES: 8192     SECONDS: 0.050893   MBytes/sec: 321.931400
LOOPS: 1000  BYTES: 16384    SECONDS: 0.106791   MBytes/sec: 306.842281
LOOPS: 1000  BYTES: 32768    SECONDS: 0.154873   MBytes/sec: 423.159259
LOOPS: 1000  BYTES: 65536    SECONDS: 0.250849   MBytes/sec: 522.513526
LOOPS: 1000  BYTES: 131072   SECONDS: 0.443162   MBytes/sec: 591.530910
LOOPS: 1000  BYTES: 262144   SECONDS: 0.827640   MBytes/sec: 633.473448
LOOPS: 1000  BYTES: 524288   SECONDS: 1.596701   MBytes/sec: 656.714101
LOOPS: 1000  BYTES: 1048576  SECONDS: 3.134974   MBytes/sec: 668.953554
LOOPS: 1000  BYTES: 2097152  SECONDS: 6.210786   MBytes/sec: 675.325785
LOOPS: 1000  BYTES: 4194304  SECONDS: 12.384103  MBytes/sec: 677.369053
LOOPS: 1000  BYTES: 8388608  SECONDS: 27.377714  MBytes/sec: 612.805580

which is exactly what we can also get with MVAPICH on the same network. Since we do NOT have PCI-X hardware, I believe this is the maximum we can get from the hardware.

Thanks a lot for your explanations of this tuning of Open MPI.
Best Regards,
Jean

George Bosilca wrote:

On Thu, 16 Mar 2006, Jean Latour wrote:

a) Is Open MPI doing TCP/IP over IB in this case? (I guess so.)

If the path to the mvapi library is correct then Open MPI will use mvapi, not TCP over IB. There is a simple way to check: "ompi_info --param btl mvapi" will print all the parameters attached to the mvapi driver. If there is no mvapi in the output, then mvapi was not correctly detected. But I don't think that's the case, because if I remember well we have a protection at configure time: if you specify one of the drivers and we're not able to correctly use the libraries, we stop the configure.

b) Is it possible to improve these values significantly by changing the defaults?

By default we are using a very conservative approach: we never leave the memory pinned down, and that decreases the performance for a ping-pong. There are pros and cons for that, too long to be explained here, but in general we're seeing better performance for real-life applications with our default approach, and that's our main goal. Now, if you want to get better performance for the ping-pong test please read the FAQ at http://www.open-mpi.org/faq/?category=infiniband. These are the 3 flags that affect the mvapi performance for the ping-pong case (add them in $(HOME)/.openmpi/mca-params.conf):

btl_mvapi_flags=6
mpi_leave_pinned=1
pml_ob1_leave_pinned_pipeline=1

I have used several mca btl parameters but without improving the maximum bandwidth. For example:
--mca btl mvapi --mca btl_mvapi_max_send_size 8388608

It is difficult to improve the maximum bandwidth without leave_pinned activated, but you can improve the bandwidth for medium-size messages. Play with btl_mvapi_eager_limit to set the limit between the short and rendez-vous protocols. "ompi_info --param btl mvapi" will give you a full list of parameters as well as their descriptions.

c) Is it possible that other IB hardware implementations have better performance with Open MPI?

The maximum bandwidth depends on several factors. One of the most important is the maximum bandwidth of your node's bus. To reach 800 MB/s and more you definitely need a PCI-X 16 ...

d) Is it possible to use specific IB drivers for optimal performance? (Should reach almost 800 MB/sec.)

Once the 3 options are set, you should see an improvement in the bandwidth. Let me know if it does not solve your problems.

george.

"We must accept finite disappointment, but we must never lose infinite hope." Martin Luther King
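For completeness, the same MCA parameters discussed in this thread can also be passed on the mpirun command line instead of through mca-params.conf; a sketch, where ./pingpong stands for the (hypothetical) benchmark executable:

mpirun -np 2 -hostfile machinefile \
       --mca btl mvapi \
       --mca btl_mvapi_flags 6 \
       --mca mpi_leave_pinned 1 \
       --mca pml_ob1_leave_pinned_pipeline 1 \
       --mca btl_mvapi_eager_limit 16384 \
       ./pingpong

Command-line settings take effect only for that run, which makes them convenient for experimenting with values such as btl_mvapi_eager_limit before committing them to the configuration file.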
Re: [OMPI users] How to establish communication between two separate COM WORLD
Hello,

It seems to me there is only one way to create a communication path between two MPI_COMM_WORLDs: use MPI_Open_port (with a specific IP + port address) and then MPI_Comm_connect / MPI_Comm_accept. To ease communicating the port name, the use of MPI_Publish_name / MPI_Lookup_name is also possible, with the constraint that the "publish" must be done before the "lookup"; this involves some synchronization between the processes anyway.

Simple examples can be found in the handbook on MPI: "Using MPI-2" by William Gropp et al.

Best Regards,
Jean

Ali Eghlima wrote:

Hello,

I have read the MPI-2 documents as well as the FAQ. I am confused about the best way to establish communication between two MPI_COMM_WORLDs which have been created by two mpiexec calls on the same node:

mpiexec -conf config1    (this starts 20 processes on 7 nodes)
mpiexec -conf config2    (this starts 18 processes on 5 nodes)

I would appreciate any comments or a pointer to a document or example.

Thanks
Ali,
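A minimal C sketch of the publish/lookup variant described above; the service name "job_bridge" is an arbitrary illustration, and depending on the MPI implementation a common name service must be reachable from both mpiexec jobs for the lookup to succeed.

/* One process of the first job acts as "server", one process of the
 * second job as "client"; the result is an intercommunicator joining
 * the two MPI_COMM_WORLDs. */
#include <mpi.h>

void server_side(MPI_Comm *inter)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);
    MPI_Publish_name("job_bridge", MPI_INFO_NULL, port);   /* publish first */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, inter);
    MPI_Unpublish_name("job_bridge", MPI_INFO_NULL, port);
    MPI_Close_port(port);
}

void client_side(MPI_Comm *inter)
{
    char port[MPI_MAX_PORT_NAME];
    /* The lookup must happen after the server's publish has completed. */
    MPI_Lookup_name("job_bridge", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, inter);
}

The ordering constraint mentioned above applies here as well: server_side must have returned from MPI_Publish_name before client_side calls MPI_Lookup_name, so some external synchronization (or a retry loop on the lookup) is still needed.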