Hi Ralph, thank you very much for the detailed response.
I have to apologize, I was not clear: I would like to use the MPI_Comm_spawn_multiple function (I've attached the example programs I use).

In any case I tried your test program, just compiling it with:

/home/fandreasi/openmpi-1.7/bin/mpicc loop_spawn.c -o loop_spawn
/home/fandreasi/openmpi-1.7/bin/mpicc loop_child.c -o loop_child

and executing it on a single machine with:

/home/fandreasi/openmpi-1.7/bin/mpiexec ./loop_spawn ./loop_child

but it hangs at different loop iterations after printing "Child 26833:exiting", and looking at top both processes (loop_spawn and loop_child) are still alive. (See the sketch after the quoted thread below for what such a spawn loop might look like.)

I'm starting to think that I have some environment setting that is not correct, or that I need to compile OpenMPI with some extra options. I compile it just passing the --prefix option to ./configure. Do I need to do anything else? I have a Linux CentOS 4, 64-bit machine with gcc 3.4. I think that this is my main problem now.

Just to answer the other (minor) topics:
- Regarding the version mismatch: I use a Linux cluster where the /home/ directory is shared among the compute nodes, and I've edited my .bashrc and .bash_profile to export the correct LD_LIBRARY_PATH.
- Thank you for the useful trick about svn.

Thank you very much!

Federico

On 5 March 2011 19:05, Ralph Castain <r...@open-mpi.org> wrote:

> Hi Federico
>
> I tested the trunk today and it works fine for me - I let it spin for 1000
> cycles without issue. My test program is essentially identical to what you
> describe - you can see it in the orte/test/mpi directory. The "master" is
> loop_spawn.c, and the "slave" is loop_child.c. I only tested it on a single
> machine, though - will have to test multi-machine later. You might see if
> that makes a difference.
>
> The error you report in your attachment is a classic symptom of mismatched
> versions. Remember, we don't forward your LD_LIBRARY_PATH, so it has to be
> correct on your remote machine.
>
> As for r22794 - we don't keep anything that old on our web site. If you
> want to build it, the best way to get the code is to do a subversion
> checkout of the developer's trunk at that revision level:
>
> svn co -r 22794 http://svn.open-mpi.org/svn/ompi/trunk
>
> Remember to run autogen before configure.
>
>
> On Mar 4, 2011, at 4:43 AM, Federico Golfrè Andreasi wrote:
>
> Hi Ralph,
>
> I'm getting stuck with the spawning stuff.
>
> I've downloaded the snapshot from the trunk of 1st of March
> (openmpi-1.7a1r24472.tar.bz2), and I'm testing it using a small program
> that does the following:
> - the master program starts and each rank prints its hostname
> - the master program spawns a slave program with the same size
> - each rank of the slave (spawned) program prints its hostname
> - end
> It is not always able to complete the program run; I see two different
> behaviours:
> 1. not all the slaves print their hostname and the program ends suddenly
> 2. both programs end correctly but the orted daemon is still alive and I
> need to press Ctrl-C to exit
>
> I've tried to recompile my test program with a previous snapshot
> (openmpi-1.7a1r22794.tar.bz2) for which I only have the compiled version
> of OpenMPI (on another machine). It gives me an error before starting
> (I've attached it).
> Going through the FAQ I found some tips, and I verified that I compile
> the program with the correct OpenMPI version and that the LD_LIBRARY_PATH
> is consistent.
> So I would like to recompile openmpi-1.7a1r22794.tar.bz2, but where can I
> find it?
>
> Thank you,
> Federico
>
>
> On 23 February 2011 03:43, Ralph Castain <rhc.open...@gmail.com> wrote:
>
>> Apparently not. I will investigate when I return from vacation next week.
>>
>> Sent from my iPad
>>
>> On Feb 22, 2011, at 12:42 AM, Federico Golfrè Andreasi
>> <federico.gol...@gmail.com> wrote:
>>
>> Hi Ralph,
>>
>> I've tested spawning with the OpenMPI 1.5 release but that fix is not there.
>> Are you sure you've added it?
>>
>> Thank you,
>> Federico
>>
>>
>> 2010/10/19 Ralph Castain <r...@open-mpi.org>
>>
>>> The fix should be there - it just didn't get mentioned.
>>>
>>> Let me know if it isn't and I'll ensure it is in the next one... but I'd
>>> be very surprised if it isn't already in there.
>>>
>>> On Oct 19, 2010, at 3:03 AM, Federico Golfrè Andreasi wrote:
>>>
>>> Hi Ralph!
>>>
>>> I saw that the new release 1.5 is out.
>>> I didn't find this fix in the "list of changes"; is it present but not
>>> mentioned since it is a minor fix?
>>>
>>> Thank you,
>>> Federico
>>>
>>>
>>> 2010/4/1 Ralph Castain <r...@open-mpi.org>
>>>
>>>> Hi there!
>>>>
>>>> It will be in the 1.5.0 release, but not 1.4.2 (couldn't backport the
>>>> fix). I understand that will come out sometime soon, but no firm date
>>>> has been set.
>>>>
>>>> On Apr 1, 2010, at 4:05 AM, Federico Golfrè Andreasi wrote:
>>>>
>>>> Hi Ralph,
>>>>
>>>> I've downloaded and tested the openmpi-1.7a1r22817 snapshot,
>>>> and it works fine for (multiple) spawning of more than 128 processes.
>>>>
>>>> That fix will be included in the next release of OpenMPI, right?
>>>> Do you know when it will be released? Or where can I find that info?
>>>>
>>>> Thank you,
>>>> Federico
>>>>
>>>>
>>>> 2010/3/1 Ralph Castain <r...@open-mpi.org>
>>>>
>>>>> http://www.open-mpi.org/nightly/trunk/
>>>>>
>>>>> I'm not sure this patch will solve your problem, but it is worth a try.
>>>>>
>
> <OpenMPI.error>
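For reference, here is a minimal sketch of the kind of spawn loop discussed above. It is not the actual orte/test/mpi/loop_spawn.c source; the file name, the cycle count, and the exact structure are assumptions. The general pattern is: repeatedly spawn one child, let it exit, disconnect, and spawn again.

/* loop_master.cpp -- hedged sketch only, not the real loop_spawn.c */
#include "mpi.h"
#include <iostream>
using namespace std;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    // The child executable is given as the first argument,
    // e.g.  mpiexec ./loop_master ./loop_worker
    const int cycles = 1000;                         // assumed cycle count
    for (int i = 0; i < cycles; i++) {
        MPI_Comm child;
        MPI_Comm_spawn(argv[1], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
        cout << "parent: cycle " << i << " spawned " << argv[1] << endl;
        MPI_Comm_disconnect(&child);                 // drop the link before the next cycle
    }

    MPI_Finalize();
    return 0;
}

The child side of such a test would obtain the parent intercommunicator with MPI_Comm_get_parent and disconnect from it before finalizing; a small sketch of that call sequence is given after the attached slave program below.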
/*
 *  PROGRAM TEST for MPI_COMM_SPAWN_MULTIPLE
 *
 *  Prototype program that simulates the spawn process needed for the SJI MT Domain Manager.
 *  The manager must be executed with the worker's executable as its first input parameter.
 *
 *  Updated for OpenMPI-1.4.0
 *
 *  Program: MASTER
 *
 *  Author:  Federico Golfre' Andreasi
 *  Created: 28/01/2010
 */

#include "mpi.h"
#include <cstdlib>
#include <iostream>
using namespace std;

int main ( int argc, char* argv[] )
{
    int      rank, size;
    char     local_host[MPI_MAX_PROCESSOR_NAME];
    int      local_host_len;
    MPI_Comm intercomm;

    // *** MPI SESSION ***
    // Initialization of the MPI session
    MPI_Init(&argc,&argv);

    // *** GET INFORMATION ABOUT THE WORLD COMMUNICATOR ***
    // Get the size and the rank within the communicator
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    if (rank==0) cout<<"\n***** MASTER (SPAWNING) ****\n";
    MPI_Barrier(MPI_COMM_WORLD);

    // Get the name of the host
    MPI_Get_processor_name(local_host,&local_host_len);
    cout<<" Rank "<<rank<<" runs on host: "<<local_host<<"\n";
    MPI_Barrier(MPI_COMM_WORLD);

    // *** DEFINITION OF VARIABLES ***
    char     *commands[size];
    int      procs[size];
    MPI_Info infos[size];
    char     hosts[size][MPI_MAX_PROCESSOR_NAME];

    // Gather the host names on rank 0
    // MPI_Allgather(local_host,MPI_MAX_PROCESSOR_NAME,MPI_CHAR,hosts,MPI_MAX_PROCESSOR_NAME,MPI_CHAR,MPI_COMM_WORLD);
    MPI_Gather(local_host,MPI_MAX_PROCESSOR_NAME,MPI_CHAR,hosts,MPI_MAX_PROCESSOR_NAME,MPI_CHAR,0,MPI_COMM_WORLD);
    if (rank==0) {
        for (int i=0;i<size;i++) {
            commands[i]=argv[1];
            procs[i]=1;
            MPI_Info_create(&infos[i]);
            MPI_Info_set(infos[i],"host",hosts[i]);
            cout<<" child "<<i<<" will go on host "<<hosts[i]<<endl;
        }
    }

    // *** EXECUTING THE SLAVE PROGRAM ***
    MPI_Barrier(MPI_COMM_WORLD);
    if ( rank==0 ) cout<<"\t spawning the slave program "<<argv[1]<<" ...\n";

    // Launch the slaves and check for errors
    int spawn_errors[size];
    MPI_Comm_spawn_multiple(size,commands,MPI_ARGVS_NULL,procs,infos,0,MPI_COMM_WORLD,&intercomm,spawn_errors);
    if (rank==0) {
        for ( int i=0;i<size;i++ ) {
            if ( spawn_errors[i]!=MPI_SUCCESS ) cout<<"ERROR with spawning process number "<<i<<endl;
        }
    }

    // Destroy all the Info objects
    if (rank==0) {
        for (int i=0;i<size;i++) MPI_Info_free(&infos[i]);
    }

    // Inform that the spawning process is completed
    if (rank==0) cout<<"\t spawning process complete;\n";

    // *** END OF THE PROGRAM AND MPI SESSION ***
    if (rank==0) cout<<"**** THE MASTER END ****\n\n";
    MPI_Finalize();
    return EXIT_SUCCESS;
}
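One hedged observation about the attached master, not a claimed fix for the reported hang: the intercommunicator returned by MPI_Comm_spawn_multiple stays open until MPI_Finalize. A variant that severs the link explicitly once the children are no longer needed could add the line below; note that MPI_Comm_disconnect is collective over all connected processes, so the spawned slaves would have to make the matching call (see the note after the slave program).

    // Hypothetical addition, not part of the attached program:
    // drop the connection to the spawned group once it is no longer needed.
    MPI_Comm_disconnect(&intercomm);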
/*
 *  PROGRAM TEST for MPI_COMM_SPAWN_MULTIPLE
 *
 *  Prototype program that simulates the spawn process needed for the SJI MT Domain Manager.
 *
 *  Updated for OpenMPI-1.4.0
 *
 *  Program: SLAVE
 *
 *  Author:  Federico Golfre' Andreasi
 *  Created: 28/01/2010
 */

#include "mpi.h"
#include <cstdlib>
#include <iostream>
using namespace std;

#define MAX_PROCESSOR_NAME 255

int main (int argc, char *argv[])
{
    int  worker_rank, worker_size;
    char local_host[MAX_PROCESSOR_NAME];
    int  local_host_len;

    // *** MPI SESSION ***
    // Initialization of the MPI session
    MPI_Init(&argc,&argv);

    // *** GET INFORMATION ABOUT THE WORKER WORLD COMMUNICATOR ***
    // Get the size and the rank within the worker communicator
    MPI_Comm_rank(MPI_COMM_WORLD,&worker_rank);
    MPI_Comm_size(MPI_COMM_WORLD,&worker_size);
    if (worker_rank==0) cout<<"\n***** SLAVE (SPAWNED) ****\n";
    MPI_Barrier(MPI_COMM_WORLD);

    // Get the name of the host
    MPI_Get_processor_name(local_host,&local_host_len);
    cout<<" Rank "<<worker_rank<<" runs on host: "<<local_host<<" (argc="<<argc<<")\n";
    MPI_Barrier(MPI_COMM_WORLD);

    // *** END OF THE PROGRAM AND MPI SESSION ***
    if (worker_rank==0) cout<<"**** THE SLAVE END ****\n\n";
    MPI_Finalize();
    return EXIT_SUCCESS;
}
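The attached slave never retrieves the intercommunicator to its parent. If the master were changed to disconnect as sketched above, the slave would need the matching collective call; a minimal sketch, assuming it is placed just before MPI_Finalize:

    // Hypothetical addition, not part of the attached program.
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);        // MPI_COMM_NULL if this process was not spawned
    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);    // matches the master's MPI_Comm_disconnect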