Yeah, I'm seeing the hang as well when running across multiple machines. Let me dig a little and get this fixed.
Thanks
Ralph

On Aug 28, 2012, at 4:51 PM, Brian Budge <brian.bu...@gmail.com> wrote:

> Hmmm, I went to the build directories of openmpi for my two machines,
> went into the orte/test/mpi directory and made the executables on both
> machines.  I set the hostsfile in the env variable on the "master"
> machine.
>
> Here's the output:
>
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> ./simple_spawn
> Parent [pid 97504] starting up!
> 0 completed MPI_Init
> Parent [pid 97504] about to spawn!
> Parent [pid 97507] starting up!
> Parent [pid 97508] starting up!
> Parent [pid 30626] starting up!
> ^C
> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
>
> I had to ^C to kill the hung process.
>
> When I run using mpirun:
>
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> mpirun -np 1 ./simple_spawn
> Parent [pid 97511] starting up!
> 0 completed MPI_Init
> Parent [pid 97511] about to spawn!
> Parent [pid 97513] starting up!
> Parent [pid 30762] starting up!
> Parent [pid 30764] starting up!
> Parent done with spawn
> Parent sending message to child
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
> Child 1 disconnected
> Child 0 received msg: 38
> Child 0 disconnected
> Parent disconnected
> Child 2 disconnected
> 97511: exiting
> 97513: exiting
> 30762: exiting
> 30764: exiting
>
> As you can see, I'm using openmpi v 1.6.1.  I just barely freshly
> installed on both machines using the default configure options.
>
> Thanks for all your help.
>
>  Brian
>
> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Looks to me like it didn't find your executable - could be a question of
>> where it exists relative to where you are running. If you look in your OMPI
>> source tree at the orte/test/mpi directory, you'll see an example program
>> "simple_spawn.c" there. Just "make simple_spawn" and execute that with your
>> default hostfile set - does it work okay?
>>
>> It works fine for me, hence the question.
>>
>> Also, what OMPI version are you using?
>>
>> On Aug 28, 2012, at 4:25 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>
>>> I see.  Okay.  So, I just tried removing the check for universe size,
>>> and set the universe size to 2.  Here's my output:
>>>
>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file
>>> base/plm_base_receive.c at line 253
>>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified
>>> application failed to start in file dpm_orte.c at line 785
>>>
>>> The corresponding run with mpirun still works.
>>>
>>> Thanks,
>>>  Brian
>>>
>>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> I see the issue - it's here:
>>>>
>>>>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>
>>>>> if(!flag) {
>>>>>   std::cerr << "no universe size" << std::endl;
>>>>>   return -1;
>>>>> }
>>>>> universeSize = *puniverseSize;
>>>>> if(universeSize == 1) {
>>>>>   std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>> }
>>>>
>>>> The universe size is set to 1 on a singleton because the attribute gets
>>>> set at the beginning of time - we haven't any way to go back and change
>>>> it. The sequence of events explains why. The singleton starts up and sets
>>>> its attributes, including universe_size. It also spins off an orte daemon
>>>> to act as its own private "mpirun" in case you call comm_spawn. At this
>>>> point, however, no hostfile has been read - the singleton is just an MPI
>>>> proc doing its own thing, and the orte daemon is just sitting there on
>>>> "stand-by".
>>>>
>>>> When your app calls comm_spawn, then the orte daemon gets called to launch
>>>> the new procs. At that time, it (not the original singleton!) reads the
>>>> hostfile to find out how many nodes are around, and then does the launch.
>>>>
>>>> You are trying to check the number of nodes from within the singleton,
>>>> which won't work - it has no way of discovering that info.
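For reference, a minimal sketch of the workaround Brian describes above: drop the
MPI_UNIVERSE_SIZE check and take the slave count from something the singleton can
actually see. The NSLAVES variable and the ./slave_exe name are placeholders, not
anything from this thread or from Open MPI itself; the mapping onto the hostfile
is still done by the orte daemon at spawn time, exactly as explained above.

#include <mpi.h>
#include <cstdlib>
#include <iostream>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // MPI_UNIVERSE_SIZE is fixed at 1 for a singleton, so take the count from
  // a hypothetical NSLAVES environment variable instead.
  const char *env = std::getenv("NSLAVES");
  int nslaves = env ? std::atoi(env) : 1;

  char cmd[] = "./slave_exe";   // placeholder slave executable
  MPI_Comm children;
  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, nslaves, MPI_INFO_NULL,
                 0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

  std::cerr << "spawned " << nslaves << " slaves" << std::endl;

  MPI_Finalize();
  return 0;
}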
>>>>
>>>> On Aug 28, 2012, at 2:38 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>
>>>>>> echo hostsfile
>>>>> localhost
>>>>> budgeb-sandybridge
>>>>>
>>>>> Thanks,
>>>>>  Brian
>>>>>
>>>>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Hmmm...what is in your "hostsfile"?
>>>>>>
>>>>>> On Aug 28, 2012, at 2:33 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ralph -
>>>>>>>
>>>>>>> Thanks for confirming this is possible.  I'm trying this and currently
>>>>>>> failing.  Perhaps there's something I'm missing in the code to make
>>>>>>> this work.  Here are the two instantiations and their outputs:
>>>>>>>
>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>> cannot start slaves... not enough nodes
>>>>>>>
>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe
>>>>>>> master spawned 1 slaves...
>>>>>>> slave responding...
>>>>>>>
>>>>>>> The code:
>>>>>>>
>>>>>>> //master.cpp
>>>>>>> #include <mpi.h>
>>>>>>> #include <boost/filesystem.hpp>
>>>>>>> #include <iostream>
>>>>>>> #include <cstring>    // for memcpy
>>>>>>> #include <alloca.h>   // for alloca
>>>>>>>
>>>>>>> int main(int argc, char **args) {
>>>>>>>   int worldSize, universeSize, *puniverseSize, flag;
>>>>>>>
>>>>>>>   MPI_Comm everyone; //intercomm
>>>>>>>   boost::filesystem::path curPath =
>>>>>>>       boost::filesystem::absolute(boost::filesystem::current_path());
>>>>>>>
>>>>>>>   std::string toRun = (curPath / "slave_exe").string();
>>>>>>>
>>>>>>>   int ret = MPI_Init(&argc, &args);
>>>>>>>
>>>>>>>   if(ret != MPI_SUCCESS) {
>>>>>>>     std::cerr << "failed init" << std::endl;
>>>>>>>     return -1;
>>>>>>>   }
>>>>>>>
>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
>>>>>>>
>>>>>>>   if(worldSize != 1) {
>>>>>>>     std::cerr << "too many masters" << std::endl;
>>>>>>>   }
>>>>>>>
>>>>>>>   MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>
>>>>>>>   if(!flag) {
>>>>>>>     std::cerr << "no universe size" << std::endl;
>>>>>>>     return -1;
>>>>>>>   }
>>>>>>>   universeSize = *puniverseSize;
>>>>>>>   if(universeSize == 1) {
>>>>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>   }
>>>>>>>
>>>>>>>   char *buf = (char*)alloca(toRun.size() + 1);
>>>>>>>   memcpy(buf, toRun.c_str(), toRun.size());
>>>>>>>   buf[toRun.size()] = '\0';
>>>>>>>
>>>>>>>   MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL,
>>>>>>>                  0, MPI_COMM_SELF, &everyone,
>>>>>>>                  MPI_ERRCODES_IGNORE);
>>>>>>>
>>>>>>>   std::cerr << "master spawned " << universeSize-1 << " slaves..."
>>>>>>>             << std::endl;
>>>>>>>
>>>>>>>   MPI_Finalize();
>>>>>>>
>>>>>>>   return 0;
>>>>>>> }
>>>>>>>
>>>>>>> //slave.cpp
>>>>>>> #include <mpi.h>
>>>>>>> #include <iostream>   // for std::cerr below
>>>>>>>
>>>>>>> int main(int argc, char **args) {
>>>>>>>   int size;
>>>>>>>   MPI_Comm parent;
>>>>>>>   MPI_Init(&argc, &args);
>>>>>>>
>>>>>>>   MPI_Comm_get_parent(&parent);
>>>>>>>
>>>>>>>   if(parent == MPI_COMM_NULL) {
>>>>>>>     std::cerr << "slave has no parent" << std::endl;
>>>>>>>   }
>>>>>>>   MPI_Comm_remote_size(parent, &size);
>>>>>>>   if(size != 1) {
>>>>>>>     std::cerr << "parent size is " << size << std::endl;
>>>>>>>   }
>>>>>>>
>>>>>>>   std::cerr << "slave responding..." << std::endl;
>>>>>>>
>>>>>>>   MPI_Finalize();
>>>>>>>
>>>>>>>   return 0;
>>>>>>> }
>>>>>>>
>>>>>>> Any ideas?  Thanks for any help.
>>>>>>>
>>>>>>>  Brian
>>>>>>>
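As a cross-check against the simple_spawn output earlier in the thread, here is a
minimal, self-contained sketch (not code from the thread) of a parent/child pair
that also passes one message across the spawn intercommunicator, so a failed
connection or hang shows up right away. The ./child_exe name, the child count,
and the message value are placeholders.

// parent.cpp
#include <mpi.h>
#include <iostream>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  char cmd[] = "./child_exe";            // placeholder child binary
  MPI_Comm children;
  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                 0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
  std::cerr << "parent done with spawn" << std::endl;

  int msg = 38;
  MPI_Send(&msg, 1, MPI_INT, 0, 0, children);   // to child rank 0
  MPI_Comm_disconnect(&children);
  MPI_Finalize();
  return 0;
}

// child.cpp
#include <mpi.h>
#include <iostream>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);
  if (parent == MPI_COMM_NULL) {
    std::cerr << "child has no parent" << std::endl;
    MPI_Finalize();
    return 1;
  }

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    int msg = 0;
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
    std::cerr << "child 0 received msg: " << msg << std::endl;
  }
  MPI_Comm_disconnect(&parent);
  MPI_Finalize();
  return 0;
}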
>>>>>>> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> It really is just that simple :-)
>>>>>>>>
>>>>>>>> On Aug 22, 2012, at 8:56 AM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Okay.  Is there a tutorial or FAQ for setting everything up?  Or is it
>>>>>>>>> really just that simple?  I don't need to run a copy of the orte
>>>>>>>>> server somewhere?
>>>>>>>>>
>>>>>>>>> if my current ip is 192.168.0.1,
>>>>>>>>>
>>>>>>>>> 0 > echo 192.168.0.11 > /tmp/hostfile
>>>>>>>>> 1 > echo 192.168.0.12 >> /tmp/hostfile
>>>>>>>>> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
>>>>>>>>> 3 > ./mySpawningExe
>>>>>>>>>
>>>>>>>>> At this point, mySpawningExe will be the master, running on
>>>>>>>>> 192.168.0.1, and I can have spawned, for example, childExe on
>>>>>>>>> 192.168.0.11 and 192.168.0.12?  Or childExe1 on 192.168.0.11 and
>>>>>>>>> childExe2 on 192.168.0.12?
>>>>>>>>>
>>>>>>>>> Thanks for the help.
>>>>>>>>>
>>>>>>>>>  Brian
>>>>>>>>>
>>>>>>>>> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> Sure, that's still true on all 1.3 or above releases. All you need
>>>>>>>>>> to do is set the hostfile envar so we pick it up:
>>>>>>>>>>
>>>>>>>>>> OMPI_MCA_orte_default_hostfile=<foo>
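Picking up Brian's question above about childExe1 on 192.168.0.11 and childExe2
on 192.168.0.12: one way to express that, sketched here only as an illustration,
is a single MPI_Comm_spawn_multiple call with a "host" info key per command (the
same info key Ralph describes further down in this thread). The executable names
and addresses are the placeholders from the question, and both hosts still have
to appear in the hostfile named by OMPI_MCA_orte_default_hostfile.

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  char cmd1[] = "./childExe1", cmd2[] = "./childExe2";   // placeholders
  char *cmds[2] = { cmd1, cmd2 };
  int nprocs[2] = { 1, 1 };                              // one copy of each

  // One info object per command, each pinning that command to a host.
  char key[] = "host";
  char host1[] = "192.168.0.11", host2[] = "192.168.0.12";
  MPI_Info info[2];
  MPI_Info_create(&info[0]);  MPI_Info_set(info[0], key, host1);
  MPI_Info_create(&info[1]);  MPI_Info_set(info[1], key, host2);

  MPI_Comm children;
  MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, nprocs, info,
                          0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

  MPI_Info_free(&info[0]);
  MPI_Info_free(&info[1]);
  MPI_Finalize();
  return 0;
}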
>>>>>>>>>>
>>>>>>>>>> On Aug 21, 2012, at 7:23 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi.  I know this is an old thread, but I'm curious if there are any
>>>>>>>>>>> tutorials describing how to set this up?  Is this still available on
>>>>>>>>>>> newer open mpi versions?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>  Brian
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain <r...@lanl.gov> wrote:
>>>>>>>>>>>> Hi Elena
>>>>>>>>>>>>
>>>>>>>>>>>> I'm copying this to the user list just to correct a mis-statement on my
>>>>>>>>>>>> part in an earlier message that went there. I had stated that a singleton
>>>>>>>>>>>> could comm_spawn onto other nodes listed in a hostfile by setting an
>>>>>>>>>>>> environmental variable that pointed us to the hostfile.
>>>>>>>>>>>>
>>>>>>>>>>>> This is incorrect in the 1.2 code series. That series does not allow
>>>>>>>>>>>> singletons to read a hostfile at all. Hence, any comm_spawn done by a
>>>>>>>>>>>> singleton can only launch child processes on the singleton's local host.
>>>>>>>>>>>>
>>>>>>>>>>>> This situation has been corrected for the upcoming 1.3 code series. For the
>>>>>>>>>>>> 1.2 series, though, you will have to do it via an mpirun command line.
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the confusion - I sometimes have too many code families to keep
>>>>>>>>>>>> straight in this old mind!
>>>>>>>>>>>>
>>>>>>>>>>>> Ralph
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/4/08 5:10 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for the explanations.
>>>>>>>>>>>>> But I still do not get it running...
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the case
>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>>>>>>>>>>>> everything works.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the case
>>>>>>>>>>>>> ./my_master.exe
>>>>>>>>>>>>> it does not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I did:
>>>>>>>>>>>>> - create my_hostfile and put it in the $HOME/.openmpi/components/
>>>>>>>>>>>>> my_hostfile :
>>>>>>>>>>>>> bollenstreek slots=2 max_slots=3
>>>>>>>>>>>>> octocore01 slots=8 max_slots=8
>>>>>>>>>>>>> octocore02 slots=8 max_slots=8
>>>>>>>>>>>>> clstr000 slots=2 max_slots=3
>>>>>>>>>>>>> clstr001 slots=2 max_slots=3
>>>>>>>>>>>>> clstr002 slots=2 max_slots=3
>>>>>>>>>>>>> clstr003 slots=2 max_slots=3
>>>>>>>>>>>>> clstr004 slots=2 max_slots=3
>>>>>>>>>>>>> clstr005 slots=2 max_slots=3
>>>>>>>>>>>>> clstr006 slots=2 max_slots=3
>>>>>>>>>>>>> clstr007 slots=2 max_slots=3
>>>>>>>>>>>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I put it in .tcshrc and
>>>>>>>>>>>>> then source .tcshrc)
>>>>>>>>>>>>> - in my_master.cpp I did
>>>>>>>>>>>>> MPI_Info info1;
>>>>>>>>>>>>> MPI_Info_create(&info1);
>>>>>>>>>>>>> char* hostname =
>>>>>>>>>>>>> "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>>>>>>>>>>> MPI_Info_set(info1, "host", hostname);
>>>>>>>>>>>>>
>>>>>>>>>>>>> _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
>>>>>>>>>>>>> MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>
>>>>>>>>>>>>> - After I call the executable, I've got this error message
>>>>>>>>>>>>>
>>>>>>>>>>>>> bollenstreek: > ./my_master
>>>>>>>>>>>>> number of processes to run: 1
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> Some of the requested hosts are not included in the current allocation for
>>>>>>>>>>>>> the application:
>>>>>>>>>>>>> ./childexe
>>>>>>>>>>>>> The requested hosts were:
>>>>>>>>>>>>> clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>>>>>>>>>>>>
>>>>>>>>>>>>> Verify that you have mapped the allocated resources properly using the
>>>>>>>>>>>>> --host specification.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>> base/rmaps_base_support_fns.c at line 225
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>> rmaps_rr.c at line 478
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>> base/rmaps_base_map_job.c at line 210
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>> rmgr_urm.c at line 372
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>>>>>>>>>>>> communicator/comm_dyn.c at line 608
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did I miss something?
>>>>>>>>>>>>> Thanks for help!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>>>>>>>>>>>> To: Elena Zhebel; Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/07 7:35 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot! Now it works!
>>>>>>>>>>>>>> The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass
>>>>>>>>>>>>>> MPI_Info Key to the Spawn function!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One more question: is it necessary to start my "master" program with
>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, it isn't necessary - assuming that my_master_host is the first host
>>>>>>>>>>>>> listed in your hostfile! If you are only executing one my_master.exe (i.e.,
>>>>>>>>>>>>> you gave -n 1 to mpirun), then we will automatically map that process onto
>>>>>>>>>>>>> the first host in your hostfile.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you want my_master.exe to go on someone other than the first host in the
>>>>>>>>>>>>> file, then you have to give us the -host option.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are there other possibilities for easy start?
>>>>>>>>>>>>>> I would say just to run ./my_master.exe , but then the master process
>>>>>>>>>>>>>> doesn't know about the available in the network hosts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can set the hostfile parameter in your environment instead of on the
>>>>>>>>>>>>> command line. Just set OMPI_MCA_rds_hostfile_path = my.hosts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can then just run ./my_master.exe on the host where you want the master
>>>>>>>>>>>>> to reside - everything should work the same.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just as an FYI: the name of that environmental variable is going to change
>>>>>>>>>>>>> in the 1.3 release, but everything will still work the same.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 5:49 PM
>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>; Elena Zhebel
>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/17/07 8:19 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for your answer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using OpenMPI 1.2.3. , compiler glibc232, Linux Suse 10.0.
>>>>>>>>>>>>>>> My "master" executable runs only on the one local host, then it spawns
>>>>>>>>>>>>>>> "slaves" (with MPI::Intracomm::Spawn).
>>>>>>>>>>>>>>> My question was: how to determine the hosts where these "slaves" will be
>>>>>>>>>>>>>>> spawned?
>>>>>>>>>>>>>>> You said: "You have to specify all of the hosts that can be used by your
>>>>>>>>>>>>>>> job in the original hostfile". How can I specify the host file? I can not
>>>>>>>>>>>>>>> find it in the documentation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hmmm...sorry about the lack of documentation. I always assumed that the MPI
>>>>>>>>>>>>>> folks in the project would document such things since it has little to do
>>>>>>>>>>>>>> with the underlying run-time, but I guess that fell through the cracks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are two parts to your question:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. how to specify the hosts to be used for the entire job. I believe that is
>>>>>>>>>>>>>> somewhat covered here:
>>>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That FAQ tells you what a hostfile should look like, though you may already
>>>>>>>>>>>>>> know that. Basically, we require that you list -all- of the nodes that both
>>>>>>>>>>>>>> your master and slave programs will use.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. how to specify which nodes are available for the master, and which for
>>>>>>>>>>>>>> the slave.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You would specify the host for your master on the mpirun command line with
>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This directs Open MPI to map that specified executable on the specified
>>>>>>>>>>>>>> host - note that my_master_host must have been in my_hostfile.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Inside your master, you would create an MPI_Info key "host" that has a
>>>>>>>>>>>>>> value consisting of a string "host1,host2,host3" identifying the hosts you
>>>>>>>>>>>>>> want your slave to execute upon. Those hosts must have been included in
>>>>>>>>>>>>>> my_hostfile. Include that key in the MPI_Info array passed to your Spawn.
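A small sketch of the MPI_Info arrangement just described, using the C bindings
rather than the C++ MPI::Intracomm::Spawn call used elsewhere in this thread.
The host names and the ./childexe path are placeholders; as stated above, every
host listed here must also appear in the my_hostfile given to the mpirun command
that launches the master.

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Hosts the slaves should land on; all of them must also be listed in the
  // hostfile used to launch the master (mpirun -n 1 -hostfile my_hostfile ...).
  char key[] = "host";
  char hosts[] = "host1,host2,host3";
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, key, hosts);

  char cmd[] = "./childexe";            // placeholder slave executable
  MPI_Comm slaves;
  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 3, info,
                 0, MPI_COMM_SELF, &slaves, MPI_ERRCODES_IGNORE);

  MPI_Info_free(&info);
  MPI_Finalize();
  return 0;
}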
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We don't currently support providing a hostfile for the slaves (as opposed
>>>>>>>>>>>>>> to the host-at-a-time string above). This may become available in a future
>>>>>>>>>>>>>> release - TBD.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>>>>>>>>>>>>>> Behalf Of Ralph H Castain
>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 3:31 PM
>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/12/07 5:46 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm working on a MPI application where I'm using OpenMPI instead of MPICH.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In my "master" program I call the function MPI::Intracomm::Spawn which
>>>>>>>>>>>>>>>> spawns "slave" processes. It is not clear for me how to spawn the "slave"
>>>>>>>>>>>>>>>> processes over the network. Currently "master" creates "slaves" on the
>>>>>>>>>>>>>>>> same host.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I use 'mpirun --hostfile openmpi.hosts' then processes are spawn over
>>>>>>>>>>>>>>>> the network as expected. But now I need to spawn processes over the
>>>>>>>>>>>>>>>> network from my own executable using MPI::Intracomm::Spawn, how can I
>>>>>>>>>>>>>>>> achieve it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure from your description exactly what you are trying to do, nor
>>>>>>>>>>>>>>> in what environment this is all operating within or what version of Open
>>>>>>>>>>>>>>> MPI you are using. Setting aside the environment and version issue, I'm
>>>>>>>>>>>>>>> guessing that you are running your executable over some specified set of
>>>>>>>>>>>>>>> hosts, but want to provide a different hostfile that specifies the hosts
>>>>>>>>>>>>>>> to be used for the "slave" processes. Correct?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If that is correct, then I'm afraid you can't do that in any version of
>>>>>>>>>>>>>>> Open MPI today. You have to specify all of the hosts that can be used by
>>>>>>>>>>>>>>> your job in the original hostfile. You can then specify a subset of those
>>>>>>>>>>>>>>> hosts to be used by your original "master" program, and then specify a
>>>>>>>>>>>>>>> different subset to be used by the "slaves" when calling Spawn.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But the system requires that you tell it -all- of the hosts that are going
>>>>>>>>>>>>>>> to be used at the beginning of the job.
>>>>>>>>>>>>>>> At the moment, there is no plan to remove that requirement, though there
>>>>>>>>>>>>>>> has been occasional discussion about doing so at some point in the future.
>>>>>>>>>>>>>>> No promises that it will happen, though - managed environments, in
>>>>>>>>>>>>>>> particular, currently object to the idea of changing the allocation
>>>>>>>>>>>>>>> on-the-fly. We may, though, make a provision for purely hostfile-based
>>>>>>>>>>>>>>> environments (i.e., unmanaged) at some time in the future.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance for any help.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Elena