Yeah, I'm seeing the hang as well when running across multiple machines. Let me 
dig a little and get this fixed.

Thanks
Ralph

On Aug 28, 2012, at 4:51 PM, Brian Budge <brian.bu...@gmail.com> wrote:

> Hmmm, I went to the Open MPI build directories on my two machines,
> went into the orte/test/mpi directory, and made the executables on both
> machines.  I set the hostfile env variable to point at my hostsfile on
> the "master" machine.
> 
> Here's the output:
> 
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> ./simple_spawn
> Parent [pid 97504] starting up!
> 0 completed MPI_Init
> Parent [pid 97504] about to spawn!
> Parent [pid 97507] starting up!
> Parent [pid 97508] starting up!
> Parent [pid 30626] starting up!
> ^C
> zsh: interrupt  OMPI_MCA_orte_default_hostfile= ./simple_spawn
> 
> I had to ^C to kill the hung process.
> 
> When I run using mpirun:
> 
> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
> mpirun -np 1 ./simple_spawn
> Parent [pid 97511] starting up!
> 0 completed MPI_Init
> Parent [pid 97511] about to spawn!
> Parent [pid 97513] starting up!
> Parent [pid 30762] starting up!
> Parent [pid 30764] starting up!
> Parent done with spawn
> Parent sending message to child
> 1 completed MPI_Init
> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
> 0 completed MPI_Init
> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
> 2 completed MPI_Init
> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
> Child 1 disconnected
> Child 0 received msg: 38
> Child 0 disconnected
> Parent disconnected
> Child 2 disconnected
> 97511: exiting
> 97513: exiting
> 30762: exiting
> 30764: exiting
> 
> As you can see, I'm using Open MPI v1.6.1.  I just did a fresh install
> on both machines using the default configure options.
> 
> Thanks for all your help.
> 
>  Brian
> 
> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Looks to me like it didn't find your executable - could be a question of 
>> where it exists relative to where you are running. If you look in your OMPI 
>> source tree at the orte/test/mpi directory, you'll see an example program 
>> "simple_spawn.c" there. Just "make simple_spawn" and execute that with your 
>> default hostfile set - does it work okay?
>> 
>> It works fine for me, hence the question.
>> 
>> Also, what OMPI version are you using?
>> 
>> On Aug 28, 2012, at 4:25 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>> 
>>> I see.  Okay.  So, I just tried removing the check for universe size,
>>> and set the universe size to 2.  Here's my output:
>>> 
>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file
>>> base/plm_base_receive.c at line 253
>>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified
>>> application failed to start in file dpm_orte.c at line 785
>>> 
>>> The corresponding run with mpirun still works.
>>> 
>>> Thanks,
>>> Brian
>>> 
>>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> I see the issue - it's here:
>>>> 
>>>>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>> 
>>>>> if(!flag) {
>>>>>     std::cerr << "no universe size" << std::endl;
>>>>>     return -1;
>>>>> }
>>>>> universeSize = *puniverseSize;
>>>>> if(universeSize == 1) {
>>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>> }
>>>> 
>>>> The universe size is set to 1 on a singleton because the attribute is set
>>>> at startup - there is no way to go back and change it later. The sequence
>>>> of events explains why. The singleton starts up and sets its attributes,
>>>> including universe_size. It also spins off an orte daemon to act as its
>>>> own private "mpirun" in case you call comm_spawn. At this point, however,
>>>> no hostfile has been read - the singleton is just an MPI proc doing its
>>>> own thing, and the orte daemon is just sitting there on standby.
>>>> 
>>>> When your app calls comm_spawn, then the orte daemon gets called to launch 
>>>> the new procs. At that time, it (not the original singleton!) reads the 
>>>> hostfile to find out how many nodes are around, and then does the launch.
>>>> 
>>>> You are trying to check the number of nodes from within the singleton, 
>>>> which won't work - it has no way of discovering that info.
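>>>> 
>>>> For illustration only - a minimal sketch (the child count of 3 is made
>>>> up, and this assumes it sits in the master's main() after MPI_Init) of
>>>> a spawn that does not gate on MPI_UNIVERSE_SIZE; the intercommunicator
>>>> is queried afterwards to see how many children actually started:
>>>> 
>>>>   int nChildren = 3;  // hypothetical count, not taken from the universe size
>>>>   MPI_Comm children;
>>>> 
>>>>   // The orte daemon reads the default hostfile here, at spawn time,
>>>>   // and maps the children across whatever nodes it finds.
>>>>   MPI_Comm_spawn("./slave_exe", MPI_ARGV_NULL, nChildren, MPI_INFO_NULL,
>>>>                  0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
>>>> 
>>>>   // Ask the intercommunicator how many children were really launched.
>>>>   int nSpawned = 0;
>>>>   MPI_Comm_remote_size(children, &nSpawned);
>>>>   std::cerr << "spawned " << nSpawned << " children" << std::endl;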
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Aug 28, 2012, at 2:38 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>> 
>>>>>>> cat hostsfile
>>>>> localhost
>>>>> budgeb-sandybridge
>>>>> 
>>>>> Thanks,
>>>>> Brian
>>>>> 
>>>>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Hmmm...what is in your "hostsfile"?
>>>>>> 
>>>>>> On Aug 28, 2012, at 2:33 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Ralph -
>>>>>>> 
>>>>>>> Thanks for confirming this is possible.  I'm trying it and currently
>>>>>>> failing.  Perhaps there's something I'm missing in the code to make
>>>>>>> this work.  Here are the two invocations and their outputs:
>>>>>>> 
>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib 
>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>> cannot start slaves... not enough nodes
>>>>>>> 
>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib 
>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe
>>>>>>> master spawned 1 slaves...
>>>>>>> slave responding...
>>>>>>> 
>>>>>>> 
>>>>>>> The code:
>>>>>>> 
>>>>>>> // master.cpp
>>>>>>> #include <mpi.h>
>>>>>>> #include <boost/filesystem.hpp>
>>>>>>> #include <iostream>
>>>>>>> #include <string>
>>>>>>> #include <cstring>   // memcpy
>>>>>>> #include <alloca.h>  // alloca
>>>>>>> 
>>>>>>> int main(int argc, char **args) {
>>>>>>>   int worldSize, universeSize, *puniverseSize, flag;
>>>>>>>   MPI_Comm everyone; // intercommunicator to the spawned slaves
>>>>>>> 
>>>>>>>   boost::filesystem::path curPath =
>>>>>>>       boost::filesystem::absolute(boost::filesystem::current_path());
>>>>>>>   std::string toRun = (curPath / "slave_exe").string();
>>>>>>> 
>>>>>>>   int ret = MPI_Init(&argc, &args);
>>>>>>>   if(ret != MPI_SUCCESS) {
>>>>>>>     std::cerr << "failed init" << std::endl;
>>>>>>>     return -1;
>>>>>>>   }
>>>>>>> 
>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
>>>>>>>   if(worldSize != 1) {
>>>>>>>     std::cerr << "too many masters" << std::endl;
>>>>>>>   }
>>>>>>> 
>>>>>>>   MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>   if(!flag) {
>>>>>>>     std::cerr << "no universe size" << std::endl;
>>>>>>>     return -1;
>>>>>>>   }
>>>>>>>   universeSize = *puniverseSize;
>>>>>>>   if(universeSize == 1) {
>>>>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>   }
>>>>>>> 
>>>>>>>   // Copy the slave path into a mutable, NUL-terminated buffer.
>>>>>>>   char *buf = (char*)alloca(toRun.size() + 1);
>>>>>>>   memcpy(buf, toRun.c_str(), toRun.size());
>>>>>>>   buf[toRun.size()] = '\0';
>>>>>>> 
>>>>>>>   MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize - 1, MPI_INFO_NULL,
>>>>>>>                  0, MPI_COMM_SELF, &everyone, MPI_ERRCODES_IGNORE);
>>>>>>> 
>>>>>>>   std::cerr << "master spawned " << universeSize - 1 << " slaves..."
>>>>>>>             << std::endl;
>>>>>>> 
>>>>>>>   MPI_Finalize();
>>>>>>>   return 0;
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> // slave.cpp
>>>>>>> #include <mpi.h>
>>>>>>> #include <iostream>  // needed for std::cerr
>>>>>>> 
>>>>>>> int main(int argc, char **args) {
>>>>>>>   int size;
>>>>>>>   MPI_Comm parent;
>>>>>>>   MPI_Init(&argc, &args);
>>>>>>> 
>>>>>>>   MPI_Comm_get_parent(&parent);
>>>>>>>   if(parent == MPI_COMM_NULL) {
>>>>>>>     std::cerr << "slave has no parent" << std::endl;
>>>>>>>   }
>>>>>>> 
>>>>>>>   MPI_Comm_remote_size(parent, &size);
>>>>>>>   if(size != 1) {
>>>>>>>     std::cerr << "parent size is " << size << std::endl;
>>>>>>>   }
>>>>>>> 
>>>>>>>   std::cerr << "slave responding..." << std::endl;
>>>>>>> 
>>>>>>>   MPI_Finalize();
>>>>>>>   return 0;
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> Any ideas?  Thanks for any help.
>>>>>>> 
>>>>>>> Brian
>>>>>>> 
>>>>>>> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain <r...@open-mpi.org> 
>>>>>>> wrote:
>>>>>>>> It really is just that simple :-)
>>>>>>>> 
>>>>>>>> On Aug 22, 2012, at 8:56 AM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Okay.  Is there a tutorial or FAQ for setting everything up?  Or is it
>>>>>>>>> really just that simple?  I don't need to run a copy of the orte
>>>>>>>>> server somewhere?
>>>>>>>>> 
>>>>>>>>> if my current ip is 192.168.0.1,
>>>>>>>>> 
>>>>>>>>> 0 > echo 192.168.0.11 > /tmp/hostfile
>>>>>>>>> 1 > echo 192.168.0.12 >> /tmp/hostfile
>>>>>>>>> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
>>>>>>>>> 3 > ./mySpawningExe
>>>>>>>>> 
>>>>>>>>> At this point, mySpawningExe will be the master, running on
>>>>>>>>> 192.168.0.1, and I can have spawned, for example, childExe on
>>>>>>>>> 192.168.0.11 and 192.168.0.12?  Or childExe1 on 192.168.0.11 and
>>>>>>>>> childExe2 on 192.168.0.12?
>>>>>>>>> 
>>>>>>>>> Thanks for the help.
>>>>>>>>> 
>>>>>>>>> Brian
>>>>>>>>> 
>>>>>>>>> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain <r...@open-mpi.org> 
>>>>>>>>> wrote:
>>>>>>>>>> Sure, that's still true on all 1.3 or above releases. All you need 
>>>>>>>>>> to do is set the hostfile envar so we pick it up:
>>>>>>>>>> 
>>>>>>>>>> OMPI_MCA_orte_default_hostfile=<foo>
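>>>>>>>>>> 
>>>>>>>>>> As an illustrative sketch only (the hostfile path below is a
>>>>>>>>>> placeholder): the variable can also be exported from inside the
>>>>>>>>>> master itself, as long as it is set before MPI_Init so the runtime -
>>>>>>>>>> and the orte daemon it forks for comm_spawn - inherits it:
>>>>>>>>>> 
>>>>>>>>>>   #include <stdlib.h>  // POSIX setenv
>>>>>>>>>> 
>>>>>>>>>>   // Hypothetical hostfile location; adjust to your setup.
>>>>>>>>>>   setenv("OMPI_MCA_orte_default_hostfile", "/tmp/hostfile", 1);
>>>>>>>>>>   int ret = MPI_Init(&argc, &args);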
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 21, 2012, at 7:23 PM, Brian Budge <brian.bu...@gmail.com> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi.  I know this is an old thread, but I'm curious whether there are
>>>>>>>>>>> any tutorials describing how to set this up.  Is this still available
>>>>>>>>>>> in newer Open MPI versions?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Brian
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain <r...@lanl.gov> wrote:
>>>>>>>>>>>> Hi Elena
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm copying this to the user list just to correct a misstatement on
>>>>>>>>>>>> my part in an earlier message that went there. I had stated that a
>>>>>>>>>>>> singleton could comm_spawn onto other nodes listed in a hostfile by
>>>>>>>>>>>> setting an environment variable that pointed us to the hostfile.
>>>>>>>>>>>> 
>>>>>>>>>>>> This is incorrect in the 1.2 code series. That series does not allow
>>>>>>>>>>>> singletons to read a hostfile at all. Hence, any comm_spawn done by a
>>>>>>>>>>>> singleton can only launch child processes on the singleton's local
>>>>>>>>>>>> host.
>>>>>>>>>>>> 
>>>>>>>>>>>> This situation has been corrected for the upcoming 1.3 code series.
>>>>>>>>>>>> For the 1.2 series, though, you will have to do it via an mpirun
>>>>>>>>>>>> command line.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry for the confusion - I sometimes have too many code families to
>>>>>>>>>>>> keep straight in this old mind!
>>>>>>>>>>>> 
>>>>>>>>>>>> Ralph
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 1/4/08 5:10 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you very much for the explanations.
>>>>>>>>>>>>> But I still do not get it running...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For the case
>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host 
>>>>>>>>>>>>> my_master.exe
>>>>>>>>>>>>> everything works.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For the case
>>>>>>>>>>>>> ./my_master.exe
>>>>>>>>>>>>> it does not.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I did the following:
>>>>>>>>>>>>> - created my_hostfile and put it in $HOME/.openmpi/components/ ;
>>>>>>>>>>>>> my_hostfile contains:
>>>>>>>>>>>>> bollenstreek slots=2 max_slots=3
>>>>>>>>>>>>> octocore01 slots=8  max_slots=8
>>>>>>>>>>>>> octocore02 slots=8  max_slots=8
>>>>>>>>>>>>> clstr000 slots=2 max_slots=3
>>>>>>>>>>>>> clstr001 slots=2 max_slots=3
>>>>>>>>>>>>> clstr002 slots=2 max_slots=3
>>>>>>>>>>>>> clstr003 slots=2 max_slots=3
>>>>>>>>>>>>> clstr004 slots=2 max_slots=3
>>>>>>>>>>>>> clstr005 slots=2 max_slots=3
>>>>>>>>>>>>> clstr006 slots=2 max_slots=3
>>>>>>>>>>>>> clstr007 slots=2 max_slots=3
>>>>>>>>>>>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I put it in
>>>>>>>>>>>>> .tcshrc and then sourced .tcshrc)
>>>>>>>>>>>>> - in my_master.cpp I did
>>>>>>>>>>>>> MPI_Info info1;
>>>>>>>>>>>>> MPI_Info_create(&info1);
>>>>>>>>>>>>> char* hostname =
>>>>>>>>>>>>> "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>>>>>>>>>>> MPI_Info_set(info1, "host", hostname);
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 
>>>>>>>>>>>>> 0,
>>>>>>>>>>>>> MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - After running the executable, I got this error message:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> bollenstreek: > ./my_master
>>>>>>>>>>>>> number of processes to run: 1
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> Some of the requested hosts are not included in the current 
>>>>>>>>>>>>> allocation for
>>>>>>>>>>>>> the application:
>>>>>>>>>>>>> ./childexe
>>>>>>>>>>>>> The requested hosts were:
>>>>>>>>>>>>> clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Verify that you have mapped the allocated resources properly 
>>>>>>>>>>>>> using the
>>>>>>>>>>>>> --host specification.
>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in 
>>>>>>>>>>>>> file
>>>>>>>>>>>>> base/rmaps_base_support_fns.c at line 225
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in 
>>>>>>>>>>>>> file
>>>>>>>>>>>>> rmaps_rr.c at line 478
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in 
>>>>>>>>>>>>> file
>>>>>>>>>>>>> base/rmaps_base_map_job.c at line 210
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in 
>>>>>>>>>>>>> file
>>>>>>>>>>>>> rmgr_urm.c at line 372
>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in 
>>>>>>>>>>>>> file
>>>>>>>>>>>>> communicator/comm_dyn.c at line 608
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Did I miss something?
>>>>>>>>>>>>> Thanks for help!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Elena
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>>>>>>>>>>>> To: Elena Zhebel; Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster 
>>>>>>>>>>>>> configuration
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 12/18/07 7:35 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks a lot! Now it works!
>>>>>>>>>>>>>> The solution is to use "mpirun -n 1 -hostfile my.hosts *.exe" and
>>>>>>>>>>>>>> pass the MPI_Info key to the Spawn function!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> One more question: is it necessary to start my "master" program 
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host 
>>>>>>>>>>>>>> my_master.exe ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> No, it isn't necessary - assuming that my_master_host is the 
>>>>>>>>>>>>> first host
>>>>>>>>>>>>> listed in your hostfile! If you are only executing one 
>>>>>>>>>>>>> my_master.exe (i.e.,
>>>>>>>>>>>>> you gave -n 1 to mpirun), then we will automatically map that 
>>>>>>>>>>>>> process onto
>>>>>>>>>>>>> the first host in your hostfile.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If you want my_master.exe to go on someone other than the first 
>>>>>>>>>>>>> host in the
>>>>>>>>>>>>> file, then you have to give us the -host option.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Are there other possibilities for an easier start?
>>>>>>>>>>>>>> I would prefer to just run ./my_master.exe, but then the master
>>>>>>>>>>>>>> process doesn't know about the hosts available in the network.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can set the hostfile parameter in your environment instead of 
>>>>>>>>>>>>> on the
>>>>>>>>>>>>> command line. Just set OMPI_MCA_rds_hostfile_path = my.hosts.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can then just run ./my_master.exe on the host where you want 
>>>>>>>>>>>>> the master
>>>>>>>>>>>>> to reside - everything should work the same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Just as an FYI: the name of that environment variable is going to
>>>>>>>>>>>>> change in the 1.3 release, but everything will still work the same.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 5:49 PM
>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>; Elena Zhebel
>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster 
>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 12/17/07 8:19 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you for your answer.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm using OpenMPI 1.2.3. , compiler glibc232, Linux Suse 10.0.
>>>>>>>>>>>>>>> My "master" executable runs only on the local host; it then spawns
>>>>>>>>>>>>>>> "slaves" (with MPI::Intracomm::Spawn).
>>>>>>>>>>>>>>> My question was: how do I determine the hosts where these "slaves"
>>>>>>>>>>>>>>> will be spawned?
>>>>>>>>>>>>>>> You said: "You have to specify all of the hosts that can be used by
>>>>>>>>>>>>>>> your job in the original hostfile". How can I specify the hostfile?
>>>>>>>>>>>>>>> I cannot find it in the documentation.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hmmm...sorry about the lack of documentation. I always assumed 
>>>>>>>>>>>>>> that the MPI
>>>>>>>>>>>>>> folks in the project would document such things since it has 
>>>>>>>>>>>>>> little to do
>>>>>>>>>>>>>> with the underlying run-time, but I guess that fell through the 
>>>>>>>>>>>>>> cracks.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There are two parts to your question:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. how to specify the hosts to be used for the entire job. I believe
>>>>>>>>>>>>>> that is somewhat covered here:
>>>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That FAQ tells you what a hostfile should look like, though you 
>>>>>>>>>>>>>> may already
>>>>>>>>>>>>>> know that. Basically, we require that you list -all- of the 
>>>>>>>>>>>>>> nodes that both
>>>>>>>>>>>>>> your master and slave programs will use.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. how to specify which nodes are available for the master, and 
>>>>>>>>>>>>>> which for
>>>>>>>>>>>>>> the slave.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You would specify the host for your master on the mpirun command 
>>>>>>>>>>>>>> line with
>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host 
>>>>>>>>>>>>>> my_master.exe
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This directs Open MPI to map that specified executable on the
>>>>>>>>>>>>>> specified host - note that my_master_host must have been in
>>>>>>>>>>>>>> my_hostfile.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Inside your master, you would create an MPI_Info key "host" that has
>>>>>>>>>>>>>> a value consisting of a string "host1,host2,host3" identifying the
>>>>>>>>>>>>>> hosts you want your slave to execute upon. Those hosts must have
>>>>>>>>>>>>>> been included in my_hostfile. Include that key in the MPI_Info array
>>>>>>>>>>>>>> passed to your Spawn.
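>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Something along these lines, as a sketch only (the host names, the
>>>>>>>>>>>>>> slave executable name, and the count of 3 are placeholders, and all
>>>>>>>>>>>>>> of the hosts must also appear in my_hostfile):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   MPI_Info info;
>>>>>>>>>>>>>>   MPI_Info_create(&info);
>>>>>>>>>>>>>>   // Comma-separated list of hosts the slaves may run on.
>>>>>>>>>>>>>>   MPI_Info_set(info, "host", "host1,host2,host3");
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   MPI_Comm slaves;
>>>>>>>>>>>>>>   MPI_Comm_spawn("./my_slave.exe", MPI_ARGV_NULL, 3, info,
>>>>>>>>>>>>>>                  0, MPI_COMM_SELF, &slaves, MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>>   MPI_Info_free(&info);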
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> We don't currently support providing a hostfile for the slaves 
>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>> to the host-at-a-time string above). This may become available 
>>>>>>>>>>>>>> in a future
>>>>>>>>>>>>>> release - TBD.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: users-boun...@open-mpi.org 
>>>>>>>>>>>>>>> [mailto:users-boun...@open-mpi.org] On
>>>>>>>>>>>>>>> Behalf Of Ralph H Castain
>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 3:31 PM
>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster
>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 12/12/07 5:46 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm working on an MPI application where I'm using Open MPI instead
>>>>>>>>>>>>>>>> of MPICH.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In my "master" program I call the function MPI::Intracomm::Spawn,
>>>>>>>>>>>>>>>> which spawns "slave" processes. It is not clear to me how to spawn
>>>>>>>>>>>>>>>> the "slave" processes over the network. Currently the "master"
>>>>>>>>>>>>>>>> creates the "slaves" on the same host.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If I use 'mpirun --hostfile openmpi.hosts', then processes are
>>>>>>>>>>>>>>>> spawned over the network as expected. But now I need to spawn
>>>>>>>>>>>>>>>> processes over the network from my own executable using
>>>>>>>>>>>>>>>> MPI::Intracomm::Spawn. How can I achieve this?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm not sure from your description exactly what you are trying to
>>>>>>>>>>>>>>> do, nor what environment this is all operating in or what version
>>>>>>>>>>>>>>> of Open MPI you are using. Setting aside the environment and
>>>>>>>>>>>>>>> version issue, I'm guessing that you are running your executable
>>>>>>>>>>>>>>> over some specified set of hosts, but want to provide a different
>>>>>>>>>>>>>>> hostfile that specifies the hosts to be used for the "slave"
>>>>>>>>>>>>>>> processes. Correct?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If that is correct, then I'm afraid you can't do that in any
>>>>>>>>>>>>>>> version of Open MPI today. You have to specify all of the hosts
>>>>>>>>>>>>>>> that can be used by your job in the original hostfile. You can then
>>>>>>>>>>>>>>> specify a subset of those hosts to be used by your original
>>>>>>>>>>>>>>> "master" program, and then specify a different subset to be used by
>>>>>>>>>>>>>>> the "slaves" when calling Spawn.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> But the system requires that you tell it -all- of the hosts that
>>>>>>>>>>>>>>> are going to be used at the beginning of the job.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> At the moment, there is no plan to remove that requirement, though
>>>>>>>>>>>>>>> there has been occasional discussion about doing so at some point
>>>>>>>>>>>>>>> in the future. No promises that it will happen, though - managed
>>>>>>>>>>>>>>> environments, in particular, currently object to the idea of
>>>>>>>>>>>>>>> changing the allocation on-the-fly. We may, though, make a
>>>>>>>>>>>>>>> provision for purely hostfile-based environments (i.e., unmanaged)
>>>>>>>>>>>>>>> at some time in the future.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks in advance for any help.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 