Thanks!
On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Yeah, I'm seeing the hang as well when running across multiple machines. Let
> me dig a little and get this fixed.
>
> Thanks
> Ralph
>
> On Aug 28, 2012, at 4:51 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>
>> Hmmm, I went to the Open MPI build directories on my two machines,
>> went into the orte/test/mpi directory, and made the executables on
>> both. I set the hostsfile env variable on the "master" machine.
>>
>> Here's the output:
>>
>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>> ./simple_spawn
>> Parent [pid 97504] starting up!
>> 0 completed MPI_Init
>> Parent [pid 97504] about to spawn!
>> Parent [pid 97507] starting up!
>> Parent [pid 97508] starting up!
>> Parent [pid 30626] starting up!
>> ^C
>> zsh: interrupt OMPI_MCA_orte_default_hostfile= ./simple_spawn
>>
>> I had to ^C to kill the hung process.
>>
>> When I run using mpirun:
>>
>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile
>> mpirun -np 1 ./simple_spawn
>> Parent [pid 97511] starting up!
>> 0 completed MPI_Init
>> Parent [pid 97511] about to spawn!
>> Parent [pid 97513] starting up!
>> Parent [pid 30762] starting up!
>> Parent [pid 30764] starting up!
>> Parent done with spawn
>> Parent sending message to child
>> 1 completed MPI_Init
>> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
>> 0 completed MPI_Init
>> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
>> 2 completed MPI_Init
>> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
>> Child 1 disconnected
>> Child 0 received msg: 38
>> Child 0 disconnected
>> Parent disconnected
>> Child 2 disconnected
>> 97511: exiting
>> 97513: exiting
>> 30762: exiting
>> 30764: exiting
>>
>> As you can see, I'm using Open MPI v1.6.1, freshly installed on both
>> machines using the default configure options.
>>
>> Thanks for all your help.
>>
>> Brian
>>
>> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Looks to me like it didn't find your executable - could be a question of
>>> where it exists relative to where you are running. If you look in your OMPI
>>> source tree at the orte/test/mpi directory, you'll see an example program
>>> "simple_spawn.c" there. Just "make simple_spawn" and execute that with your
>>> default hostfile set - does it work okay?
>>>
>>> It works fine for me, hence the question.
>>>
>>> Also, what OMPI version are you using?
>>>
>>> On Aug 28, 2012, at 4:25 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>
>>>> I see. Okay. So I just tried removing the universe-size check and
>>>> setting the universe size to 2. Here's my output:
>>>>
>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file
>>>> base/plm_base_receive.c at line 253
>>>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified
>>>> application failed to start in file dpm_orte.c at line 785
>>>>
>>>> The corresponding run with mpirun still works.
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> I see the issue - it's here:
>>>>>
>>>>>>   MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>
>>>>>>   if(!flag) {
>>>>>>     std::cerr << "no universe size" << std::endl;
>>>>>>     return -1;
>>>>>>   }
>>>>>>   universeSize = *puniverseSize;
>>>>>>   if(universeSize == 1) {
>>>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>   }
>>>>>
>>>>> The universe size is set to 1 on a singleton because the attribute gets
>>>>> set at the beginning of time - we haven't any way to go back and change
>>>>> it. The sequence of events explains why. The singleton starts up and sets
>>>>> its attributes, including universe_size. It also spins off an orte daemon
>>>>> to act as its own private "mpirun" in case you call comm_spawn. At this
>>>>> point, however, no hostfile has been read - the singleton is just an MPI
>>>>> proc doing its own thing, and the orte daemon is just sitting there on
>>>>> "stand-by".
>>>>>
>>>>> When your app calls comm_spawn, then the orte daemon gets called to
>>>>> launch the new procs. At that time, it (not the original singleton!)
>>>>> reads the hostfile to find out how many nodes are around, and then does
>>>>> the launch.
>>>>>
>>>>> You are trying to check the number of nodes from within the singleton,
>>>>> which won't work - it has no way of discovering that info.
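>>>>>
>>>>> In other words, rather than deriving the slave count from
>>>>> MPI_UNIVERSE_SIZE, a singleton has to be told how many slaves to start.
>>>>> A minimal sketch of that idea (not your code - NUM_SLAVES is a made-up
>>>>> environment variable you would set yourself):
>>>>>
>>>>>   #include <mpi.h>
>>>>>   #include <cstdlib>
>>>>>   #include <iostream>
>>>>>
>>>>>   int main(int argc, char **argv) {
>>>>>     MPI_Init(&argc, &argv);
>>>>>
>>>>>     // decide the count yourself; the hostfile is only consulted by the
>>>>>     // orte daemon when the spawn actually happens
>>>>>     const char *env = std::getenv("NUM_SLAVES");
>>>>>     int nSlaves = env ? std::atoi(env) : 1;
>>>>>
>>>>>     MPI_Comm children;
>>>>>     MPI_Comm_spawn(const_cast<char*>("./slave_exe"), MPI_ARGV_NULL,
>>>>>                    nSlaves, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>>>                    &children, MPI_ERRCODES_IGNORE);
>>>>>
>>>>>     std::cerr << "spawned " << nSlaves << " slaves" << std::endl;
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>>   }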
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Aug 28, 2012, at 2:38 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>
>>>>>>> echo hostsfile
>>>>>> localhost
>>>>>> budgeb-sandybridge
>>>>>>
>>>>>> Thanks,
>>>>>> Brian
>>>>>>
>>>>>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Hmmm...what is in your "hostsfile"?
>>>>>>>
>>>>>>> On Aug 28, 2012, at 2:33 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Ralph -
>>>>>>>>
>>>>>>>> Thanks for confirming this is possible. I'm trying this and currently
>>>>>>>> failing. Perhaps there's something I'm missing in the code to make
>>>>>>>> this work. Here are the two instantiations and their outputs:
>>>>>>>>
>>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>>> cannot start slaves... not enough nodes
>>>>>>>>
>>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib
>>>>>>>>> OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1
>>>>>>>>> ./master_exe
>>>>>>>> master spawned 1 slaves...
>>>>>>>> slave responding...
>>>>>>>>
>>>>>>>>
>>>>>>>> The code:
>>>>>>>>
>>>>>>>> //master.cpp
>>>>>>>> #include <mpi.h>
>>>>>>>> #include <boost/filesystem.hpp>
>>>>>>>> #include <iostream>
>>>>>>>> #include <string>
>>>>>>>> #include <cstring>   // memcpy
>>>>>>>> #include <alloca.h>  // alloca
>>>>>>>>
>>>>>>>> int main(int argc, char **args) {
>>>>>>>>   int worldSize, universeSize, *puniverseSize, flag;
>>>>>>>>
>>>>>>>>   MPI_Comm everyone; // intercomm to the spawned slaves
>>>>>>>>   boost::filesystem::path curPath =
>>>>>>>>       boost::filesystem::absolute(boost::filesystem::current_path());
>>>>>>>>
>>>>>>>>   std::string toRun = (curPath / "slave_exe").string();
>>>>>>>>
>>>>>>>>   int ret = MPI_Init(&argc, &args);
>>>>>>>>   if(ret != MPI_SUCCESS) {
>>>>>>>>     std::cerr << "failed init" << std::endl;
>>>>>>>>     return -1;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
>>>>>>>>   if(worldSize != 1) {
>>>>>>>>     std::cerr << "too many masters" << std::endl;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>>   if(!flag) {
>>>>>>>>     std::cerr << "no universe size" << std::endl;
>>>>>>>>     return -1;
>>>>>>>>   }
>>>>>>>>   universeSize = *puniverseSize;
>>>>>>>>   if(universeSize == 1) {
>>>>>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   // MPI_Comm_spawn wants a mutable char*, so copy the path
>>>>>>>>   char *buf = (char*)alloca(toRun.size() + 1);
>>>>>>>>   memcpy(buf, toRun.c_str(), toRun.size());
>>>>>>>>   buf[toRun.size()] = '\0';
>>>>>>>>
>>>>>>>>   MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL,
>>>>>>>>                  0, MPI_COMM_SELF, &everyone, MPI_ERRCODES_IGNORE);
>>>>>>>>
>>>>>>>>   std::cerr << "master spawned " << universeSize-1 << " slaves..."
>>>>>>>>             << std::endl;
>>>>>>>>
>>>>>>>>   MPI_Finalize();
>>>>>>>>   return 0;
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> //slave.cpp
>>>>>>>> #include <mpi.h>
>>>>>>>> #include <iostream>  // std::cerr
>>>>>>>>
>>>>>>>> int main(int argc, char **args) {
>>>>>>>>   int size;
>>>>>>>>   MPI_Comm parent;
>>>>>>>>   MPI_Init(&argc, &args);
>>>>>>>>
>>>>>>>>   MPI_Comm_get_parent(&parent);
>>>>>>>>   if(parent == MPI_COMM_NULL) {
>>>>>>>>     std::cerr << "slave has no parent" << std::endl;
>>>>>>>>   }
>>>>>>>>   MPI_Comm_remote_size(parent, &size);
>>>>>>>>   if(size != 1) {
>>>>>>>>     std::cerr << "parent size is " << size << std::endl;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   std::cerr << "slave responding..." << std::endl;
>>>>>>>>
>>>>>>>>   MPI_Finalize();
>>>>>>>>   return 0;
>>>>>>>> }
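>>>>>>>>
>>>>>>>> (Not part of my code above - just a sketch of how a parent/child
>>>>>>>> exchange over the intercomm could look, similar to simple_spawn's
>>>>>>>> "Parent sending message to child" / "Child 0 received msg" output.
>>>>>>>> The tag and payload are arbitrary.)
>>>>>>>>
>>>>>>>>   // parent side, after MPI_Comm_spawn(..., &everyone, ...):
>>>>>>>>   int msg = 38;
>>>>>>>>   MPI_Send(&msg, 1, MPI_INT, 0 /*child rank*/, 1 /*tag*/, everyone);
>>>>>>>>   MPI_Comm_disconnect(&everyone);  // the children must also disconnect
>>>>>>>>
>>>>>>>>   // child side, after MPI_Comm_get_parent(&parent):
>>>>>>>>   int rank, msg = 0;
>>>>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>   if(rank == 0)
>>>>>>>>     MPI_Recv(&msg, 1, MPI_INT, 0, 1, parent, MPI_STATUS_IGNORE);
>>>>>>>>   MPI_Comm_disconnect(&parent);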
>>>>>>>>
>>>>>>>>
>>>>>>>> Any ideas? Thanks for any help.
>>>>>>>>
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>> wrote:
>>>>>>>>> It really is just that simple :-)
>>>>>>>>>
>>>>>>>>> On Aug 22, 2012, at 8:56 AM, Brian Budge <brian.bu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Okay. Is there a tutorial or FAQ for setting everything up? Or is
>>>>>>>>>> it
>>>>>>>>>> really just that simple? I don't need to run a copy of the orte
>>>>>>>>>> server somewhere?
>>>>>>>>>>
>>>>>>>>>> if my current ip is 192.168.0.1,
>>>>>>>>>>
>>>>>>>>>> 0 > echo 192.168.0.11 > /tmp/hostfile
>>>>>>>>>> 1 > echo 192.168.0.12 >> /tmp/hostfile
>>>>>>>>>> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
>>>>>>>>>> 3 > ./mySpawningExe
>>>>>>>>>>
>>>>>>>>>> At this point, mySpawningExe will be the master, running on
>>>>>>>>>> 192.168.0.1, and I can have spawned, for example, childExe on
>>>>>>>>>> 192.168.0.11 and 192.168.0.12? Or childExe1 on 192.168.0.11 and
>>>>>>>>>> childExe2 on 192.168.0.12?
>>>>>>>>>>
>>>>>>>>>> Thanks for the help.
>>>>>>>>>>
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>>> wrote:
>>>>>>>>>>> Sure, that's still true on all 1.3 or above releases. All you need
>>>>>>>>>>> to do is set the hostfile envar so we pick it up:
>>>>>>>>>>>
>>>>>>>>>>> OMPI_MCA_orte_default_hostfile=<foo>
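>>>>>>>>>>>
>>>>>>>>>>> If you'd rather not export it in the shell, a sketch that should be
>>>>>>>>>>> equivalent - assuming the OMPI_MCA_* value just needs to be in the
>>>>>>>>>>> process environment before MPI_Init, which is all the exported envar
>>>>>>>>>>> provides anyway - is to set it from the master itself:
>>>>>>>>>>>
>>>>>>>>>>>   #include <mpi.h>
>>>>>>>>>>>   #include <stdlib.h>  // setenv (POSIX)
>>>>>>>>>>>
>>>>>>>>>>>   int main(int argc, char **argv) {
>>>>>>>>>>>     // example path only - point this at your own hostfile
>>>>>>>>>>>     setenv("OMPI_MCA_orte_default_hostfile", "/tmp/hostfile", 1);
>>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>>     // ... MPI_Comm_spawn as usual ...
>>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>>     return 0;
>>>>>>>>>>>   }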
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Aug 21, 2012, at 7:23 PM, Brian Budge <brian.bu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi. I know this is an old thread, but I'm curious if there are any
>>>>>>>>>>>> tutorials describing how to set this up? Is this still available
>>>>>>>>>>>> on
>>>>>>>>>>>> newer open mpi versions?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain <r...@lanl.gov>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> Hi Elena
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm copying this to the user list just to correct a mis-statement
>>>>>>>>>>>>> on my part
>>>>>>>>>>>>> in an earlier message that went there. I had stated that a
>>>>>>>>>>>>> singleton could
>>>>>>>>>>>>> comm_spawn onto other nodes listed in a hostfile by setting an
>>>>>>>>>>>>> environmental
>>>>>>>>>>>>> variable that pointed us to the hostfile.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is incorrect in the 1.2 code series. That series does not
>>>>>>>>>>>>> allow
>>>>>>>>>>>>> singletons to read a hostfile at all. Hence, any comm_spawn done
>>>>>>>>>>>>> by a
>>>>>>>>>>>>> singleton can only launch child processes on the singleton's
>>>>>>>>>>>>> local host.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This situation has been corrected for the upcoming 1.3 code
>>>>>>>>>>>>> series. For the
>>>>>>>>>>>>> 1.2 series, though, you will have to do it via an mpirun command
>>>>>>>>>>>>> line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the confusion - I sometimes have too many code families
>>>>>>>>>>>>> to keep
>>>>>>>>>>>>> straight in this old mind!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 1/4/08 5:10 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you very much for the explanations.
>>>>>>>>>>>>>> But I still do not get it running...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the case
>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host
>>>>>>>>>>>>>> my_master.exe
>>>>>>>>>>>>>> everything works.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For the case
>>>>>>>>>>>>>> ./my_master.exe
>>>>>>>>>>>>>> it does not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I did:
>>>>>>>>>>>>>> - create my_hostfile and put it in the $HOME/.openmpi/components/
>>>>>>>>>>>>>> my_hostfile :
>>>>>>>>>>>>>> bollenstreek slots=2 max_slots=3
>>>>>>>>>>>>>> octocore01 slots=8 max_slots=8
>>>>>>>>>>>>>> octocore02 slots=8 max_slots=8
>>>>>>>>>>>>>> clstr000 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr001 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr002 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr003 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr004 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr005 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr006 slots=2 max_slots=3
>>>>>>>>>>>>>> clstr007 slots=2 max_slots=3
>>>>>>>>>>>>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I put it in
>>>>>>>>>>>>>> .tcshrc and
>>>>>>>>>>>>>> then source .tcshrc)
>>>>>>>>>>>>>> - in my_master.cpp I did:
>>>>>>>>>>>>>>   MPI_Info info1;
>>>>>>>>>>>>>>   MPI_Info_create(&info1);
>>>>>>>>>>>>>>   char* hostname =
>>>>>>>>>>>>>>     "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>>>>>>>>>>>>   MPI_Info_set(info1, "host", hostname);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
>>>>>>>>>>>>>>                                MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - After I ran the executable, I got this error message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> bollenstreek: > ./my_master
>>>>>>>>>>>>>> number of processes to run: 1
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> Some of the requested hosts are not included in the current
>>>>>>>>>>>>>> allocation for
>>>>>>>>>>>>>> the application:
>>>>>>>>>>>>>> ./childexe
>>>>>>>>>>>>>> The requested hosts were:
>>>>>>>>>>>>>> clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Verify that you have mapped the allocated resources properly
>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>> --host specification.
>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in
>>>>>>>>>>>>>> file
>>>>>>>>>>>>>> base/rmaps_base_support_fns.c at line 225
>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in
>>>>>>>>>>>>>> file
>>>>>>>>>>>>>> rmaps_rr.c at line 478
>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in
>>>>>>>>>>>>>> file
>>>>>>>>>>>>>> base/rmaps_base_map_job.c at line 210
>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in
>>>>>>>>>>>>>> file
>>>>>>>>>>>>>> rmgr_urm.c at line 372
>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in
>>>>>>>>>>>>>> file
>>>>>>>>>>>>>> communicator/comm_dyn.c at line 608
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did I miss something?
>>>>>>>>>>>>>> Thanks for help!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>>>>>>>>>>>>> To: Elena Zhebel; Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster
>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/18/07 7:35 AM, "Elena Zhebel" <ezhe...@fugro-jason.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot! Now it works!
>>>>>>>>>>>>>>> The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass
>>>>>>>>>>>>>>> an MPI_Info key to the Spawn function!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One more question: is it necessary to start my "master" program
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host
>>>>>>>>>>>>>>> my_master.exe ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, it isn't necessary - assuming that my_master_host is the
>>>>>>>>>>>>>> first host
>>>>>>>>>>>>>> listed in your hostfile! If you are only executing one
>>>>>>>>>>>>>> my_master.exe (i.e.,
>>>>>>>>>>>>>> you gave -n 1 to mpirun), then we will automatically map that
>>>>>>>>>>>>>> process onto
>>>>>>>>>>>>>> the first host in your hostfile.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you want my_master.exe to go on someone other than the first
>>>>>>>>>>>>>> host in the
>>>>>>>>>>>>>> file, then you have to give us the -host option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there other possibilities for an easy start?
>>>>>>>>>>>>>>> I would say just to run ./my_master.exe, but then the master process
>>>>>>>>>>>>>>> doesn't know about the hosts available in the network.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can set the hostfile parameter in your environment instead
>>>>>>>>>>>>>> of on the
>>>>>>>>>>>>>> command line. Just set OMPI_MCA_rds_hostfile_path = my.hosts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can then just run ./my_master.exe on the host where you want
>>>>>>>>>>>>>> the master
>>>>>>>>>>>>>> to reside - everything should work the same.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just as an FYI: the name of that environmental variable is going
>>>>>>>>>>>>>> to change
>>>>>>>>>>>>>> in the 1.3 release, but everything will still work the same.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 5:49 PM
>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>; Elena Zhebel
>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster
>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/17/07 8:19 AM, "Elena Zhebel" <ezhe...@fugro-jason.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for your answer.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm using OpenMPI 1.2.3, compiler glibc232, Linux SuSE 10.0.
>>>>>>>>>>>>>>>> My "master" executable runs on a single local host and then spawns
>>>>>>>>>>>>>>>> "slaves" (with MPI::Intracomm::Spawn).
>>>>>>>>>>>>>>>> My question was: how do I determine the hosts where these "slaves"
>>>>>>>>>>>>>>>> will be spawned?
>>>>>>>>>>>>>>>> You said: "You have to specify all of the hosts that can be used by
>>>>>>>>>>>>>>>> your job in the original hostfile". How can I specify the hostfile? I
>>>>>>>>>>>>>>>> cannot find it in the documentation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hmmm...sorry about the lack of documentation. I always assumed
>>>>>>>>>>>>>>> that the MPI
>>>>>>>>>>>>>>> folks in the project would document such things since it has
>>>>>>>>>>>>>>> little to do
>>>>>>>>>>>>>>> with the underlying run-time, but I guess that fell through the
>>>>>>>>>>>>>>> cracks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are two parts to your question:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. how to specify the hosts to be used for the entire job. I believe
>>>>>>>>>>>>>>> that is somewhat covered here:
>>>>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That FAQ tells you what a hostfile should look like, though you
>>>>>>>>>>>>>>> may already
>>>>>>>>>>>>>>> know that. Basically, we require that you list -all- of the
>>>>>>>>>>>>>>> nodes that both
>>>>>>>>>>>>>>> your master and slave programs will use.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. how to specify which nodes are available for the master, and
>>>>>>>>>>>>>>> which for
>>>>>>>>>>>>>>> the slave.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You would specify the host for your master on the mpirun
>>>>>>>>>>>>>>> command line with
>>>>>>>>>>>>>>> something like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host
>>>>>>>>>>>>>>> my_master.exe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This directs Open MPI to map that specified executable on the specified
>>>>>>>>>>>>>>> host - note that my_master_host must have been in my_hostfile.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Inside your master, you would create an MPI_Info key "host" that has a
>>>>>>>>>>>>>>> value consisting of a string "host1,host2,host3" identifying the hosts
>>>>>>>>>>>>>>> you want your slave to execute upon. Those hosts must have been included
>>>>>>>>>>>>>>> in my_hostfile. Include that key in the MPI_Info passed to your Spawn.
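>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For example, a minimal sketch using the C API inside your master -
>>>>>>>>>>>>>>> "host1,host2,host3" and "./my_slave.exe" are just placeholders, and
>>>>>>>>>>>>>>> the hosts must already appear in my_hostfile:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   char key[]   = "host";
>>>>>>>>>>>>>>>   char hosts[] = "host1,host2,host3";  // hosts the slaves may use
>>>>>>>>>>>>>>>   char cmd[]   = "./my_slave.exe";     // placeholder slave executable
>>>>>>>>>>>>>>>   MPI_Info info;
>>>>>>>>>>>>>>>   MPI_Comm slaves;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   MPI_Info_create(&info);
>>>>>>>>>>>>>>>   MPI_Info_set(info, key, hosts);
>>>>>>>>>>>>>>>   MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 3 /*slaves*/, info, 0,
>>>>>>>>>>>>>>>                  MPI_COMM_SELF, &slaves, MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>>>   MPI_Info_free(&info);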
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We don't currently support providing a hostfile for the slaves
>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>> to the host-at-a-time string above). This may become available
>>>>>>>>>>>>>>> in a future
>>>>>>>>>>>>>>> release - TBD.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: users-boun...@open-mpi.org
>>>>>>>>>>>>>>>> [mailto:users-boun...@open-mpi.org] On
>>>>>>>>>>>>>>>> Behalf Of Ralph H Castain
>>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 3:31 PM
>>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster
>>>>>>>>>>>>>>>> configuration
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 12/12/07 5:46 AM, "Elena Zhebel" <ezhe...@fugro-jason.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm working on a MPI application where I'm using OpenMPI
>>>>>>>>>>>>>>>>> instead of
>>>>>>>>>>>>>>>>> MPICH.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In my "master" program I call the function MPI::Intracomm::Spawn,
>>>>>>>>>>>>>>>>> which spawns "slave" processes. It is not clear to me how to spawn
>>>>>>>>>>>>>>>>> the "slave" processes over the network. Currently the "master"
>>>>>>>>>>>>>>>>> creates "slaves" on the same host.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I use 'mpirun --hostfile openmpi.hosts' then processes are spawned
>>>>>>>>>>>>>>>>> over the network as expected. But now I need to spawn processes over
>>>>>>>>>>>>>>>>> the network from my own executable using MPI::Intracomm::Spawn. How
>>>>>>>>>>>>>>>>> can I achieve that?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure from your description exactly what you are trying
>>>>>>>>>>>>>>>> to do,
>>>>>>>>>>>>>>>> nor in
>>>>>>>>>>>>>>>> what environment this is all operating within or what version
>>>>>>>>>>>>>>>> of Open
>>>>>>>>>>>>>>>> MPI
>>>>>>>>>>>>>>>> you are using. Setting aside the environment and version
>>>>>>>>>>>>>>>> issue, I'm
>>>>>>>>>>>>>>>> guessing
>>>>>>>>>>>>>>>> that you are running your executable over some specified set
>>>>>>>>>>>>>>>> of hosts,
>>>>>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> want to provide a different hostfile that specifies the hosts
>>>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>>>> used for
>>>>>>>>>>>>>>>> the "slave" processes. Correct?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If that is correct, then I'm afraid you can't do that in any
>>>>>>>>>>>>>>>> version
>>>>>>>>>>>>>>>> of Open
>>>>>>>>>>>>>>>> MPI today. You have to specify all of the hosts that can be
>>>>>>>>>>>>>>>> used by
>>>>>>>>>>>>>>>> your job
>>>>>>>>>>>>>>>> in the original hostfile. You can then specify a subset of
>>>>>>>>>>>>>>>> those hosts
>>>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>>>> used by your original "master" program, and then specify a
>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>> subset
>>>>>>>>>>>>>>>> to be used by the "slaves" when calling Spawn.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But the system requires that you tell it -all- of the hosts
>>>>>>>>>>>>>>>> that are
>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>> to be used at the beginning of the job.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> At the moment, there is no plan to remove that requirement,
>>>>>>>>>>>>>>>> though
>>>>>>>>>>>>>>>> there has
>>>>>>>>>>>>>>>> been occasional discussion about doing so at some point in the
>>>>>>>>>>>>>>>> future.
>>>>>>>>>>>>>>>> No
>>>>>>>>>>>>>>>> promises that it will happen, though - managed environments, in
>>>>>>>>>>>>>>>> particular,
>>>>>>>>>>>>>>>> currently object to the idea of changing the allocation
>>>>>>>>>>>>>>>> on-the-fly. We
>>>>>>>>>>>>>>>> may,
>>>>>>>>>>>>>>>> though, make a provision for purely hostfile-based
>>>>>>>>>>>>>>>> environments (i.e.,
>>>>>>>>>>>>>>>> unmanaged) at some time in the future.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks in advance for any help.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Elena