Re: [OMPI users] application with mxm hangs on startup

2012-08-22 Thread Pavel Mezentsev
I've tried to launch the application on nodes with QDR InfiniBand. The
first attempt, with 2 processes, worked, but the following was printed to
the output:
[1345633953.436676] [b01:2523 :0] mpool.c:99   MXM ERROR Invalid mempool parameter(s)
[1345633953.436676] [b01:2522 :0] mpool.c:99   MXM ERROR Invalid mempool parameter(s)
--
MXM was unable to create an endpoint. Please make sure that the network link
is active on the node and the hardware is functioning.

  Error: Invalid parameter

--

The results from this launch didn't differ from the results of the launch
without MXM.

Then I tried to launch it with 256 processes, but got the same message
from each process, and then the application crashed. Since then I've been
observing the same behavior as with FDR: the application hangs at
startup.

Best regards, Pavel Mezentsev.


2012/8/22 Pavel Mezentsev 

> Hello!
>
> I've built Open MPI 1.6.1rc3 with MXM support, but when I try to launch
> an application using this MTL, it hangs and I can't figure out why.
>
> If I launch it with np below 128, then everything works fine, since MXM
> isn't used. I've tried setting the threshold to 0 and launching 2 processes,
> with the same result: it hangs on startup.
> What could be causing this problem?
>
> Here is the command I execute:
> /opt/openmpi/1.6.1/mxm-test/bin/mpirun \
> -np $NP \
> -hostfile hosts_fdr2 \
> --mca mtl mxm \
> --mca btl ^tcp \
> --mca mtl_mxm_np 0 \
> -x OMP_NUM_THREADS=$NT \
> -x LD_LIBRARY_PATH \
> --bind-to-core \
> -npernode 16 \
> --mca coll_fca_np 0 -mca coll_fca_enable 0 \
> ./IMB-MPI1 -npmin $NP Allreduce Reduce Barrier Bcast
> Allgather Allgatherv
>
> I'm performing the tests on nodes with Intel Sandy Bridge processors and FDR
> InfiniBand. Open MPI was configured with the following parameters:
> CC=icc CXX=icpc F77=ifort FC=ifort ./configure
> --prefix=/opt/openmpi/1.6.1rc3/mxm-test --with-mxm=/opt/mellanox/mxm
> --with-fca=/opt/mellanox/fca --with-knem=/usr/share/knem
> I'm using the latest OFED from Mellanox (1.5.3-3.1.0) on CentOS 6.1 with the
> default kernel (2.6.32-131.0.15).
> The compilation with the default MXM (1.0.601) failed, so I installed the
> latest version from Mellanox: 1.1.1227.
>
> Best regards, Pavel Mezentsev.
>


Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-08-22 Thread Ralph Castain
Sure, that's still true on all 1.3 or above releases. All you need to do is set 
the hostfile envar so we pick it up:

OMPI_MCA_orte_default_hostfile=
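
For illustration, a minimal sketch of the master side. The file name,
hostfile path, and ./childexe are hypothetical (not from this thread); it
assumes the hostfile named in the environment variable lists the remote
nodes and that ./childexe exists on each of them:

// master.cpp - hypothetical sketch, not code from this thread.
// Build:  mpicxx master.cpp -o my_master
// Run:    OMPI_MCA_orte_default_hostfile=/path/to/hostfile ./my_master
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI::Init(argc, argv);

    // No "host" info key is given, so the two spawned children are mapped
    // onto whatever hosts the default hostfile provides.
    MPI::Intercomm children =
        MPI::COMM_WORLD.Spawn("./childexe", MPI::ARGV_NULL, 2,
                              MPI::INFO_NULL, 0);

    std::cout << "spawned " << children.Get_remote_size()
              << " child processes" << std::endl;

    children.Free();
    MPI::Finalize();
    return 0;
}

The master is started without mpirun; as a singleton it picks up the
hostfile from the environment variable.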


On Aug 21, 2012, at 7:23 PM, Brian Budge  wrote:

> Hi.  I know this is an old thread, but I'm curious if there are any
> tutorials describing how to set this up?  Is this still available on
> newer open mpi versions?
> 
> Thanks,
>  Brian
> 
> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain  wrote:
>> Hi Elena
>> 
>> I'm copying this to the user list just to correct a mis-statement on my part
>> in an earlier message that went there. I had stated that a singleton could
>> comm_spawn onto other nodes listed in a hostfile by setting an environmental
>> variable that pointed us to the hostfile.
>> 
>> This is incorrect in the 1.2 code series. That series does not allow
>> singletons to read a hostfile at all. Hence, any comm_spawn done by a
>> singleton can only launch child processes on the singleton's local host.
>> 
>> This situation has been corrected for the upcoming 1.3 code series. For the
>> 1.2 series, though, you will have to do it via an mpirun command line.
>> 
>> Sorry for the confusion - I sometimes have too many code families to keep
>> straight in this old mind!
>> 
>> Ralph
>> 
>> 
>> On 1/4/08 5:10 AM, "Elena Zhebel"  wrote:
>> 
>>> Hello Ralph,
>>> 
>>> Thank you very much for the explanations.
>>> But I still do not get it running...
>>> 
>>> For the case
>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>> everything works.
>>> 
>>> For the case
>>> ./my_master.exe
>>> it does not.
>>> 
>>> I did:
>>> - create my_hostfile and put it in the $HOME/.openmpi/components/
>>>  my_hostfile :
>>> bollenstreek slots=2 max_slots=3
>>> octocore01 slots=8  max_slots=8
>>> octocore02 slots=8  max_slots=8
>>> clstr000 slots=2 max_slots=3
>>> clstr001 slots=2 max_slots=3
>>> clstr002 slots=2 max_slots=3
>>> clstr003 slots=2 max_slots=3
>>> clstr004 slots=2 max_slots=3
>>> clstr005 slots=2 max_slots=3
>>> clstr006 slots=2 max_slots=3
>>> clstr007 slots=2 max_slots=3
>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I  put it in .tcshrc and
>>> then source .tcshrc)
>>> - in my_master.cpp I did
>>>  MPI_Info info1;
>>>  MPI_Info_create(&info1);
>>>  char* hostname =
>>> "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>  MPI_Info_set(info1, "host", hostname);
>>> 
>>>  _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
>>> MPI_ERRCODES_IGNORE);
>>> 
>>> - After I call the executable, I've got this error message
>>> 
>>> bollenstreek: > ./my_master
>>> number of processes to run: 1
>>> --
>>> Some of the requested hosts are not included in the current allocation for
>>> the application:
>>>  ./childexe
>>> The requested hosts were:
>>>  clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>> 
>>> Verify that you have mapped the allocated resources properly using the
>>> --host specification.
>>> --
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> base/rmaps_base_support_fns.c at line 225
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> rmaps_rr.c at line 478
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> base/rmaps_base_map_job.c at line 210
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> rmgr_urm.c at line 372
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> communicator/comm_dyn.c at line 608
>>> 
>>> Did I miss something?
>>> Thanks for help!
>>> 
>>> Elena
>>> 
>>> 
>>> -Original Message-
>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>> To: Elena Zhebel; Open MPI Users 
>>> Cc: Ralph H Castain
>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>> 
>>> 
>>> 
>>> 
>>> On 12/18/07 7:35 AM, "Elena Zhebel"  wrote:
>>> 
 Thanks a lot! Now it works!
 The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass
>>> MPI_Info
 Key to the Spawn function!
 
 One more question: is it necessary to start my "master" program with
 mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe ?
>>> 
>>> No, it isn't necessary - assuming that my_master_host is the first host
>>> listed in your hostfile! If you are only executing one my_master.exe (i.e.,
>>> you gave -n 1 to mpirun), then we will automatically map that process onto
>>> the first host in your hostfile.
>>> 
>>> If you want my_master.exe to go on someone other than the first host in the
>>> file, then you have to give us the -host option.
>>> 
 
 Are there other possibilities for easy start?
 I would say just to run ./my_master.exe , but then the master process
>>> doesn't
 know about the available in the net

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-08-22 Thread Brian Budge
Okay.  Is there a tutorial or FAQ for setting everything up?  Or is it
really just that simple?  I don't need to run a copy of the orte
server somewhere?

If my current IP is 192.168.0.1,

0 > echo 192.168.0.11 > /tmp/hostfile
1 > echo 192.168.0.12 >> /tmp/hostfile
2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
3 > ./mySpawningExe

At this point, mySpawningExe will be the master, running on
192.168.0.1, and I can have spawned, for example, childExe on
192.168.0.11 and 192.168.0.12?  Or childExe1 on 192.168.0.11 and
childExe2 on 192.168.0.12?

Thanks for the help.

  Brian

On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain  wrote:
> Sure, that's still true on all 1.3 or above releases. All you need to do is 
> set the hostfile envar so we pick it up:
>
> OMPI_MCA_orte_default_hostfile=
>
>
> On Aug 21, 2012, at 7:23 PM, Brian Budge  wrote:
>
>> Hi.  I know this is an old thread, but I'm curious if there are any
>> tutorials describing how to set this up?  Is this still available on
>> newer open mpi versions?
>>
>> Thanks,
>>  Brian
>>
>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain  wrote:
>>> Hi Elena
>>>
>>> I'm copying this to the user list just to correct a mis-statement on my part
>>> in an earlier message that went there. I had stated that a singleton could
>>> comm_spawn onto other nodes listed in a hostfile by setting an environmental
>>> variable that pointed us to the hostfile.
>>>
>>> This is incorrect in the 1.2 code series. That series does not allow
>>> singletons to read a hostfile at all. Hence, any comm_spawn done by a
>>> singleton can only launch child processes on the singleton's local host.
>>>
>>> This situation has been corrected for the upcoming 1.3 code series. For the
>>> 1.2 series, though, you will have to do it via an mpirun command line.
>>>
>>> Sorry for the confusion - I sometimes have too many code families to keep
>>> straight in this old mind!
>>>
>>> Ralph
>>>
>>>
>>> On 1/4/08 5:10 AM, "Elena Zhebel"  wrote:
>>>
 Hello Ralph,

 Thank you very much for the explanations.
 But I still do not get it running...

 For the case
 mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
 everything works.

 For the case
 ./my_master.exe
 it does not.

 I did:
 - create my_hostfile and put it in the $HOME/.openmpi/components/
  my_hostfile :
 bollenstreek slots=2 max_slots=3
 octocore01 slots=8  max_slots=8
 octocore02 slots=8  max_slots=8
 clstr000 slots=2 max_slots=3
 clstr001 slots=2 max_slots=3
 clstr002 slots=2 max_slots=3
 clstr003 slots=2 max_slots=3
 clstr004 slots=2 max_slots=3
 clstr005 slots=2 max_slots=3
 clstr006 slots=2 max_slots=3
 clstr007 slots=2 max_slots=3
 - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I  put it in .tcshrc and
 then source .tcshrc)
 - in my_master.cpp I did
  MPI_Info info1;
  MPI_Info_create(&info1);
  char* hostname =
 "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
  MPI_Info_set(info1, "host", hostname);

  _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
 MPI_ERRCODES_IGNORE);

 - After I call the executable, I've got this error message

 bollenstreek: > ./my_master
 number of processes to run: 1
 --
 Some of the requested hosts are not included in the current allocation for
 the application:
  ./childexe
 The requested hosts were:
  clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02

 Verify that you have mapped the allocated resources properly using the
 --host specification.
 --
 [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
 base/rmaps_base_support_fns.c at line 225
 [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
 rmaps_rr.c at line 478
 [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
 base/rmaps_base_map_job.c at line 210
 [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
 rmgr_urm.c at line 372
 [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
 communicator/comm_dyn.c at line 608

 Did I miss something?
 Thanks for help!

 Elena


 -Original Message-
 From: Ralph H Castain [mailto:r...@lanl.gov]
 Sent: Tuesday, December 18, 2007 3:50 PM
 To: Elena Zhebel; Open MPI Users 
 Cc: Ralph H Castain
 Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration




 On 12/18/07 7:35 AM, "Elena Zhebel"  wrote:

> Thanks a lot! Now it works!
> The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass
 MPI_Info
> Key to the Spawn function!
>
> One

Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration

2012-08-22 Thread Ralph Castain
It really is just that simple :-)
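
To make the childExe1 / childExe2 case concrete, a hedged sketch follows.
The file name is hypothetical, and it assumes 192.168.0.11 and 192.168.0.12
are listed in the hostfile that OMPI_MCA_orte_default_hostfile points to:

// mySpawningExe.cpp - hypothetical sketch, not code from this thread.
// Started as a singleton:  ./mySpawningExe
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI::Init(argc, argv);

    // One Info object per child executable, pinning each one to a host.
    // The requested hosts must also appear in the default hostfile.
    MPI::Info info1 = MPI::Info::Create();
    info1.Set("host", "192.168.0.11");
    MPI::Info info2 = MPI::Info::Create();
    info2.Set("host", "192.168.0.12");

    const char*     commands[2] = { "./childExe1", "./childExe2" };
    const int       maxprocs[2] = { 1, 1 };
    const MPI::Info infos[2]    = { info1, info2 };

    // Spawn both executables in one call; they end up in a single
    // intercommunicator with the parent.
    MPI::Intercomm children =
        MPI::COMM_WORLD.Spawn_multiple(2, commands, MPI::ARGVS_NULL,
                                       maxprocs, infos, 0);

    std::cout << "spawned " << children.Get_remote_size()
              << " children" << std::endl;

    children.Free();
    info1.Free();
    info2.Free();
    MPI::Finalize();
    return 0;
}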

On Aug 22, 2012, at 8:56 AM, Brian Budge  wrote:

> Okay.  Is there a tutorial or FAQ for setting everything up?  Or is it
> really just that simple?  I don't need to run a copy of the orte
> server somewhere?
> 
> if my current ip is 192.168.0.1,
> 
> 0 > echo 192.168.0.11 > /tmp/hostfile
> 1 > echo 192.168.0.12 >> /tmp/hostfile
> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
> 3 > ./mySpawningExe
> 
> At this point, mySpawningExe will be the master, running on
> 192.168.0.1, and I can have spawned, for example, childExe on
> 192.168.0.11 and 192.168.0.12?  Or childExe1 on 192.168.0.11 and
> childExe2 on 192.168.0.12?
> 
> Thanks for the help.
> 
>  Brian
> 
> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain  wrote:
>> Sure, that's still true on all 1.3 or above releases. All you need to do is 
>> set the hostfile envar so we pick it up:
>> 
>> OMPI_MCA_orte_default_hostfile=
>> 
>> 
>> On Aug 21, 2012, at 7:23 PM, Brian Budge  wrote:
>> 
>>> Hi.  I know this is an old thread, but I'm curious if there are any
>>> tutorials describing how to set this up?  Is this still available on
>>> newer open mpi versions?
>>> 
>>> Thanks,
>>> Brian
>>> 
>>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain  wrote:
 Hi Elena
 
 I'm copying this to the user list just to correct a mis-statement on my 
 part
 in an earlier message that went there. I had stated that a singleton could
 comm_spawn onto other nodes listed in a hostfile by setting an 
 environmental
 variable that pointed us to the hostfile.
 
 This is incorrect in the 1.2 code series. That series does not allow
 singletons to read a hostfile at all. Hence, any comm_spawn done by a
 singleton can only launch child processes on the singleton's local host.
 
 This situation has been corrected for the upcoming 1.3 code series. For the
 1.2 series, though, you will have to do it via an mpirun command line.
 
 Sorry for the confusion - I sometimes have too many code families to keep
 straight in this old mind!
 
 Ralph
 
 
 On 1/4/08 5:10 AM, "Elena Zhebel"  wrote:
 
> Hello Ralph,
> 
> Thank you very much for the explanations.
> But I still do not get it running...
> 
> For the case
> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
> everything works.
> 
> For the case
> ./my_master.exe
> it does not.
> 
> I did:
> - create my_hostfile and put it in the $HOME/.openmpi/components/
> my_hostfile :
> bollenstreek slots=2 max_slots=3
> octocore01 slots=8  max_slots=8
> octocore02 slots=8  max_slots=8
> clstr000 slots=2 max_slots=3
> clstr001 slots=2 max_slots=3
> clstr002 slots=2 max_slots=3
> clstr003 slots=2 max_slots=3
> clstr004 slots=2 max_slots=3
> clstr005 slots=2 max_slots=3
> clstr006 slots=2 max_slots=3
> clstr007 slots=2 max_slots=3
> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I  put it in .tcshrc and
> then source .tcshrc)
> - in my_master.cpp I did
> MPI_Info info1;
> MPI_Info_create(&info1);
> char* hostname =
> "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
> MPI_Info_set(info1, "host", hostname);
> 
> _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
> MPI_ERRCODES_IGNORE);
> 
> - After I call the executable, I've got this error message
> 
> bollenstreek: > ./my_master
> number of processes to run: 1
> --
> Some of the requested hosts are not included in the current allocation for
> the application:
> ./childexe
> The requested hosts were:
> clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
> 
> Verify that you have mapped the allocated resources properly using the
> --host specification.
> --
> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> base/rmaps_base_support_fns.c at line 225
> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> rmaps_rr.c at line 478
> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> base/rmaps_base_map_job.c at line 210
> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> rmgr_urm.c at line 372
> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
> communicator/comm_dyn.c at line 608
> 
> Did I miss something?
> Thanks for help!
> 
> Elena
> 
> 
> -Original Message-
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, December 18, 2007 3:50 PM
> To: Elena Zhebel; Open MPI Users 
> Cc: Ralph H Castain
> Subject: Re: [OMPI users] MPI::Intracomm::Spawn an