Re: [OMPI users] application with mxm hangs on startup
I've tried to launch the application on nodes with QDR InfiniBand. The first
attempt, with 2 processes, worked, but the following was printed to the output:

[1345633953.436676] [b01:2523 :0]  mpool.c:99  MXM ERROR Invalid mempool parameter(s)
[1345633953.436676] [b01:2522 :0]  mpool.c:99  MXM ERROR Invalid mempool parameter(s)
--
MXM was unable to create an endpoint. Please make sure that the network link
is active on the node and the hardware is functioning.

  Error: Invalid parameter
--

The results from this launch didn't differ from the results of the launch
without MXM. Then I tried to launch it with 256 processes, but I got the same
message from each process and the application crashed. After that I observe
the same behavior as with FDR: the application hangs at startup.

Best regards,
Pavel Mezentsev.

2012/8/22 Pavel Mezentsev
> Hello!
>
> I've built Open MPI 1.6.1rc3 with support for MXM, but when I try to launch
> an application using this MTL, it hangs and I can't figure out why.
>
> If I launch it with np below 128, everything works fine, since MXM isn't
> used. I've tried setting the threshold to 0 and launching 2 processes, with
> the same result: it hangs on startup.
> What could be causing this problem?
>
> Here is the command I execute:
> /opt/openmpi/1.6.1/mxm-test/bin/mpirun \
>   -np $NP \
>   -hostfile hosts_fdr2 \
>   --mca mtl mxm \
>   --mca btl ^tcp \
>   --mca mtl_mxm_np 0 \
>   -x OMP_NUM_THREADS=$NT \
>   -x LD_LIBRARY_PATH \
>   --bind-to-core \
>   -npernode 16 \
>   --mca coll_fca_np 0 -mca coll_fca_enable 0 \
>   ./IMB-MPI1 -npmin $NP Allreduce Reduce Barrier Bcast Allgather Allgatherv
>
> I'm performing the tests on nodes with Intel SB processors and FDR.
> Open MPI was configured with the following parameters:
> CC=icc CXX=icpc F77=ifort FC=ifort ./configure
>   --prefix=/opt/openmpi/1.6.1rc3/mxm-test --with-mxm=/opt/mellanox/mxm
>   --with-fca=/opt/mellanox/fca --with-knem=/usr/share/knem
>
> I'm using the latest OFED from Mellanox (1.5.3-3.1.0) on CentOS 6.1 with the
> default kernel (2.6.32-131.0.15). The compilation against the default MXM
> (1.0.601) failed, so I installed the latest version from Mellanox: 1.1.1227.
>
> Best regards, Pavel Mezentsev.
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Sure, that's still true on all 1.3 or above releases. All you need to do is
set the hostfile envar so we pick it up:

OMPI_MCA_orte_default_hostfile=

On Aug 21, 2012, at 7:23 PM, Brian Budge wrote:

> Hi. I know this is an old thread, but I'm curious if there are any
> tutorials describing how to set this up? Is this still available on
> newer open mpi versions?
>
> Thanks,
>   Brian
>
> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain wrote:
>> Hi Elena
>>
>> I'm copying this to the user list just to correct a mis-statement on my
>> part in an earlier message that went there. I had stated that a singleton
>> could comm_spawn onto other nodes listed in a hostfile by setting an
>> environmental variable that pointed us to the hostfile.
>>
>> This is incorrect in the 1.2 code series. That series does not allow
>> singletons to read a hostfile at all. Hence, any comm_spawn done by a
>> singleton can only launch child processes on the singleton's local host.
>>
>> This situation has been corrected for the upcoming 1.3 code series. For
>> the 1.2 series, though, you will have to do it via an mpirun command line.
>>
>> Sorry for the confusion - I sometimes have too many code families to keep
>> straight in this old mind!
>>
>> Ralph
>>
>> On 1/4/08 5:10 AM, "Elena Zhebel" wrote:
>>
>>> Hello Ralph,
>>>
>>> Thank you very much for the explanations.
>>> But I still do not get it running...
>>>
>>> For the case
>>>   mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>> everything works.
>>>
>>> For the case
>>>   ./my_master.exe
>>> it does not.
>>>
>>> I did:
>>> - create my_hostfile and put it in the $HOME/.openmpi/components/
>>>   my_hostfile:
>>>     bollenstreek slots=2 max_slots=3
>>>     octocore01 slots=8 max_slots=8
>>>     octocore02 slots=8 max_slots=8
>>>     clstr000 slots=2 max_slots=3
>>>     clstr001 slots=2 max_slots=3
>>>     clstr002 slots=2 max_slots=3
>>>     clstr003 slots=2 max_slots=3
>>>     clstr004 slots=2 max_slots=3
>>>     clstr005 slots=2 max_slots=3
>>>     clstr006 slots=2 max_slots=3
>>>     clstr007 slots=2 max_slots=3
>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I put it in .tcshrc and
>>>   then source .tcshrc)
>>> - in my_master.cpp I did
>>>     MPI_Info info1;
>>>     MPI_Info_create(&info1);
>>>     char* hostname =
>>>       "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>     MPI_Info_set(info1, "host", hostname);
>>>
>>>     _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0,
>>>                                  MPI_ERRCODES_IGNORE);
>>>
>>> - After I call the executable, I've got this error message
>>>
>>> bollenstreek: > ./my_master
>>> number of processes to run: 1
>>> --
>>> Some of the requested hosts are not included in the current allocation
>>> for the application:
>>>   ./childexe
>>> The requested hosts were:
>>>   clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>>
>>> Verify that you have mapped the allocated resources properly using the
>>> --host specification.
>>> --
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> base/rmaps_base_support_fns.c at line 225
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> rmaps_rr.c at line 478
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> base/rmaps_base_map_job.c at line 210
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> rmgr_urm.c at line 372
>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file
>>> communicator/comm_dyn.c at line 608
>>>
>>> Did I miss something?
>>> Thanks for help!
>>>
>>> Elena
>>>
>>> -----Original Message-----
>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>> To: Elena Zhebel; Open MPI Users
>>> Cc: Ralph H Castain
>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>
>>> On 12/18/07 7:35 AM, "Elena Zhebel" wrote:
>>>
>>>> Thanks a lot! Now it works!
>>>> The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass
>>>> MPI_Info Key to the Spawn function!
>>>>
>>>> One more question: is it necessary to start my "master" program with
>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe ?
>>>
>>> No, it isn't necessary - assuming that my_master_host is the first host
>>> listed in your hostfile! If you are only executing one my_master.exe
>>> (i.e., you gave -n 1 to mpirun), then we will automatically map that
>>> process onto the first host in your hostfile.
>>>
>>> If you want my_master.exe to go on someone other than the first host in
>>> the file, then you have to give us the -host option.
>>>
>>>> Are there other possibilities for easy start? I would say just to run
>>>> ./my_master.exe, but then the master process doesn't know about the
>>>> available in the net
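For reference, the working recipe in this thread combines two things: start
the master under "mpirun -n 1 -hostfile my_hostfile ..." and pass the target
hosts to Spawn through an MPI_Info "host" key whose entries all appear in that
hostfile. The quoted my_master.cpp uses the C++ bindings; a rough C-API
equivalent is sketched below. The child count, file names, and main()
scaffolding are assumptions for illustration, not code from the original
posters.

    /* master.c - sketch only; launch as: mpirun -n 1 -hostfile my_hostfile ./master */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm intercomm;
        MPI_Info info;
        int      nchildren = 4;          /* assumed child count */

        MPI_Init(&argc, &argv);

        /* Same info key as in the quoted my_master.cpp: the listed hosts
         * should also be present in the hostfile handed to mpirun, otherwise
         * the "not included in the current allocation" error above appears. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host",
                     "clstr002,clstr003,clstr005,clstr006,clstr007,"
                     "octocore01,octocore02");

        /* Spawn the children; this process (rank 0 of MPI_COMM_SELF) is root. */
        MPI_Comm_spawn("./childexe", MPI_ARGV_NULL, nchildren, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        /* ... communicate with the children over 'intercomm' ... */
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }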
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
Okay.  Is there a tutorial or FAQ for setting everything up?  Or is it
really just that simple?  I don't need to run a copy of the orte server
somewhere?

if my current ip is 192.168.0.1,

0 > echo 192.168.0.11 >  /tmp/hostfile
1 > echo 192.168.0.12 >> /tmp/hostfile
2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
3 > ./mySpawningExe

At this point, mySpawningExe will be the master, running on 192.168.0.1, and
I can have spawned, for example, childExe on 192.168.0.11 and 192.168.0.12?
Or childExe1 on 192.168.0.11 and childExe2 on 192.168.0.12?

Thanks for the help.
  Brian

On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain wrote:
> Sure, that's still true on all 1.3 or above releases. All you need to do
> is set the hostfile envar so we pick it up:
>
> OMPI_MCA_orte_default_hostfile=
>
> On Aug 21, 2012, at 7:23 PM, Brian Budge wrote:
>
>> Hi. I know this is an old thread, but I'm curious if there are any
>> tutorials describing how to set this up? Is this still available on
>> newer open mpi versions?
>>
>> Thanks,
>>   Brian
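A rough sketch of the spawning side of the workflow above, in plain C: the
master is started as a singleton (plain ./mySpawningExe, no mpirun), relies on
OMPI_MCA_orte_default_hostfile for the allocation, and spawns two copies of
childExe with no extra info keys. The child count of 2 and the assumption that
the runtime then maps those ranks onto the hostfile entries come from this
thread, not from testing against a particular release.

    /* mySpawningExe.c - sketch only; names taken from the commands above. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm children;

        MPI_Init(&argc, &argv);   /* started as a singleton: ./mySpawningExe */

        /* With OMPI_MCA_orte_default_hostfile=/tmp/hostfile in the
         * environment, the spawned ranks should land on the hosts listed
         * there (192.168.0.11 and 192.168.0.12 in Brian's example).       */
        MPI_Comm_spawn("./childExe", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

        /* ... exchange data with the spawned ranks over 'children' ... */
        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }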
Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
It really is just that simple :-)

On Aug 22, 2012, at 8:56 AM, Brian Budge wrote:

> Okay.  Is there a tutorial or FAQ for setting everything up?  Or is it
> really just that simple?  I don't need to run a copy of the orte server
> somewhere?
>
> if my current ip is 192.168.0.1,
>
> 0 > echo 192.168.0.11 >  /tmp/hostfile
> 1 > echo 192.168.0.12 >> /tmp/hostfile
> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
> 3 > ./mySpawningExe
>
> At this point, mySpawningExe will be the master, running on 192.168.0.1,
> and I can have spawned, for example, childExe on 192.168.0.11 and
> 192.168.0.12?  Or childExe1 on 192.168.0.11 and childExe2 on 192.168.0.12?
>
> Thanks for the help.
>   Brian
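On the last part of Brian's question (childExe1 on 192.168.0.11 and childExe2
on 192.168.0.12), one way to express that from the spawning master is
MPI_Comm_spawn_multiple with a per-binary "host" info key. The sketch below
reuses the binary names and addresses from the question; one rank per binary
and the surrounding scaffolding are assumptions for illustration.

    /* spawn_two.c - sketch only. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        char     *cmds[2]   = { "./childExe1", "./childExe2" };
        int       nprocs[2] = { 1, 1 };
        MPI_Info  infos[2];
        MPI_Comm  children;

        MPI_Init(&argc, &argv);

        /* Pin each app context to one of the hosts in /tmp/hostfile. */
        MPI_Info_create(&infos[0]);
        MPI_Info_set(infos[0], "host", "192.168.0.11");
        MPI_Info_create(&infos[1]);
        MPI_Info_set(infos[1], "host", "192.168.0.12");

        /* One call, two app contexts; all children share one
         * inter-communicator, with childExe1's rank coming first. */
        MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, nprocs, infos, 0,
                                MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&infos[0]);
        MPI_Info_free(&infos[1]);
        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }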