On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 30.01.2009, at 15:02, Sangamesh B wrote:
>
>> Dear Open MPI,
>>
>> Do you have a solution for the following problem of Open MPI (1.3)
>> when run through Grid Engine?
>>
>> I changed the global execd_params to H_MEMORYLOCKED=infinity and
>> restarted sgeexecd on all nodes.
>>
>> But the problem still persists:
>>
>> $ cat err.77.CPMD-OMPI
>> ssh_exchange_identification: Connection closed by remote host
>
> I think this might already be the reason why it's not working. Is an
> mpihello program running fine through SGE?

No.
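For reference, a minimal test job along the following lines keeps the launch path separate from any real application; the PE name "orte", the 16-slot request, the Open MPI install path and the names hello-ompi.sh / mpihello are assumptions carried over from this thread, so adjust them to your site:

$ cat hello-ompi.sh
#!/bin/bash
#$ -N Hello-OMPI
#$ -cwd
#$ -j y
#$ -pe orte 16
# With tight integration and an Open MPI built --with-sge, mpirun takes
# the host list from SGE itself, so no -hostfile/-machinefile is passed.
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS ./mpihello

$ qsub hello-ompi.sh

If even this fails across two nodes with the same ssh_exchange_identification error, the problem is in starting the remote orted daemons, not in the application itself.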
Any Open MPI parallel job through SGE runs only if it runs on a single
node (i.e. 8 processes on the 8 cores of one node). If the number of
processes is more than 8, SGE schedules the job on 2 nodes and it fails
with the above error.

Now I did a loose integration of Open MPI 1.3 with SGE. The job runs,
but all 16 processes run on a single node.

$ cat out.83.Hello-OMPI
/opt/gridengine/default/spool/node-0-17/active_jobs/83.1/pe_hostfile
ibc17
ibc17
ibc17
ibc17
ibc17
ibc17
ibc17
ibc17
ibc12
ibc12
ibc12
ibc12
ibc12
ibc12
ibc12
ibc12
Greetings: 1 of 16 from the node node-0-17.local
Greetings: 10 of 16 from the node node-0-17.local
Greetings: 15 of 16 from the node node-0-17.local
Greetings: 9 of 16 from the node node-0-17.local
Greetings: 14 of 16 from the node node-0-17.local
Greetings: 8 of 16 from the node node-0-17.local
Greetings: 11 of 16 from the node node-0-17.local
Greetings: 12 of 16 from the node node-0-17.local
Greetings: 6 of 16 from the node node-0-17.local
Greetings: 0 of 16 from the node node-0-17.local
Greetings: 5 of 16 from the node node-0-17.local
Greetings: 3 of 16 from the node node-0-17.local
Greetings: 13 of 16 from the node node-0-17.local
Greetings: 4 of 16 from the node node-0-17.local
Greetings: 7 of 16 from the node node-0-17.local
Greetings: 2 of 16 from the node node-0-17.local

But qhost -u <user name> shows that it is scheduled/running on two nodes.

Is anybody successful in running Open MPI 1.3 tightly integrated with SGE?

Thanks,
Sangamesh

> -- Reuti
>
>
>> --------------------------------------------------------------------------
>> A daemon (pid 31947) died unexpectedly with status 129 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> ssh_exchange_identification: Connection closed by remote host
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> node-0-19.local - daemon did not report back when launched
>> node-0-20.local - daemon did not report back when launched
>> node-0-21.local - daemon did not report back when launched
>> node-0-22.local - daemon did not report back when launched
>>
>> The hostnames of the InfiniBand interfaces are ibc0, ibc1, ibc2 .. ibc23.
>> Maybe Open MPI is not able to identify the hosts, as it is using the
>> node-0-.. names. Is this causing Open MPI to fail?
>>
>> Thanks,
>> Sangamesh
>>
>>
>> On Mon, Jan 26, 2009 at 5:09 PM, mihlon <vacl...@fel.cvut.cz> wrote:
>>>
>>> Hi,
>>>
>>>> Hello SGE users,
>>>>
>>>> The cluster is installed with Rocks-4.3, SGE 6.0 & Open MPI 1.3.
>>>> Open MPI is configured with "--with-sge".
>>>> ompi_info shows only one component:
>>>>
>>>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>>>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>>>
>>>> Is this acceptable?
>>>
>>> maybe yes
>>>
>>> see: http://www.open-mpi.org/faq/?category=building#build-rte-sge
>>>
>>> shell$ ompi_info | grep gridengine
>>>                  MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.3)
>>>                  MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.3)
>>>
>>> (Specific frameworks and version numbers may vary, depending on your
>>> version of Open MPI.)
>>>
>>>> The Open MPI parallel jobs run successfully from the command line, but
>>>> fail when run through SGE (with -pe orte <slots>).
>>>>
>>>> The error is:
>>>>
>>>> $ cat err.26.Helloworld-PRL
>>>> ssh_exchange_identification: Connection closed by remote host
>>>>
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>>>> to launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>>> the location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> mpirun: clean termination accomplished
>>>>
>>>> But the same job runs well if it runs on a single node, though with an
>>>> error:
>>>>
>>>> $ cat err.23.Helloworld-PRL
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>>
>>>> --------------------------------------------------------------------------
>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>
>>>>   Local host:   node-0-4.local
>>>>   Local device: mthca0
>>>> --------------------------------------------------------------------------
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>>>>     This will severely limit memory registrations.
>>>> [node-0-4.local:07869] 7 more processes have sent help message
>>>> help-mpi-btl-openib.txt / error in device init
>>>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to
>>>> 0 to see all help / error messages
>>>>
>>>> The following link describes the same problem:
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=72398
>>>>
>>>> Following that reference, I put 'ulimit -l unlimited' into
>>>> /etc/init.d/sgeexecd on all nodes and restarted the services.
>>>
>>> Do not set 'ulimit -l unlimited' in /etc/init.d/sgeexecd;
>>> set it in SGE instead.
>>>
>>> Run qconf -mconf and set execd_params:
>>>
>>> frontend$> qconf -sconf
>>> ...
>>> execd_params                 H_MEMORYLOCKED=infinity
>>> ...
>>>
>>> Then restart sgeexecd on all your exec hosts.
>>>
>>> Milan
>>>
>>>> But still the problem persists.
>>>>
>>>> What could be the way out for this?
>>>>
>>>> Thanks,
>>>> Sangamesh
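On the tight-integration question above: with an Open MPI built --with-sge, the remaining piece is usually the parallel environment definition in SGE. The following is only a sketch of a commonly used setup (the PE name "orte" is taken from this thread; the individual values are typical choices, not a verified dump from this cluster):

frontend$> qconf -sp orte
pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

control_slaves TRUE is what lets mpirun start its remote daemons through qrsh -inherit instead of plain ssh; once that path is taken, the per-node process placement should match the pe_hostfile shown above.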
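And to confirm that H_MEMORYLOCKED=infinity actually reaches the processes started by sge_execd (rather than only interactive shells), the limit can be printed from inside a batch job; the script name check-memlock.sh is made up for illustration:

$ cat check-memlock.sh
#!/bin/bash
#$ -cwd
#$ -j y
# Print the max-locked-memory limit as seen by a process spawned by sge_execd.
ulimit -l

$ qsub check-memlock.sh

If the job's output file still shows a small value instead of "unlimited", the execd on that node has not picked up the new execd_params, and the libibverbs RLIMIT_MEMLOCK warnings above will keep appearing.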